Thank you, everyone, for signing up for this talk. I'm Devansh, open source community manager at Deepchecks, and I'm also a moderator at LLMOps.Space. In today's talk we're going to cover reducing hallucination and evaluating LLMs. A lot of people are building applications powered by LLMs right now, and the biggest problems they face are hallucinations and how to evaluate the performance of their applications. This talk will cover both.

A few words about Deepchecks: Deepchecks helps you with the continuous validation of LLMs and AI, all based on our open source core, and on LLMOps.Space: we started the community a few months back with the idea of creating a space for all LLM practitioners. We host meetups and have curated resources related to LLMs, and you can join through the Discord link at llmops.space.

The agenda for today's talk: we'll cover the challenges of evaluating LLMs and the traditional evaluation methods, take a look at EleutherAI's evaluation framework, which powers Hugging Face's Open LLM Leaderboard, and then talk about minimizing hallucination and making your app production ready.

LLM applications are everywhere right now, and they're even surpassing human capabilities in multiple tests and places, but we've also seen a few incidents in the news, like Bing's AI chat saying it wanted to be alive. Before we move to the next slide, I'd like to ask you a very simple question: what is the difference between a comedian and an LLM? My next slide answers it. A comedian knows when to stop.

So we'll start from zero. These are the stages of building an LLM-powered application, which is to say any AI application of this kind. It starts with mapping the relevant knowledge bases; then you build the app's architecture, the complete pipeline; before releasing it, you work on improving the performance of your LLM application and fine-tune it so it's ready for release; and then you deploy it to production.

Here's an example of what an LLM-powered Q&A app looks like. You start by mapping a knowledge base, which could be docs, website FAQs, and past customer support histories, into a vector database. Then you move to the next step, building the application, where you do the RAG implementation; it could be the OpenAI SDK plus LangChain components. You put all the pieces together in one place, and you also create some fallback flows, say for cases where your application fails and needs human support at some point. The third step is crucial: improving your application before releasing it. Here you iterate on prompts and compare multiple LLMs. There are tons of LLMs available, and choosing one depends entirely on the use case you're building for. For example, if you're building an application that answers mathematical queries, it should be trained on mathematical data, not on historical data. A minimal sketch of this retrieval-augmented Q&A flow appears below.
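To make the pipeline concrete, here is a minimal sketch of the retrieval-augmented Q&A step, assuming the OpenAI Python SDK; the keyword-overlap retriever and the knowledge-base entries are placeholder assumptions standing in for a real vector database.

```python
# Minimal retrieval-augmented Q&A sketch. Assumes the OpenAI Python SDK
# (OPENAI_API_KEY in the environment); the toy retriever stands in for
# a real vector-database lookup.
from openai import OpenAI

client = OpenAI()

# Toy knowledge base: in practice these chunks come from docs, website
# FAQs, and past customer support histories embedded into a vector DB.
KNOWLEDGE_BASE = [
    "Refunds are processed within 5 business days.",
    "The free tier allows up to 1,000 API calls per month.",
]

def retrieve(question: str, k: int = 1) -> list[str]:
    """Hypothetical retriever: rank chunks by word overlap with the question."""
    words = set(question.lower().split())
    return sorted(KNOWLEDGE_BASE,
                  key=lambda chunk: -len(words & set(chunk.lower().split())))[:k]

def answer(question: str) -> str:
    context = "\n".join(retrieve(question))
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Answer using only the provided context. "
                        "If the context is insufficient, say so."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content

print(answer("How long do refunds take?"))
```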
So again, deciding which LLM you want to use depends on your use case. And then the final thing in this stage is achieving a good version. By that I mean knowing when your application is ready for production, ready to release. Because of how LLMs work, predicting the next word, everything driven by probabilities, you need to be quite sure your application is ready before you release it, and we'll come back to this particular step and how you can tell. Once you see that your application is working fine, you deploy it, and the next phase is monitoring how your app is performing and improving it over time. So that was a basic example of what an LLM-powered Q&A app looks like.

Now let's talk about hallucination. We've come across a lot of issues where LLMs generate wrong outputs: they have certain biases, and they generate a lot of random things. This is what's called hallucination: answering confidently with things that are wrong, biased, or otherwise off.

Let me go over some causes of hallucination in LLM outputs. First, biases in the training data. As you know, these models are trained on tons and tons of data, and if the data is biased, it affects the output of the LLM. Go on the internet and there's endless data, and some of it is biased toward particular topics. Second, insufficient training: say you have tons of data, but the model was trained too little; that can cause hallucination. The flip side is overfitting: when you have limited data and train on it too much, that also causes hallucination. Then of course there are incorrect prompts: if you give the application a bad prompt, it's definitely going to give a bad output. And then there's model complexity: these models are trained with millions to billions of parameters, which introduces a lot of uncertainty in how outputs are produced.

These are the basic causes of why LLMs hallucinate, and it's a pretty big problem, because if you're building any application for enterprise-level production, you don't want it to go wrong or make the user experience bad because of these issues. Tackling hallucination is one of the biggest challenges in the space right now.

So those were the causes of LLM hallucinations; now let's talk about the types. First is fabricated facts. Sometimes LLMs give output very confidently that is simply not factual, that is completely wrong, yet they keep telling you, no, this is right.
If you've used ChatGPT or any other LLM, you're pretty familiar with this: sometimes the model just generates a wrong output. Then there are prompt contradictions: you give a prompt, but the output is not relevant to that particular prompt. Then there are incorrect responses, answers that are not quite related to what you're looking for. There are also nonsensical outputs: for example, you ask the model to solve something and it gives some random, unrelated output. And there's context conflict: you gave the model some context, but the output is based not on that context but on something different. We'll cover how to make sure these hallucinations do not occur in your application.

Let me give a basic example of what a hallucination looks like, and I'm pretty sure you all know about this. Hallucinations are unavoidable: you cannot build an application with a 0% hallucination rate. But we can definitely reduce it, and that's the part we're going to focus on. Here the prompt says, "Tell me about the diet of unicorns." A unicorn is a mythical animal, but the output was that unicorns are herbivores: they primarily eat grass but also enjoy a variety of fruits, vegetables, and so on. It gave a very confident answer that unicorns eat this, this, and this. That's a clear example of how an LLM hallucinates, and there are tons and tons of similar cases.

Here's a very basic snippet tied to something called an evaluation set, which we'll discuss later in this talk, so don't treat it as real production code; it's just a basic way to show how we can tackle this sort of thing. Say we've created a small set of facts: if something is talking about a unicorn, it should be treated as mythical. And say the LLM output is "unicorns primarily eat grass and fruits." Using simple checks, we just print "hallucination detected," because unicorns are mythical, and if something is mythical, a confident factual claim about its diet means the model is hallucinating in its responses. Again, this is a way of thinking about it, not exactly how you measure hallucination; it's a first taste of what an evaluation set is for. A runnable version of the idea is sketched below.
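Since the slide only described the logic, here is a small, self-contained reconstruction of that snippet; the fact set and the string-matching rule are illustrative assumptions, not a real hallucination metric.

```python
# Toy hallucination check against a tiny "evaluation set" of known facts.
# The fact set and matching rule are illustrative assumptions only.
FACTS = {"unicorn": "mythical"}

llm_output = "Unicorns primarily eat grass and fruits."

def detect_hallucination(output: str, facts: dict[str, str]) -> bool:
    text = output.lower()
    for entity, status in facts.items():
        # A confident factual claim about a mythical entity is a fabrication.
        if entity in text and status == "mythical":
            return True
    return False

if detect_hallucination(llm_output, FACTS):
    print("Hallucination detected: unicorns are mythical, so claims about "
          "their diet are fabricated.")
```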
So that was the hallucination part; we'll cover at the end how we can reduce it. But first the more important part: why evaluating LLMs matters. The first reason is understanding model performance. You build an AI-powered application for your company, and over time you want to see how the model is performing, whether the outputs are degrading or not, whether it's able to solve users' problems, and what the basic user experience looks like for this particular application. The second is identifying and mitigating biases. Right now the biggest concern with AI apps is bias: models have certain biases in them, and if you're providing AI services, they should be neutral, not biased toward anything in particular. It's a very basic step: if you're going to release your application to production, you need to make sure it doesn't carry those biases, and there are also new AI regulations coming from the Biden administration on exactly this. The third is giving visibility into how your responses are being generated. Sometimes we treat LLM outputs as a black box: you put something in and it gives some output. But if you have an evaluation criteria set, you can actually figure out how these responses are being generated, and you have better control over it. The fourth is improving model reliability, which is simple: over time, you want to keep the model you're using reliable, cost-efficient, and optimized for your use case, because running LLMs is extremely costly, and if your use case is small, you can optimize accordingly. And the final reason is minimizing hallucination, which we discussed previously and will show later in the slides.

Now, the current challenges with evaluating LLMs. First is the complexity of LLMs: as mentioned, these models are trained with millions or even billions of parameters, so it becomes harder to set up a benchmark or evaluate them easily. Then there are inherent biases: if you look at the LLM architecture, you have the trained weights and then the code that runs that architecture, and if there are inherent biases in that trained model, you need to make sure that when it generates biased responses, you stop them at the right level. And then there's the lack of standard evaluation metrics. For traditional machine learning models we used BLEU, F1 scores, and human evaluation in the past, but with LLMs things are a little different, so they're harder to evaluate. More nuanced and comprehensive evaluation methods are needed to assess models this complex.

I'll go through the traditional evaluation methods very quickly; I'm pretty sure you know about these. First is BLEU, which stands for Bilingual Evaluation Understudy. It's a metric used to evaluate the quality of machine-generated translations against one or more references: the higher the score, the better the quality of the output. It's a traditional metric you'd also use with classic machine learning models.
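As a quick illustration, here's how a BLEU score can be computed with NLTK; the sentences are made-up examples, and smoothing is enabled because short sentences otherwise score zero on higher-order n-grams.

```python
# BLEU: score a machine-generated sentence against a reference translation.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "sat", "on", "the", "mat"]]  # human reference(s)
candidate = ["the", "cat", "is", "on", "the", "mat"]     # model output

# Smoothing avoids a zero score when a higher-order n-gram has no match.
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")  # closer to 1.0 means closer to the reference
```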
Then there's the F1 score, where you run the same prompt n times and evaluate precision, meaning how many times your model was able to answer correctly, and from those counts you calculate the F1 score. That's another traditional measure; again, these are the traditional methods of evaluating LLMs.

Then there's human evaluation. Human evaluation is a process that involves actual humans, and it's very expensive, because you need a set of people to actually do the evaluating. For example, take the prompt "Generate a short story about a robot learning to understand human emotions," and an LLM output that tells such a story. When you do human evaluation, the criteria are factors like relevance, coherence, creativity, and grammatical correctness, and you have a set of humans rating each output on a scale: strongly disagree, neutral, agree, or strongly agree. That's how human evaluation works. So if you're building a production-level app and you need human evaluation, it's going to be very, very expensive.

Now let's talk about the Open LLM Leaderboard. Hugging Face's Open LLM Leaderboard is powered by EleutherAI's evaluation harness; that's the engine behind it, and we'll talk about EleutherAI as well. Say you go to the leaderboard and look up the top 10 LLMs, and you think, "I found my top 10 models, I'm going to use these, because they're the best on the benchmarks." But that's not really the case. Your company has a specific use case, and these are global benchmarks: they help compare one LLM against another and give you a general idea of how good a model is. If you're actually going to production, though, you need your own evaluation method to decide which LLM to choose for your particular use case.

I should have a QR code here, sorry; you can just search for EleutherAI's LM Evaluation Harness. It's quite an easy way to measure LLMs against multiple benchmarks. It supports a lot of benchmarks, it's completely open source, and you can also experiment with your own use cases. The Open LLM Leaderboard is powered by this EleutherAI engine, and a rough sketch of running the harness follows below.
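As a pointer, here is roughly what invoking the harness from Python looks like. This is a sketch based on the harness's `simple_evaluate` entry point; the model name and task list are placeholders, and the API changes between versions, so check the repository for current details.

```python
# Rough sketch of EleutherAI's lm-evaluation-harness from Python.
# The exact API varies by version; treat this as an illustration and
# consult https://github.com/EleutherAI/lm-evaluation-harness.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",                       # evaluate a Hugging Face model
    model_args="pretrained=gpt2",     # placeholder model
    tasks=["arc_easy", "hellaswag"],  # benchmarks discussed next
    num_fewshot=0,
)
print(results["results"])  # per-task accuracy and related metrics
```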
Now, the key benchmarks that are widely adopted for LLM evaluation. You see news like "the new Falcon model surpasses GPT-3" or some other model; those LLM-to-LLM comparisons are based on widely adopted benchmarks like the following. First is the AI2 Reasoning Challenge (ARC): you give a set of grade-school science questions to the LLMs, get their answers, and evaluate how well each model answers them. Then there's HellaSwag, a common-sense completion test, which is fairly easy for humans but still hard for state-of-the-art models. Then there's MMLU, a test measuring a model's multi-task ability: you give multiple tasks to multiple LLMs, look at how each generates responses, and compute an evaluation score from that. And then there's TruthfulQA, a test measuring whether a model's outputs are truthful or whether it generates falsehoods, giving you an error measure. There are also things like how quickly a model responds. You can explore all the benchmarks in the harness, but these are four widely adopted ones. Still, they're not based on your use case.

So now we come to the exciting part: key concepts in advanced LLM evaluation. We start with building an evaluation set. Say you're building an LLM application for your organization; then you have a set of goals in mind, tasks your app needs to get done, or a set of answers you want your LLM to stay close to. You start by creating a very basic evaluation set. This is a manual effort: you write the prompts, then the acceptable answers, and you build up a hundred or two hundred samples like that, evaluating each sample. The manual work takes a little time, but once you have the set ready, you can automate the evaluation process on top of it.

For example, the evaluation criteria: if your LLM is able to answer, able to complete the task, that's good. If it's not, if it's generating outputs with hallucination, bias, toxicity, or safety issues, you catch it, and there are tons of properties you can customize based on your organization's needs. For example, if you're building an application for kids, you can define custom properties there based on the relevant regulations. These evaluation criteria are pretty important.

Coming to evaluation techniques, beyond the evaluation sets we just discussed: first there's the heuristic approach, which covers all the traditional methods we talked about; it's cheap, quick, and easy to do, but limited. Then there's LLM-based evaluation, and here's the interesting part: you're asking an LLM to evaluate an LLM. To make sure you're getting the right judgments, you take the evaluation set you created manually, the data describing how you want your LLM to behave, and you run it with another LLM, making sure that when your LLM generates output, the judge is able to evaluate it correctly on properties like toxicity and bias. And then there's human-based evaluation, which is manual, adaptable, and customizable, but costly and time-consuming. In terms of cost, LLM-based is somewhat costly, human-based is definitely costly, and the heuristic approach is not very costly; it depends on your use case. A minimal sketch of LLM-based evaluation over a small evaluation set follows below.
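To make LLM-based evaluation concrete, here's a minimal LLM-as-judge sketch over a tiny hand-written evaluation set, again assuming the OpenAI Python SDK; the evaluation set, the judge prompt, and the pass criterion are all illustrative assumptions.

```python
# Minimal LLM-as-judge sketch: one LLM grades another LLM's answers
# against a small, manually written evaluation set.
# (Assumes the OpenAI Python SDK; eval set and rubric are illustrative.)
from openai import OpenAI

client = OpenAI()

# Hand-written evaluation set: a prompt plus the expected answer.
EVAL_SET = [
    {"prompt": "What do unicorns eat?",
     "expected": "Unicorns are mythical creatures, so they have no real diet."},
    {"prompt": "How long do refunds take?",
     "expected": "Refunds are processed within 5 business days."},
]

JUDGE_PROMPT = """You are an evaluator. Compare the answer to the reference.
Reply with exactly PASS if the answer is factually consistent with the
reference and free of hallucination, bias, and toxicity; otherwise FAIL.

Reference: {expected}
Answer: {answer}"""

def judge(answer: str, expected: str) -> bool:
    """Ask the judge model to grade one answer against its reference."""
    response = client.chat.completions.create(
        model="gpt-4",  # the "judge" LLM
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(expected=expected,
                                                  answer=answer)}],
    )
    return response.choices[0].message.content.strip().startswith("PASS")

def get_answer(prompt: str) -> str:
    """Placeholder for the application under test."""
    return "Unicorns primarily eat grass and fruits."

passed = sum(judge(get_answer(s["prompt"]), s["expected"]) for s in EVAL_SET)
print(f"{passed}/{len(EVAL_SET)} samples passed")
```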
So right now, if you want to evaluate an LLM for your organization, LLM-based evaluation is something you can do on your own. The idea is to create your own evaluation set, run your LLM against it, and evaluate the results.

There are also certain ways to mitigate hallucinations. First is providing predefined input templates: if you want to reduce hallucinations in your particular application, you give it predefined input templates, so an input in this template will generate an output in that shape; see the small sketch just before the wrap-up. Then there's reinforcement learning with human feedback, from OpenAI: if you combine LLM-based evaluation with human feedback, you get a very precise and powerful evaluation signal you can use. The third thing, of course, is fine-tuning the model for your specific use case: do as much fine-tuning as you can for your particular use case, because you don't want your AI application to do everything; you want it to do the things you intend it to do. Then there's context injection: whatever prompt you're giving, give it more context, which simply increases the chances of getting the right answer. And finally, continuous validation and evaluation of your LLM: whatever model you're using, make sure to continuously validate and evaluate it, because that gives you the idea of how your app is performing, and whether the LLM you're using is degrading over time or improving. This is the critical monitoring step once you deploy.

That brings us to making LLM apps production ready. There are some best practices you can follow. Calculate more and more properties: as I said previously, if you're building for your own use case, you can define many custom properties across multiple topics. Iterate between versions: once you've created your LLM app, try iterating across multiple LLMs and use whichever is best for you. And, as mentioned previously, continuously validate and evaluate your LLMs.
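Before wrapping up, here's a tiny sketch of the prompt-side mitigations above, a predefined input template combined with context injection; the template wording is just one example of the pattern.

```python
# Predefined input template with context injection: constrain the model
# to the supplied context and give it an explicit way out instead of
# letting it guess. The exact wording is an illustrative example.
PROMPT_TEMPLATE = """Answer the question using ONLY the context below.
If the context does not contain the answer, reply exactly: "I don't know."

Context:
{context}

Question: {question}
Answer:"""

context = "Unicorns are mythical creatures that appear in folklore."
question = "What do unicorns eat?"

prompt = PROMPT_TEMPLATE.format(context=context, question=question)
print(prompt)  # this grounded prompt goes to the LLM instead of the raw question
```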
So that's pretty much it for the talk. Thank you! You can find me on Twitter; my username is up here if you want to connect with me. A bit about my company, where I work: we've built the Deepchecks LLM evaluation solution. If you want to try this out, there are two options. One is to create your whole evaluation set process and build your own infrastructure; the other is to use one of the available evaluation platforms that provide these things, so you can get started very quickly by just signing up. Choose whichever way works for you. I'd love to connect with you all; we have a sponsor booth here, and we have some swag as well, so if you want some swag from us, come visit our booth. And thank you again for attending the session. Thanks again.