All right, so since it's 2:50, we should probably get moving. You know, if you can find a seat, I know there's tons of people in here, stand against the wall if you have to. That's fine. I'm doing this for the camera. Yeah, oh, I'm so sorry, you're gonna have to come back later. Yeah, there's no room in the room. Boy, you really pack them in. All right, let's start this whole thing. So we're gonna talk about RAG. So introduce yourself, Yi. Hi, I'm Yi. So yeah, before this, well, I joined LlamaIndex five months ago, I think. Before this, I was working at Apple, working on chatbots. We started that team about seven, eight years ago. And the thing that happened this year is that it's only the second time in my life that I went to an event and thought, holy crap, every single thing that we've built prior to this is completely obsolete, right? Like completely obsolete. We built these chatbots for WeChat and Apple Business Chat, and we tried to do a good job. Obviously we had a good team. And if you use them today and compare them with ChatGPT, you're like, you know, which six-year-old made this chatbot that you built? Prior to that, I was working in algorithmic finance. And yeah, LLMs have really changed the game. So I decided, look, I have to do something different. And so I met Jerry and I'm like, okay, I'm going to join LlamaIndex. So here I am today. Then he met me and life went downhill. Oh, man. Like, come on, do a talk with me, Yi. Fun story, I actually did a talk with Yi and his wife and their baby. We did a panel, all four of us. Not kidding. So for those of you who don't know me, my name is Patrick McFadin, Apache Cassandra committer. I wrote a book that no one bought, because apparently everyone just loaded it into ChatGPT. Yeah, oh, you're all doing what I expected. And there's my LinkedIn if you want to connect with me on there. But yeah, I'm the guy from the videos. I talk about Cassandra all the time, mostly. Now I talk about LLM stuff because that's cool. Yeah, RAG all the way. All right, let's get into it. So, this is our safe harbor statement. This is you. When you hear the words gen AI, all right, let's breathe together. Okay, yeah, because nobody knows what this stuff does. Yi does, a little bit, and he does it all day. But we know that you're probably a little stressed out, feeling a little behind, you know, the imposter syndrome. And then of course we all have a boss, and you met mine yesterday, who's like, oh, we need an AI, right? So all right, just be cool, right? Remember, we're a community of people who figure stuff out. We're all smart engineers, let's work together. It is a cool thing, it is awesome. You just saw how I fixed my son's homework. So that's cool, I didn't have to build a math test. But we're just gonna work through this. And this is a place where you can ask questions and just have a good time together and not stress out. All right, let's start from the beginning. So Yi, you're up. Okay, so yeah, absolutely, totally echo what Patrick just said, right? Literally I started working with these models in January of this year, right? So quick show of hands, who here has heard of RAG? Okay, okay, all right, that's like 80 hands, maybe a hundred. Wow, that's a huge crowd and a lot of hands. Okay, so we'll do a little introduction anyway.
So if we go to the next slide. So, RAG. Okay, so LLMs are great at communicating and reasoning, right? They're not as good at knowing things, especially things that weren't part of their training data, in which case they don't know them. So what does RAG do, and how does that help us with this problem? So if we go to the next one. Do you want to know what the next slide is? Sure, you know what it is, there you go. Yeah, so, what is RAG? Retrieval-augmented generation. When I joined LlamaIndex, I saw these three words and I'm like, I don't know what these three words mean. I know what retrieval means, I know what augmented means, I know what generation means, but you put them together, what do they mean? It's actually a really simple concept, okay? It's really, really simple. You have an LLM like ChatGPT, okay? What you need to do is search for the most relevant information, give that data to ChatGPT, and then you get a better response out of it. That's it, right? It's like talking to anybody. If I'm talking to Patrick, I want to give him the most relevant contextual information so that he gives me the best response back. If I give him stuff that's not relevant, then he's not going to give me a very good response. In the same way, all you want is to give the LLM the most useful information. So if we go to the next one, right? How does RAG work in the most basic way? You have a document and you split it up, right? So you have a document, maybe you split it up into 10 chunks. Why do we even split the document up into 10 chunks? We could just give the entire document to the LLM. Well, there's a couple of reasons. One, sometimes the LLM has a limit, right? It's called the context limit. It can't take more than a certain amount of text. Two, the LLM takes longer to process all of that information. Another important thing: they charge you by the amount of information you send to the LLM. The more information you send, the more you get charged. And on top of that, when you give the LLM irrelevant information, guess what, it gives you back irrelevant data. That's not what you want. So what you do is take the document and chunk it up, okay? And then you find the most relevant pieces of information. How do you find them? Well, you use DataStax, right? Or me, yeah. Or Patrick, yeah, or you know, you just find Patrick. I thought you were talking about me, yeah. No, you mean the database, right? You use the vector database. And then you figure out, okay, maybe I want three chunks. That's what we call K, it's literally the mathematical terminology, the top-K chunks. And then you give those chunks to the LLM. If you've heard of RAG, if you've used RAG, I'm guessing most of you have already done this. But let's talk a little bit about other variations. So the next thing we want to talk about is the... yeah, what's the next thing? The search engines, right? Every search engine out there that does generative AI is also using RAG. So if you use Google Bard, it's referencing its database of websites, retrieving the most relevant information, and then outputting the answer for you, right? That's why it says, hey, you know, I'm searching. That's what it's doing. Same thing with Bing, same thing with ChatGPT with browsing. So almost undoubtedly, if you've used any kind of modern LLM application, you've probably already used RAG in the background.
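In code, the basic loop Yi is describing looks roughly like this with LlamaIndex's high-level Python API. This is a minimal sketch, not the talk's own code: it assumes a llama-index release from around the time of this talk, an OPENAI_API_KEY in the environment, and a placeholder ./docs folder and question.

```python
# pip install llama-index   (and export OPENAI_API_KEY)
from llama_index import VectorStoreIndex, SimpleDirectoryReader

# Load the documents; the index chunks them, embeds each chunk,
# and stores the embeddings in a vector store (in-memory here).
documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents)

# At query time: embed the question, retrieve the top-K most
# similar chunks, and hand only those chunks to the LLM.
query_engine = index.as_query_engine(similarity_top_k=3)  # K = 3
print(query_engine.query("What does the ops runbook say about compaction?"))
```

Every knob mentioned above, the chunking, the K, which LLM answers, is configurable, but the defaults are enough to see the retrieve-then-generate loop work end to end.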
Okay, so next slide, we can try it, right? So I guess this might be a little bit hard to see from the back, but I used ChatGPT with web browsing mode and I asked, what 2022 movie won the Oscar, right? Good question. It did the web browsing search and it said the movie CODA won the Best Picture Award at the Oscars in 2022, which is correct. It did, wonderful news for Apple TV+. And I said, well, what year did CODA come out? And it said CODA actually came out in 2021, which is also correct. So I said, well, is that a 2022 movie? I'm literally talking to ChatGPT here: is it a 2022 movie? And it said, oh, actually no, CODA came out in 2021. So it realized its mistake, right? And I said, okay, so which 2022 movie won the Oscars? Right? That's a question any of you could look up on the internet. And it said, oh, well, sorry, no movie released in 2022 won the Best Picture Award at the 2022 Oscars, correct, the award went to CODA, a film released in 2021. So everything it's saying here is correct, except for the first answer. And this illustrates an important point about doing RAG: your output is only gonna be as good as the retrieved context you put in, right? So even with ChatGPT, if you ask it this question, because it's not retrieving the right input, it can't give you the right answer, right? So if we go to the next slide, I think that's... You wanna guess? You wanna bet? I told you, I rearranged your deck. Okay, no, all right, here we go. Oh yeah, that's garbage in, garbage out. Okay, next slide, I think it's yours, right? Oh, new to RAG, okay. So we do have a project, it's called RAGs. It was just recently updated, and actually now it's got like 5,000 or 6,000 GitHub stars. What it does is it uses an agent to help you build a simple RAG pipeline so you can actually play around with it and see what RAG does. So I highly encourage you to check that out. Okay, now. All right, is it my turn yet? Yes. Yes, all right, so we have 20 minutes, I'm keeping track of you, all right? So we're doing pretty good. But what I wanna talk about is a quick example application, because we're all engineers here, we like to get our hands on something, we don't want all this theory. And I know that because you didn't buy my book. Did I say that already? Yeah, you did. Okay, so yeah, you apparently wanna do hands-on, right? So we have a little example app, oops, hold on. Why is this slide up again? So we have this, and we actually use this in production on Astra. We use this to answer questions. It connects with Intercom, it connects with Slack. And it's a pretty basic RAG application, and it's meant to be that, but it uses LlamaIndex on the back end, it uses Astra for the vector store, and it uses your own docs. And what's pretty slick is it's all available, you can use it today. And there's the URL, and if you wanna take a picture of that, I'll wait, and I'll get into it if you want. All right. Open source. Open source, completely open source, modify it all you want. But yeah, I'm gonna show you real quick what it looks like, so let's see here. When you go to the repo itself, we have a Gitpod link, and Gitpod, if you've never used it, is super slick, because you can click on it and it'll open up all the code in a little IDE that's on the web, and it works pretty well. Don't expect it to be highly performant, but it is pretty cool how it sets up a nice little environment for you. You can run a server on it, that sort of thing.
Think about Colab notebooks, how cool those can be. This is more for code, right? The free tier works pretty well with GitHub. I feel like I'm selling Gitpod now. Yeah, but it is really cool, and it's great for when you're sharing things with colleagues, because you can just put it on your repo and send it out. Okay, so that's Gitpod. There's a couple of things in here that are pretty interesting, and I think, did I? Hold on, is this my old one? The one where I had my actual OpenAI keys? They got exploited. So you and I did a workshop and I put my OpenAI keys in there. Not 20 minutes later, I started getting traffic on it. Yeah, so those are like the new AWS keys, by the way, y'all just get rid of those things as fast as you can. Another pro tip: put a limit on your OpenAI account, like a rate limit. It will stop the Bitcoin miners. So, there's something I should just show you real quick on how this works. It's in the configuration, and this is really cool. This is what you can do to get some real value out of this project. Let me reduce my terminal window. What it'll do is take your doc pages, or something like that, as HTML links, and it will go through them. It will do some chunking, and we'll go through chunking strategies in a minute, but it does some chunking, it will break the pages up, vectorize them to create the embeddings, and then it'll make a chatbot around your own docs. We use it internally. We fed it everything. Matter of fact, Jeff talks to it all the time. Thanks, Jeff, just to give it more information. But really it's meant so that when you say, hey, I have a Cassandra operations problem, it won't tell you how to fix an Oracle database. Because that's kind of a ChatGPT thing. Like, oh, my Oracle database or my Cassandra database crashed. Oh, well, you should go look at your Oracle config file. And I'm like, that's not what I wanted. But think about what you can do with your own docs. Especially if it's internal docs, right? Granted, there's about a bazillion of these out there, but it's a pretty cool little project and it walks you through a lot. So I just wanted to show you how that works. And pretty simply, when you connect with Astra... ah, stay logged in. Like, I've had this thing running for a while. Oh, God, there are those hackers in there again. Man, what's up? But it's very simple. You don't have to set up a data model. You just put in a key. It'll create it all for you. It literally takes like five minutes to set up. So that's the quick rundown. We're not gonna go through the code. This is really meant to show that this exists and how simple it is to set up. Go try it. You can find me when I'm walking around. You can hit me up on Slack, I'm on the ASF Slack. You know how to find me. I know you all do, because I know all of you... except for you. I've seen all your videos though, dude. You're awesome. You didn't think I saw you, did you? But anyway, I'll be happy to help you with any of that.
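To give a feel for that ingest step: the sketch below is not the actual code from the DataStax repo, just the same pattern under stated assumptions. The URL is hypothetical, and where the real app persists embeddings to Astra, this keeps the index in memory so it stays self-contained.

```python
# pip install llama-index requests beautifulsoup4
import requests
from bs4 import BeautifulSoup
from llama_index import Document, VectorStoreIndex
from llama_index.node_parser import SimpleNodeParser

# Hypothetical doc pages; the repo takes these as configuration.
urls = ["https://docs.example.com/getting-started"]

docs = []
for url in urls:
    html = requests.get(url, timeout=30).text
    # Strip the HTML down to plain text before chunking.
    text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
    docs.append(Document(text=text, metadata={"source": url}))

# Chunking: break each page into overlapping ~512-token chunks.
parser = SimpleNodeParser.from_defaults(chunk_size=512, chunk_overlap=64)
nodes = parser.get_nodes_from_documents(docs)

# Embed the chunks and build the index the chatbot queries.
index = VectorStoreIndex(nodes)
print(index.as_query_engine().query("How do I run a repair in Cassandra?"))
```

Swap the in-memory store for Astra, or any vector database, and put a chat UI in front of it, and you have the rough shape of the docs bot being demoed here.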
So, let's go back to you. Yeah. And I think Patrick brings up a very important point here, which is that, like I was saying, you can build a very basic RAG system fairly easily, but there is no what I like to call God-mode RAG, okay? There is no retrieval-augmented generation where you can give it every single document of any type you want and it just works, okay? And people are trying to build it. I've seen OpenAI try to build it. I've seen Anthropic try to build it. All of them currently still have limitations. And part of that is because humans are extremely creative, right? We have lots and lots of different documents. We like to write them in lots and lots of different ways. There is no one size fits all. So what I'm really excited about with this project that DataStax open sourced, which all of you should try, is that it really is what they're doing in production, right? Getting from, hey, I built a cool demo, to this thing actually works and we're gonna use it for real production use cases with real users, takes some work. It also takes some moxie. Yeah, putting GenAI apps into production, that's a bit of a leap of faith right now. Remember the original little statement, where we're like, hey, this is new, you all should just be okay with yourselves? Yeah, we run into things like, oh, there's limits on OpenAI that we just hit, or... Exactly. Yeah, or we wound up maxing out the server that was doing embeddings. I mean, we're finding new problems that we have to solve as engineers. Yes, yes, and it's very new and it's very rough around the edges. And don't be surprised if you change your prompt and it works fine, and then you show your demo and it's like, my prompt doesn't work anymore, and why? That happens all the time. On top of that, for RAG, there's a whole set of layered strategies that, once again, we don't have time to go through today, but that would be useful in production. So you have 13 minutes, you can do this. Trust me, man, you got this. Don't we have to do Q&A also? That's the 13. So one of the things that people have talked about is chain of thought. There are different prompting strategies. Gemini just came out today, and in their release they said, hey, we did really well on this MMLU benchmark, but then if you look at the asterisk, they said, oh, we actually used this new type of chain of thought with 32 different examples. So there's lots of different prompting strategies out there. Chain of thought is one of them. Microsoft yesterday, to sort of steal the thunder, said, oh, actually we came up with this other one, and it works even better with GPT-4. Definitely this is an active, active area of research.
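To make "chain of thought with 32 examples" concrete, here is a rough, hypothetical sketch of k-shot chain-of-thought prompting: you show the model a few worked examples that spell out their reasoning before asking your real question. The examples and question below are made up, and a benchmark run like Gemini's CoT@32 would use 32 such examples rather than two.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Two worked examples (k = 2) that demonstrate step-by-step reasoning.
COT_EXAMPLES = """\
Q: Roger has 5 tennis balls. He buys 2 cans with 3 balls each. How many balls does he have now?
A: Roger starts with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. It used 20 for lunch and bought 6 more. How many are left?
A: 23 - 20 = 3 apples remain. 3 + 6 = 9. The answer is 9.
"""

question = "A library has 12 books on a shelf. It lends out 5 and gets 2 back. How many are on the shelf?"
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": f"{COT_EXAMPLES}\nQ: {question}\nA: Let's think step by step.",
    }],
)
print(response.choices[0].message.content)
```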
A second one that we see a lot of use in production is this class of strategies called small-to-big. You're embedding a smaller chunk and retrieving a larger chunk around it. So why would you wanna do that? Let's give an example, right? Let's say each of your chunks is a sentence, which is actually the smallest chunk size that I would recommend. What is the embedding model doing? It's trying to understand the intention of the sentence, right? So if you have just that sentence, it's gonna be a very concentrated intention. If you give it a page of text, then the embedding model is gonna be a little more generic. It doesn't really quite understand what exactly that page of text is saying, right? But if it's a sentence, a sentence where you have a subject, verb, object, it understands that, right? So you get a much more concentrated intention in the embedding if your chunk size is smaller. But what's the problem with smaller chunk sizes? You may miss other useful contextual information. So to give you an example, let's say we decided to make every sentence its own embedding. We'd say something like, okay: Patrick McFadin is a software engineer. That's sentence number one. Sentence number two: he's giving a talk today at the AI.dev conference. And sentence number three... Patrick? I invented Java hash maps, yeah? There you go, sentence number three: I invented Java hash maps. So you've got three sentences, right? Now, let's say the query comes in and says, what is Patrick doing today? Okay? Or even better, let's say: what is Mr. McFadin doing today? Yeah, yeah, very formal. Which embeddings is it gonna match? You know the answer is in sentence number two, right? But most likely it's gonna match sentence number one and sentence number three. Why is that? Because McFadin is a very uncommon word in your entire corpus, right? I love being an example like this. Keep going, that's great. So because of that, the embedding model is much more likely to say, well, your query has McFadin, sentence number one and sentence number three have the word McFadin, and sentence number two does not have the word McFadin in it. Therefore, I'm gonna say your query is more similar to sentence number one and sentence number three. So this is where the small-to-big type strategies come in, right? Because by saying, okay, we matched sentence number one, but let's get some more information around sentence number one, instead of just taking that one sentence, you can pick up the actual answer, which is in sentence number two. So small-to-big is very, very commonly used in production. I highly recommend you check it out if you're building a production system. Then we have sub-question decomposition. This is also something that is very commonly used, which is: you have a query and it can have multiple parts, right? What's a good example? Say, hey, what happened at the AI.dev conference and at the Cassandra conference, right? So query decomposition can say, okay, instead of trying to match this entire sentence at once, we can separate it into a question about the Cassandra conference and a question about the AI.dev conference. And then we also have re-ranking, which is another strategy. It's a little bit more complicated, but the basic idea is that when you get the chunks back out, you can re-rank them and choose the most relevant chunks, given all the chunks you've retrieved. And then we also have stuff with multimodal, which people are very excited about. You know, Gemini came out today, and we have day-one support for Gemini. So all of these things are part of LlamaIndex. Unfortunately, I don't have time to show you the code or examples today, but you can check them out. It's in our documentation.
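Since there wasn't time for code in the talk, here is a hedged sketch of both ideas in LlamaIndex. Class names and import paths follow the 0.9.x-era documentation as I understand it, and the folder paths, index names, and questions are placeholders. First, small-to-big via sentence windows: embed single sentences, but hand the LLM the window around each hit.

```python
from llama_index import VectorStoreIndex, SimpleDirectoryReader
from llama_index.node_parser import SentenceWindowNodeParser
from llama_index.postprocessor import MetadataReplacementPostProcessor

# Small: each node is a single sentence, so the embedding carries one
# concentrated intention; the neighboring sentences go into metadata.
parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,  # keep 3 sentences on each side as context
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex(parser.get_nodes_from_documents(documents))

# Big: after retrieval, replace each matched sentence with its window,
# so the LLM sees the surrounding context (sentence two, in the example).
query_engine = index.as_query_engine(
    similarity_top_k=2,
    node_postprocessors=[MetadataReplacementPostProcessor(target_metadata_key="window")],
)
print(query_engine.query("What is Mr. McFadin doing today?"))
```

And sub-question decomposition over two hypothetical per-topic indexes:

```python
from llama_index import VectorStoreIndex, SimpleDirectoryReader
from llama_index.query_engine import SubQuestionQueryEngine
from llama_index.tools import QueryEngineTool, ToolMetadata

# Hypothetical indexes built from two folders of conference notes.
ai_dev = VectorStoreIndex.from_documents(SimpleDirectoryReader("./ai_dev_notes").load_data())
cassandra = VectorStoreIndex.from_documents(SimpleDirectoryReader("./cassandra_notes").load_data())

tools = [
    QueryEngineTool(
        query_engine=ai_dev.as_query_engine(),
        metadata=ToolMetadata(name="ai_dev", description="Notes from the AI.dev conference"),
    ),
    QueryEngineTool(
        query_engine=cassandra.as_query_engine(),
        metadata=ToolMetadata(name="cassandra", description="Notes from the Cassandra conference"),
    ),
]

# The engine splits the compound question into sub-questions, answers
# each against the matching index, then synthesizes one response.
engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=tools)
print(engine.query("What happened at the AI.dev conference and at the Cassandra conference?"))
```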
And if you go to the next slide. Yeah. The most important thing is that this is all part of the open source community, right? So Cassandra's open source, Astra's open source, LlamaIndex is open source, right? The most important thing is that we are working on this together, and we're actively researching these ideas. And for all of you, the most important thing that we can do is have you contribute, right? Help us, use your data sets, use your use cases, and help us understand how to do this stuff even better. So I think that really caps us off, because I bet you, and I think you're right that we should leave a lot of Q&A time, that a lot of this is new to all of you. Especially this slide right here, this is like the money slide, right? Because when I first started doing RAG stuff, I was like, there's one RAG to rule them all, right? No, no, and you all keep adding them. This is the thing I love about LlamaIndex: they just keep adding stuff. Here's a pro tip: if you use the LlamaIndex library in any project, lock the version. Because I guarantee you the next minor revision will break something, because they just keep moving forward. Not a bad thing, right? Well, we're gonna fix that. Yeah, I know, I know. We're gonna fix that. But it's a fast-moving project and there's a lot of new things happening. So yeah, why don't we stop and do Q&A, because I think that might be the most helpful, and you've got Yi here. He's been doing this for, like, months. Yeah, exactly, months. That's way more than years. Months of experience. Yeah, yeah, he's way ahead of everybody, right? You gonna start this off? All right, go ahead, what's up, Raul? Well, what's the call, really? Yeah, yeah, yeah. Yeah, that's an old Cassandra joke back in the day. It was like, we're looking for Cassandra engineers with 10 years of experience. Never mind. So, good thing you asked, and I would separate that into two things. Do we have metadata filtering? Yes, we do. But the second thing is, to your point, we have two layers of the API. We have the high-level API and we have the low-level API. And in our experience, most people right now are using the high-level API, right? So metadata filtering and some of these more advanced features that I'm talking about today are part of the lower-level API, and we've really been trying to publicize that. Logan has a web series where he's talking about the low-level API. Jerry has a series about building RAG from scratch, right? All of these are really trying to say, hey, look at the more advanced features. So metadata filtering is actually one of them. Right now it's not as publicized as we need it to be, so that's one of the things that we need to do also. Community education, absolutely. That's why you get a t-shirt. Next question.
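For anyone curious what that lower-level knob looks like, here is a minimal sketch of metadata filtering at retrieval time. It assumes your nodes carry a hypothetical "source" metadata key, and the import path follows the 0.9.x-era documentation.

```python
from llama_index import VectorStoreIndex, SimpleDirectoryReader
from llama_index.vector_stores.types import ExactMatchFilter, MetadataFilters

index = VectorStoreIndex.from_documents(SimpleDirectoryReader("./docs").load_data())

# Only similarity-search chunks whose metadata marks them as coming
# from the ops guide, instead of searching the whole corpus.
retriever = index.as_retriever(
    similarity_top_k=3,
    filters=MetadataFilters(filters=[ExactMatchFilter(key="source", value="ops-guide")]),
)
for hit in retriever.retrieve("How do I run a repair?"):
    print(hit.score, hit.node.metadata.get("source"))
```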
Anybody? Fine, Jeff? Yeah. I'll ask my question. Please do. Absolutely, absolutely. So they absolutely build on top of each other. And generally the way we see people building these systems in production is trying one, then looking at the data and saying, hey, what am I missing? Right? So the first thing you can do, even before any of this, is there's a slider that says, hey, what is my chunk size? So the first thing you can do is adjust that, right? You adjust the chunk size. And then you're like, well, my embeddings are kind of all the same, it's not really getting it, and then you say, okay, maybe I'm going to do the small-to-big, right? Or you say, look, I really need more examples because the LLM doesn't really understand what I'm doing, and that's where the chain of thought comes in, especially the chain of thought k-shot. So yeah, they're layered strategies on top of each other. I really wish I had put this in the slides, but OpenAI's solutions engineering team actually had this great slide where they said, we got this level of accuracy with just naive RAG, I think it was something around 50%, right? And then they said, look, by adding on this other strategy, we got to this level. And then by adding on this other strategy, we got to this level. And then they said, where we ended up with the customer was at 98%. Okay, I don't necessarily think that every application is gonna get to 98%, but that concept of layering on strategies really works. And actually, after we saw that, we said, okay, how many of these do we have in LlamaIndex? The answer is actually, we have all of them. We have implemented all of these additional strategies, but you really do need to look at your data and layer on the additional strategies. Max, we had a question. Yeah, so very, very good question, right? So we have a volunteer doing load testing. Yeah. Good job, Max. Good contribution. So, does the Python code scale to millions of documents? Yes. Do the strategies scale to millions of documents? Not yet. We are actively working on this. This is very, very interesting. It's a very hard problem if you think about it, right? Basically, ultimately what you're trying to do is replicate Google, right? You're trying to redo the search problem, but even better. And embedding search does help you do that, but there are some other steps along the way that we are actively working on. Yes. Absolutely, absolutely. So I don't know this one either, because we talked about this before. It's like, how do you know that you got a hallucination? Yeah, so once again, this is a very, very active area of research and development, right? What we call evaluations, very, very active. We have some built-in evaluation tools you can use with LlamaIndex, by no means comprehensive (a minimal sketch of them follows below). The rest of the industry is also working on evaluation tools, and there are companies working on this too. On top of that, there's the question of, do you have the LLM do all the evaluation? Our evaluations are done with LLMs, but sometimes you might want human evaluation also, so how do you integrate that? So, very, very active. Definitely check out our evals page for the evaluation strategies we have, but yeah, if you have a new idea, definitely send it our way also. All right, we got room for one more question. Yeah, I'm getting the wrap-up signal. You're gonna bring out the hook? Please don't. All right, one more question out there. Oh, no one wants the final word. All right, great. Go to the next talk. Don't hang out in the hallway. No, hang out in the hallway. They're all recorded, so you can just talk all day. It's okay. But thanks a lot, Yi. Thank you so much for the invite. And you know what? Here's the thing, he didn't know how I set him up. I totally messed up his deck and he did an awesome job, so let's give him a hand. Thank you. You did good, man. Let's play presentation jumble again sometime. Okay, cool. Happy to, thank you. Thanks, everyone.
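The evaluation sketch mentioned above: a minimal example of LlamaIndex's built-in LLM-as-judge evaluators, with class names per the 0.9.x-era docs and a placeholder ./docs folder and question.

```python
from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.evaluation import FaithfulnessEvaluator, RelevancyEvaluator

index = VectorStoreIndex.from_documents(SimpleDirectoryReader("./docs").load_data())
query = "Which film won Best Picture at the 2022 Oscars?"
response = index.as_query_engine().query(query)

# Faithfulness: is the answer actually supported by the retrieved
# chunks, or did the LLM hallucinate past its context?
faithfulness = FaithfulnessEvaluator().evaluate_response(response=response)

# Relevancy: do the answer and its context actually address the query?
relevancy = RelevancyEvaluator().evaluate_response(query=query, response=response)

print("faithful:", faithfulness.passing, "| relevant:", relevancy.passing)
```

Both evaluators make a second LLM call as the judge, so they cost tokens; that is fine for spot-checking a test set, and something to budget for at scale.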