Hi, everybody. I'm Laurie Voss. I am VP of developer relations at LlamaIndex. I've been a developer for 27 years now; for a while there I was co-founder of npm, Inc., so you may know me from the JavaScript world, but now I talk about AI. Specifically, today I'm going to be talking about retrieval-augmented generation, or RAG, which you heard Jerry talk about in the keynote.

We'll cover what LlamaIndex is first, just to get our bearings, and then we're going to dive into the stages of any RAG application: ingestion, indexing, storing, prompting, and querying, which is the chunkiest part. Then we'll go into some in-depth examples of various querying strategies that might be useful to your application. What I'm hoping you'll take away from the talk today is what a rich field this is. There are a ton of different ways to do retrieval-augmented generation, and depending on your use case you're going to need to pick the best strategy. You'll obviously also learn how to do all of this stuff in LlamaIndex specifically; otherwise, what would be the point?

But first, just in case, a reminder: what is retrieval-augmented generation? The core idea behind RAG is that LLMs have less space in their prompts than you have data. You have mountains of data and they have limited context windows, so you can't give them all your data. So every time you have a question, you have to retrieve the data most relevant to your query, augment your prompt with that data, and then generate the answer.

So what is LlamaIndex? It's a framework for building RAG applications. It is open source and free, available at llamaindex.ai, with versions in both Python and TypeScript for the JavaScript fans out there. LlamaIndex also provides a registry of software called LlamaHub. The hub provides a huge library of software to help make RAG applications better. It includes things like connectors to connect to your favorite data source, tools for building agents, datasets for evaluating your applications, and a set of things called Llama Packs, which are essentially arbitrary bundles of code that take complex problems and usually turn them into one-liners. I'll mention these again as they become relevant.

Thirdly, and most importantly, LlamaIndex is a way to get these applications into production. We are not about building demos in notebooks.
We are about building stuff out in the real world, and there are a lot of ways that we facilitate that. One that got a lot of traction recently is a command-line tool called create-llama. It is loosely based on create-react-app, if you've ever used that. It gives you a menu of options and it creates a full-stack, front-end-plus-back-end application that you can deploy in one click to Vercel or Render or a service like that. The back end is available in TypeScript or Python, depending on your preferred flavor.

By default LlamaIndex uses OpenAI as the LLM, but we integrate with every LLM that we could find out there. OpenAI is easy and everyone knows it, but there are more than 25 different models and APIs that we connect to, including local embedding models if you don't want to use an API. We also support over 30 vector databases, obviously including Cassandra, including everyone you've heard of and probably some that you haven't. And we support hundreds of data sources, more than fit onto a slide: file systems, every file type you can imagine, Google Drive, Dropbox, Notion, Slack, GitHub, S3, Discord, every database you can imagine. We will connect to your data and then connect that data to the LLMs.

We are an extremely batteries-included framework. In addition to all of those connection options, we provide advanced retrieval strategies that work out of the box, as I'm going to show you today. That includes agents, that is, semi-autonomous actions, and multi-modality, that is, images as well as audio and video in addition to the text you're familiar with. We also have integrations to support observability and an array of tools to let you evaluate your models. I can't possibly cover all of that today, so we're going to be focusing on advanced querying strategies.

First, let's recap the stages of a RAG application. First you get your data, wherever it is, and load it in using those connectors from LlamaHub that I was just talking about. Then you need to embed it in vector space so that you can find the most relevant data; that's called the indexing phase. Then you need to store all of your embedded data somewhere. That's where all of those vector stores I just mentioned come in, and that's obviously called the storage phase. And then there's the phase we call querying, which is the most complicated and chunky of the phases, because it's really three things in one. First you have to retrieve your data. Then you combine the retrieved context with your query, in a phase we call synthesis, combine that with your prompt, and send it to the LLM. After the LLM returns results, you can further post-process them. Today we're mostly going to be talking about the retrieval stage, but in order to do querying we're going to touch on all three of these stages over and over. So let's go through the phases of a RAG application.
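To make those stages concrete, here's roughly what the naive version looks like end to end. This is a sketch assuming the LlamaIndex Python API from around the time of this talk (pre-0.10 import paths); the "./data" directory and the query string are placeholders.

```python
from llama_index import SimpleDirectoryReader, VectorStoreIndex

# Ingestion: load everything in a directory into Document objects
documents = SimpleDirectoryReader("./data").load_data()

# Indexing + storing: embed the documents into the default in-memory vector store
index = VectorStoreIndex.from_documents(documents)

# Querying: retrieve relevant context, synthesize a prompt, and ask the LLM
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
print(response)
```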
Let's start with ingestion. The very simplest form of ingestion is just to load a bunch of files on disk into memory directly; in LlamaIndex you can do that with one line of code. The SimpleDirectoryReader shown above can handle a huge variety of file formats: CSVs, PDFs, Word files, and also images, audio, and video for when you want to do multi-modal things. If you're happy with our default settings, then you're done at this stage: you can create an index and start querying it. But you're unlikely to be happy with it at this stage, because this is the most naive RAG stack you could possibly put together. What you want to do is customize it according to the state and shape of your data, in which case you want to build an ingestion pipeline.

This lets you configure a series of transformations that happen to your data, including how it is split, what metadata you're extracting, and what embedding model you use. The result is a set of nodes that you can pass to the same VectorStoreIndex we saw a moment ago. And of course, in production you'll probably end up running the same pipeline over and over: you'll want to experiment with your parameters, and your data will be changing as you go. So we allow you to cache your ingestion pipeline. The naive approach is caching it to disk, but you can also cache it to a variety of databases. When you rerun your pipeline, it reruns only the parts that have changed and recalculates only the things that were modified. There's a sketch of what this looks like in code below.

With ingestion done, that brings us to indexing. As I mentioned, this is the stage where you take your chunks of data and embed them into vector space. We support a huge set of embedding models, just like we support a huge set of everything, and they all have different performance in terms of both speed and quality of retrieval. The workhorse here is a class called VectorStoreIndex, which takes care of getting all of the embedding done, whether you're talking to a local model or to an API. You'll see us call out to VectorStoreIndex over and over in the examples coming up. Depending on your use case, you may also want to check out our knowledge graph index. This takes unstructured text, splits it into entities and relationships, and then performs entity-based queries on that data, which can be very useful depending on what kind of use case you have.

So now you have all of your embeddings, and it's time to store them. Vector stores are all trying to do roughly the same thing: they take giant piles of numbers and allow you to search against them using vector math. However, there's a lot of differentiation in the space, especially in terms of performance, as we heard this morning and as we're going to see later, and also in which ones allow metadata filtering and hybrid search. Don't worry, I'm going to explain what those things are.

That brings us to the querying stage, and first to the thing we're not going to talk about very much, which is prompting.
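Before we get to prompting, here's a sketch of the ingestion pipeline and caching just described. It assumes the same pre-0.10 Python API; the chunk sizes and the cache path are arbitrary placeholders, and you could add metadata extractors to the transformation list as well.

```python
from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings import OpenAIEmbedding
from llama_index.ingestion import IngestionPipeline
from llama_index.node_parser import SentenceSplitter

documents = SimpleDirectoryReader("./data").load_data()

# Transformations run in order: split into chunks, then embed
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=64),
        OpenAIEmbedding(),  # or a local embedding model
    ]
)
nodes = pipeline.run(documents=documents)

# Cache the pipeline so re-runs only recompute what changed
pipeline.persist("./pipeline_cache")

# The resulting nodes feed the same VectorStoreIndex as before
index = VectorStoreIndex(nodes)
```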
Other RAG frameworks talk a lot about prompting and prompt engineering, but we take a different view. We are a batteries-included framework: we have created prompts that are already pretty good, prompts that work well with the specific LLM you are talking to, and we believe we should be able to provide you with an excellent prompting experience right out of the box, so you don't have to figure it all out yourself. Of course, if you want to, you can still modify the prompts. The most obvious thing you can do is get a query engine, then get the prompts from that query engine to find out what we would be saying to the LLM on your behalf, and pass in your own, changing the prompt to be whatever you want. The syntax of prompt customization is going to be very familiar if you've used any other framework for prompt customization before.

And that brings us to basic querying. Let's start with the most basic form: we just get our index to give us a query engine, accept all the defaults, and run a query. This is part of why LlamaIndex is so popular: it is easy to use. You can get all the way from loading your data to running a query and getting a response in five lines of code, if that is your bag.

Of course, like everything in LlamaIndex, your query engine can be customized. What we do by default is retrieve the top two most relevant pieces of context from your data and return them as the context for your query, but you can customize this: you can set it to five or any other number and retrieve a different volume of context. This is going to be very important if you've split your text up into lots and lots of extremely small chunks; then you're going to want a much larger set of retrieved context. You can also configure the synthesizer, which is how the engine puts the query together before it sends it to the LLM. There are a bunch of available strategies here, and I'm afraid we don't have time to go through all of them, but effectively what they all do is group your chunks of context together and query the LLM with them. Finally, you create your query engine by combining your retriever and your synthesizer and querying the LLM. So that is basic querying, all the way from loading your data down to getting your response.

Now let's talk about something beyond the naive strategy; let's level up a little bit. The first one we're going to look at is conveniently packaged up for you as the SubQuestionQueryEngine, and the problem we're trying to solve with it is complicated questions: questions that have multiple parts, questions that are not just a simple yes or no, or that may require getting data from more than one source. What this engine does is break your query up into a series of simpler questions, using the LLM to decide what those simpler questions should be. Then, given an array of data sources, it routes each sub-question to the appropriate data source and combines the answers from each data source into a single answer.
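Here's roughly what wiring that up looks like; a sketch, again assuming the pre-0.10 Python API, with the essay directory, tool name, and description as placeholders.

```python
from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.query_engine import SubQuestionQueryEngine
from llama_index.tools import QueryEngineTool, ToolMetadata

documents = SimpleDirectoryReader("./data/paul_graham").load_data()
essay_index = VectorStoreIndex.from_documents(documents)

# Each data source becomes a tool; the metadata tells the LLM what it's for
query_engine_tools = [
    QueryEngineTool(
        query_engine=essay_index.as_query_engine(),
        metadata=ToolMetadata(
            name="pg_essay",
            description="Paul Graham's essay about what he worked on",
        ),
    ),
]

# Breaks a complex question into sub-questions, routes each one to the
# right tool, and combines the answers into a single response
engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=query_engine_tools)
response = engine.query(
    "What did Paul Graham work on before and after starting Y Combinator?"
)
```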
In this toy example we're creating just one data source, but it's a list: you could pass as many data sources as you wanted. What we've done is create a simple query engine that talks to Paul Graham's essay, one of our favorite examples for some reason, and we're assigning metadata to it which specifies what this tool is able to do and what it's called. The sub-question engine looks at each of the available tools and routes each question to the appropriate simple query engine. You can have as many sources in here as you want.

The next problem you want to tackle is precision. Basic RAG can struggle to make sure that the context is precisely related to the query, and one tactic for handling that is known as small-to-big retrieval. I mentioned earlier that you might want to split your data into very small chunks; this is one of those situations. You can break a very large document up into, for instance, single sentences and do retrieval on those, so you get back a very relevant single sentence. But then the LLM is trying to answer a question about a single sentence, and that's not enough. So instead you say: give me the context around this sentence. It retrieves the sentence, and then it retrieves the five sentences before and the five sentences after, and gives you that larger window.

Getting this done in LlamaIndex is disarmingly simple, because again we've done all of the work and packaged it up for you. When creating our query engine we can supply a list of node post-processors, which work after retrieval but before synthesis, and we use the catchily named MetadataReplacementPostProcessor with a target metadata key of "window" to do that search automatically. The nodes know which node comes before them and which node comes after them, so it can automatically retrieve the five before and the five after, before it sends the query to the LLM.

Small-to-big is precision by post-processing nodes after retrieval; you can also get better precision by pre-processing which nodes you do the retrieval on. One way you can help the LLM out is to use existing metadata about your nodes and filter on it before retrieval, in addition to using embedding retrieval. Some vector databases, such as Apache Cassandra, natively support attaching arbitrary metadata to each of your embeddings. So you can attach what are basically keywords to the chunks of data you've embedded, and you can then filter on those keywords. A really great example of this is anything based on time: as in this example, you can tag your chunks with the year the data comes from and pre-filter based on that year.
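Here's a sketch of that manual pre-filtering, assuming chunks tagged at ingestion time with a hypothetical "year" metadata key (the documents and query are placeholders):

```python
from llama_index import Document, VectorStoreIndex
from llama_index.vector_stores.types import ExactMatchFilter, MetadataFilters

# Hypothetical documents tagged with a "year" metadata key
documents = [
    Document(text="2021 annual report ...", metadata={"year": 2021}),
    Document(text="2022 annual report ...", metadata={"year": 2022}),
]
index = VectorStoreIndex.from_documents(documents)

# Pre-filter to the 2021 chunks, then do embedding retrieval within that subset
filters = MetadataFilters(filters=[ExactMatchFilter(key="year", value=2021)])
query_engine = index.as_query_engine(filters=filters)
response = query_engine.query("What were the highlights of that year?")
```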
But that sounds kind of annoying, because that's you doing stuff in code, and these are supposed to be LLMs; they're supposed to be smart. So you can use the VectorIndexAutoRetriever to do that work for you. We describe each of the metadata fields and say what they're for, and the LLM can decide what metadata filtering it should do before it does retrieval. This allows you to keep making your queries in natural language without having to do any manual refinement on your part. As you can see from this list, most of the vector databases that we support also support metadata filtering; only a handful don't, so that's really good news for you.

Our next example of an advanced retrieval technique is hybrid search. Embedding-based vector retrieval is truly magical: it's searching by meaning. But we have put decades and decades of research into traditional search engines, and it turns out that they are also really good. So you don't have to pick one or the other; you can combine the two. Some of the most interesting vector databases are actually search engines executing a pivot, so they allow you not just to do top-k retrieval but also to use existing search algorithms, one of the most prominent of which is called BM25. Once you've got a database that supports hybrid search, using it is again extremely easy. You just pass a parameter called vector_store_query_mode set to "hybrid", and then you pass a parameter called alpha. Alpha is the balance between traditional search and vector search, zero being entirely traditional search and one being entirely vector search. As you can see from this list of vector databases, the group that supports hybrid search is much smaller. Not on this list is Vespa, not because it doesn't support hybrid search but because we don't support it yet; we're getting around to that very soon. As an ex-Yahoo myself, I will always have a soft spot for Vespa.

Another use case for advanced retrieval is complex documents. Imagine a document that isn't just text: a document that contains charts, that contains tables. Let's consider how you can break a complex document down into a set of simpler queries. But first I need to introduce you to our ability to query tables in the first place. We have a fantastic built-in called the PandasQueryEngine. You can create a pandas DataFrame in Python, and using the LLM it can generate correct pandas calls that query those tables to get the right values out of your data.
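Here's a sketch of the PandasQueryEngine on its own, with a made-up table standing in for one extracted from a document:

```python
import pandas as pd
from llama_index.query_engine import PandasQueryEngine

# A toy table standing in for one extracted from a complex document
df = pd.DataFrame(
    {
        "city": ["Toronto", "Tokyo", "Berlin"],
        "population": [2_930_000, 13_960_000, 3_645_000],
    }
)

# The LLM generates a pandas expression over df and runs it to get the value
query_engine = PandasQueryEngine(df=df, verbose=True)
response = query_engine.query("What is the population of Tokyo?")
```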
We can then provide a description of each table, either written by ourselves or generated by the LLM, and create a series of what are called index nodes. These are like the regular nodes you saw me passing into the vector index earlier, but the retriever I'm about to introduce knows how to treat them specially. Each index node contains an index into a dictionary we're creating here, and the dictionary consists of instances of PandasQueryEngine, one for each table in the document. For completeness I'm showing the intermediate step, which is parsing the rest of the document, from which you can assume we have already extracted the tables. We create a vector retriever that operates over all of our nodes, both the regular nodes and the index nodes.

Now we set up a query engine just like we did earlier, a retriever and a synthesizer combined into a query engine, but this time the retriever we're using is a RecursiveRetriever. The recursive retriever expects a list of retrievers to use, so we give it the vector retriever we just created. It also knows that if it retrieves an index node, one of those things I was just talking about, instead of just returning that node it should treat it as a tool: look it up in that dictionary we created just now and perform a search against that tool to retrieve additional information. So if one of those nodes is a table, it will perform a pandas operation on that table to retrieve the exact result before passing it back to synthesis for further querying.

A very common advanced retrieval case is wanting to query a SQL database. There's really no end of useful use cases for SQL databases, so I won't belabor the point; let's break this down and show how it's done. Under the hood, LlamaIndex is using SQLAlchemy, which you'll be familiar with if you've ever done data work in Python. Here we just connect to a database and initialize LlamaIndex's own SQLDatabase class. The SQLDatabase is handled by the NLSQLTableQueryEngine, the natural-language SQL table query engine, another really catchy name from our team. This is another built-in from LlamaIndex, and what it does under the hood is pass the schema of the table as part of the query to the LLM and get it to generate SQL. This is obviously tricky stuff; not every LLM is up to generating correct SQL, but it works great on GPT-4 and models of that caliber. So if you know the name of your table in advance and you're sure the schema is going to fit into your prompt, this is an incredibly simple way to get that done. You can see how this would work well with the recursive retriever we were just talking about: you could create one of these for every single table, having split your document into SQL tables.

But what do you do if that isn't true? What if your table schemas don't fit into your prompt, or you don't know which table to query in advance? In that case you can use the SQLTableNodeMapping; I love pronouncing these names out loud. As we're doing here, you create an index over each of the tables and turn that into a retriever, which you then pass to the SQLTableRetrieverQueryEngine, yet another amazing built-in and yet another great name. This will search the index for the most relevant table and then pass it to the natural-language SQL table query engine to perform the same query as before. So this is very like the recursive retriever: it's routing the query to the appropriate tool.
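Here's a sketch of both flavors, the table-known-in-advance engine and the table-retrieving one. It assumes the pre-0.10 Python API, and the SQLite database, table name, and description are placeholders.

```python
from sqlalchemy import create_engine
from llama_index import SQLDatabase, VectorStoreIndex
from llama_index.indices.struct_store.sql_query import (
    NLSQLTableQueryEngine,
    SQLTableRetrieverQueryEngine,
)
from llama_index.objects import ObjectIndex, SQLTableNodeMapping, SQLTableSchema

engine = create_engine("sqlite:///city_stats.db")  # placeholder database
sql_database = SQLDatabase(engine)

# Option 1: you know the table, and its schema fits in the prompt
query_engine = NLSQLTableQueryEngine(sql_database=sql_database, tables=["city_stats"])

# Option 2: many tables or big schemas, so index the table schemas and retrieve
# the most relevant one before handing off to the text-to-SQL engine
table_node_mapping = SQLTableNodeMapping(sql_database)
table_schemas = [
    SQLTableSchema(table_name="city_stats", context_str="population per city"),
    # ...one entry per table; context_str is an optional hint about the table
]
obj_index = ObjectIndex.from_objects(table_schemas, table_node_mapping, VectorStoreIndex)
query_engine = SQLTableRetrieverQueryEngine(
    sql_database, obj_index.as_retriever(similarity_top_k=1)
)
response = query_engine.query("Which city has the highest population?")
```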
How does it know which table to query? It uses the LLM: it looks at the schema, it looks at table names, and it looks at column names. However, you can also give it hints by passing metadata about the tables to the query engine, and given what you know about your tables versus what the LLM can deduce, that's probably going to be very useful.

So that brings us to the final use case we're going to cover today, which is also the most complicated, because it's not just a retrieval strategy: it's an example of an agent, such as Jerry was talking about in his keynote this morning. Our favorite example of this, which Jerry also mentioned in his keynote, is SEC Insights (secinsights.ai). This was initially intended as a demonstration of what you could do with LlamaIndex; then we open-sourced it so that other people could build on it, and I'm told that in the latest Y Combinator batch there is a company whose entire business model is doing insights into financial documents, which is what this does. But for us it remains just a demo.

Our first step is to create a query engine for each of our data sources. Here I am, very stupidly, creating three nearly identical data sources that are just different years of the same data, but they could be entirely different query engines; they could use any of the querying strategies we talked about earlier. Now, because this is an agent, we define tools to give it that it can select from, as we did earlier with the sub-question engine, and we give the tools metadata so that the LLM can decide which of these tools is going to help answer the question best. Then we define our agent, which is again surprisingly simple: we give it a nice capable LLM like GPT-4, one that is capable of tool use, otherwise this whole thing doesn't work. Now the agent, when given a query, will enter a loop where it tries to select the best tool, runs queries over that tool, and continues doing that until it has an answer to the question. There's a sketch of this assembly below.

In this contrived example our query is very simple and our tools are very simple, but remember, you could have made any of the examples we talked about into one of the tools you're using here. The keyword here is composability: all of these query engines can be turned into tools that other query engines know how to use, until you have a fantastically complicated and fantastically capable RAG application.
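Here's a sketch of that agent assembly. The per-year indexes, tool names, and query are placeholders, and it assumes the pre-0.10 Python API; in SEC Insights the indexes would be built from real filings.

```python
from llama_index import Document, VectorStoreIndex
from llama_index.agent import OpenAIAgent
from llama_index.llms import OpenAI
from llama_index.tools import QueryEngineTool, ToolMetadata

# Placeholder: one index per fiscal year; each could use any strategy above
index_by_year = {
    year: VectorStoreIndex.from_documents([Document(text=f"{year} filings ...")])
    for year in (2020, 2021, 2022)
}

# One tool per data source, with metadata the LLM uses to pick between them
tools = [
    QueryEngineTool(
        query_engine=index_by_year[year].as_query_engine(),
        metadata=ToolMetadata(
            name=f"filings_{year}",
            description=f"Answers questions about the {year} financial filings",
        ),
    )
    for year in (2020, 2021, 2022)
]

# The agent loops: pick the best tool, query it, repeat until it has an answer.
# It needs a tool-capable LLM such as GPT-4.
agent = OpenAIAgent.from_tools(tools, llm=OpenAI(model="gpt-4"), verbose=True)
response = agent.chat("How did revenue change between 2020 and 2022?")
```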
So that's pretty much what we've covered today. We talked about what LlamaIndex is: an orchestration framework, but also a hub of tools and connectors, and finally a set of tools for getting into production. We covered the stages of RAG: ingestion, indexing, storing, and querying. We talked about LlamaIndex's support for pipelines and pipeline caching, how indexing works, the set of vector stores available to you and what features they support, and we briefly covered prompting and how you can customize your prompts. Then we went into our examples. We started with naive top-k retrieval, just basic RAG, and then we went into the sub-question query engine for more complex questions, small-to-big retrieval, metadata filtering, hybrid search, recursive retrieval, text-to-SQL, and multi-document agents.

I hope this was useful to you on your journey into AI. We are still really early; that is the thing that I keep learning. Everybody who is doing this has been doing it for a time that is measured in months, not years. So if this was all a little fast, or maybe a little overwhelming for you, don't sweat it. I hope it gives you a sense of the amazing things that you can do with LlamaIndex, and I look forward to seeing the amazing things that you build. Thank you very much for your time and your attention.