My name is Alan Ho. I lead our AI strategy as well as research at DataStax. Prior to this, I was at Google Research, leading product management for quantum computing, and I've also contributed a lot to the TensorFlow community, specifically on probabilistic AI. So it's a great time to be a developer these days. There are so many new tools that you can use; in fact, it's an overwhelming number of tools. LangChain alone has 500 software integrations. And with choice comes a problem, because you end up wondering: which projects or software should I pick for my stack? But that's actually not the right question to ask, because it's impossible to pick the winners and the losers. It's better to understand how the stack will evolve and to build a mental model by which you can choose these software components. So today I'm going to talk a little bit about how the stack will likely evolve. I'm going to talk about the latest AI advances, specifically Gemini, a very large language model that some of my ex-colleagues at Google created. Then I'm going to talk about the five major workflows for generative AI applications and the stack associated with those five workflows. I also have a paper here that we're publishing next Monday on our website; you can grab a copy, and it gives all the citations, papers, and resources related to this. And I want to say that this is really a culmination of conversations with internal employees, DataStax customers, our partners, investors, and analysts in the industry. So let's set some context with Gemini. Gemini is the latest large language model, and it was created by Google.
We don't know how many parameters it has, but it's rumored to have five times the number of parameters and 20 times the number of tokens of GPT-4. It apparently, allegedly, leveraged all the data in YouTube in order to build this model. And as a result, it has generated state-of-the-art benchmarks on many, many problems, from coding to math problems, et cetera. So you can see here Gemini Ultra compared to GPT-4: it's winning in all of them. The other very interesting thing about Gemini is that it's natively multimodal. When I say natively multimodal: there are a lot of machine learning systems that take images, translate them to text, and then run an LLM on that. Natively multimodal means that in its internal representation, it understands text, images, and audio, and it may even understand video. That's a little bit hard to know from the technical report, but at least images, sound, and text it understands natively. So a problem like this physics problem, it's able to comprehend, reason about, and give the right answer. And in fact, in this particular case, the student got their physics problem wrong. So this is a very, very powerful model. Given these advances, where are these AI systems likely to go? The way to think about the evolution of AI is to use the brain itself as a mental model. There are two very important aspects of the brain that are being mimicked by the artificial intelligence community: the separation of memory versus model, and the idea that an agent consists of multiple models communicating and working in tandem with each other. So let's talk about the separation of memory and models. There are two ways to represent knowledge in an AI system.
One is you represent the information explicitly as facts in a database; the other is you represent the information implicitly in the weights, biases, and connections of a neural network. Now, most of the energy that you've been hearing about from the hyperscalers, and OpenAI in particular, is focused on the large language model, which is number two. But you actually need both. You want both. It's how a system that represents information explicitly and a system that represents information implicitly work together that gives you the true power of AI, just like a human being. You might have heard of RAG, right? The idea is that before you send a query to an LLM, you query a database, then you send that information to the LLM and it gives you an answer. And there are other mechanisms: there's another technique called research and revise, where you get an answer from the LLM but then fact-check it against the facts in your database. So they work together. The second idea is that instead of one big model that does everything, these systems consist of multiple smaller models that interact with each other. In neuroscience this is called the action-perception loop, and it looks something like this. Information from the outside world is sent to a perception model: a smaller model that just focuses on understanding the current state of the world. The perception model passes the current state to what I call the actor model. The actor model creates plans: discrete steps of things to do to achieve a particular goal. The plans plus the current state are sent to what we call a world model, which simulates what the future will look like.
And then, based on the predicted future state, another model called the cost model (or, in AutoGPT language, the critic) will look at the plan and see whether or not it meets the objective that the system has been asked to achieve. If the cost is too high, it goes into a re-planning loop: it recreates the plan until either the cost is low enough that the AI system is ready to give an answer, or it exhausts the number of times it can try. At that point, once it has a plan, these AI systems take action in the world, namely through API calls. You've seen that in a lot of the presentations earlier: they talked about tools (in OpenAI they're also called functions); that's taking action in the world. So this is where AI agents are moving, and we're likely to see specialization in smaller models that can perform each of these tasks. Again, an example is SkyPoint, one of our production customers that Chekapur talked about, in healthcare. They have structured data living in Databricks (and by the way, Cassandra is an awesome database for storing structured data too), and they have unstructured data stored in Cassandra, leveraging vector search. When a query comes in, the first thing their agent system decides is whether the query concerns unstructured data, structured data, or a combination of both. If it's structured data, it goes through a multi-step process. For example, it would first realize that it needs to fetch the tables, say from Databricks' Unity Catalog, to get all the schemas out. So it's making API calls, taking action in the world, getting that information, and then re-planning and figuring out whether those are the right tables to use.
Then, to generate the actual SQL query, once it has identified the right tables, it might use techniques like few-shot prompting to generate the SQL statement. And then it may reflect on the generated SQL statement to see if it actually matches the original query. So these are fairly sophisticated AI agent interactions. The same goes for unstructured data, and there's another talk tomorrow that goes a little deeper into how you take a single query and break it up into multiple queries. What I'm trying to illustrate here is that these advanced RAG techniques, the future of AI, become much more interesting and feasible with orchestration under the hood. So what we're seeing is that we're going to create a new AI stack: orchestration, with tools like LangChain and LlamaIndex; agent hosting, like Vercel or AWS Lambda; monitoring, for example Weights & Biases; and then all your training systems and your LLMs, like OpenAI, et cetera. But I think the more interesting area is actually the data infrastructure, and I'll talk about that in more detail in a few minutes. It's also important to think about the major workflows that you're going to be going through. The five major workflows for generative AI are: model training and fine-tuning; data processing and embeddings; constructing the actual prompt; executing the prompt; and, in a world where we have multiple agents working together, multi-agent orchestration. I'll go through some of the tools that are used for each and how I see them evolving. So the first is model training and fine-tuning. These are inherently batch processes, so you need batch data pipelines and a data lake to execute against.
Traditionally, that means Spark, both for the data lake side and for batch data pipeline processing. But we're seeing a lot of new kinds of pipelines popping up: unstructured.io, and LlamaIndex is doing a lot of interesting batch work. LlamaIndex combined with Ray is another very interesting platform to try out for batch processing. A lot of people ask me: should we do RAG, or should we do fine-tuning? And the answer is you want to do both. The way to think about it is: as your application interacts with the user, it's building up training data. While it's building up that training data and has very few examples, you might want to put them into a knowledge base explicitly and do something like RAG. But as the number of examples grows over time, you're going to have enough to fine-tune a neural network so it can perform something very similar. It's just like human beings, right? We collect information as we interact with the world, then we sleep on it, and our brain gets rewired based on the examples and interactions we had over the last day. It's very, very similar to how humans operate. Before I go into the other workflows, I want to introduce a new concept called the system of context. In the web world, we all know about content management systems: you've got all your web page content, you've got page templates, you have this whole complicated pipeline for getting raw content into web pages and publishing them on the internet. LLMs and generative AI have something very similar, and what powers an LLM is having the right context. What I mean is that, over time, there are going to be new primitives very similar to those in the web world, and I'll talk through them. One primitive will be enterprise context.
This is all your PDFs and documents, as well as all the structured data in your data warehouse: things about the enterprise. It's fairly obvious that you can do RAG on that. But the other abstractions that have yet to be formalized are things like the chat history, or agent memory. Whenever you're interacting with these LLMs, the system itself is generating new information, and you need that information in order to put it into the context that you send to the LLM. Then there are prompt templates; we'll talk a little about prompting, but these templates can get very sophisticated so that you can make the LLM do what you want. And then caching: API calls to these LLMs are very costly, so how can you use caching? All these new abstractions under the hood are powered, in some way or form, by vector databases. Ideally, a vector database built on top of Cassandra. All right, let me talk about data processing. For data processing, there are these kinds of primitives under the hood. For processing data and streaming, you might use LangStream, which is an open source project from us; you might use Spark; you might use Ray. There's basically a lot of pipeline software. You have a vector store, you have embeddings, and you have a data lake as well. So what does it look like at a high level? I'm not going to talk about structured data pre-processing; that's pretty well understood. But for unstructured data, the typical process is that you take a document, chunk it into smaller documents, perform what we call an embedding, and store that information in the vector database. What an embedding does is take raw text and put it through a machine learning model to create a vector, something like a 1,500-dimension vector. This is what allows you to do what we call similarity search, and do it at scale.
And actually, this is an area that's going to evolve quite a bit, because with multimodal you have pictures and you have text. A lot of the information in, say, documents actually comes in pairs: if you have a picture, there's usually a caption with it. So the ability to represent both the caption and the picture as a single embedding is going to become a very interesting area; actually, I think in the next quarter this is going to explode. All right, next is prompt construction. To construct a prompt, you need all these data abstractions: you need your vector store, your data lake, orchestration systems, and someplace to host the agent so that it can construct the prompt. The anatomy of a prompt looks something like this. Say you have a query like: how do I generate a token on Cassandra? To create a prompt that's relevant to the user, the first thing you would have is what we call directives. For example: "you are an automated chatbot." This portion constrains what the LLM will give in terms of answers, so it helps improve performance. The second set of data is what we call structured data. For example, the user's name, their language, the activities the person performed in the last five months; you want all that data in your prompt as well. Then you have your unstructured data. For example, you could be pulling in conversations with previous customer support engineers, or previous interactions between other users and the agent on a similar question: pulling in relevant data from other conversations. This is all necessary to create really contextually relevant prompts. Now, this is where it gets really interesting and very subtle. I'm showing you the benchmark results from Gemini.
One of the things you'll see in all the Gemini results is that they talk about few-shot or chain-of-thought. So what is this? It turns out that these models only operate well if the prompt contains examples similar to the question you're asking. Let me show you what an example looks like. If you ask GPT-4 a math question like this one, the answer will be wrong. However, if you provide a question similar to the one being asked (for example, "Roger has five tennis balls," and so on) together with an explanation of how the system arrived at the answer, then not only will the LLM give the right answer, it will also give an explanation for the answer. This is really important for things like medicine or finance. If somebody tells you to take an aspirin, you want to know why you should take an aspirin and not a Tylenol. So it's very important. And all these results here: if you don't have examples in your prompt, you're never going to get them. Now you can ask, why didn't they show the performance numbers with no examples, a.k.a. zero-shot? Because the zero-shot performance sucks, right? It sucks. So you absolutely have to do something like RAG to make these systems work. The only part that doesn't suck is coding, and I can get into a whole other reason why that works, but if you don't have some RAG-oriented technique for grabbing examples, it's going to be problematic. The other subtle part about all this benchmarking is that the examples used in the prompts were handcrafted by human beings. So a lot of the hard thinking that people attribute to the LLM? No, no, it's actually human beings doing it.
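Mechanically, the few-shot chain-of-thought idea above is just careful prompt assembly. Here is a minimal sketch, with the Roger example paraphrased from the talk (the exact wording is illustrative):

```python
def build_cot_prompt(question, examples):
    """Assemble a few-shot chain-of-thought prompt: each shot pairs a similar
    question with a worked explanation, then the real question goes last."""
    parts = [
        f"Q: {ex['question']}\nA: {ex['reasoning']} The answer is {ex['answer']}."
        for ex in examples
    ]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

# The Roger example, paraphrased from the talk; the wording is illustrative.
examples = [{
    "question": "Roger has 5 tennis balls. He buys 2 cans with 3 balls each. "
                "How many tennis balls does he have now?",
    "reasoning": "Roger started with 5 balls. 2 cans of 3 balls is 6 balls. "
                 "5 + 6 = 11.",
    "answer": "11",
}]
prompt = build_cot_prompt(
    "The cafeteria had 23 apples. They used 20 and bought 6 more. How many now?",
    examples)
```

A real system would then send `prompt` to the model; the worked example both steers the model toward the right answer and elicits an explanation alongside it.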
We don't have the luxury, when we build a production system, of handcrafting examples and putting them into prompts. You have to be smarter than that. So where can these few-shot or chain-of-thought examples come from? One mechanism is to get them from a vector database: this particular chain of thought is semantically similar to the question being asked, so you can pull it from the vector database. But you can go beyond that. Say the answer was correct, and the user gave a thumbs up: "the answer you gave me was correct." What can you do? You can take that interaction with the user as a signal to put a new example back into the vector database. What this allows you to do is essentially few-shot bootstrap learning. This is something you're going to be hearing about over the next little while: systems that use a vector database to get examples, use them as few-shot prompts, get an answer, leverage orchestration systems to figure out whether the answer was right or wrong, and repopulate the few-shot examples in the vector database. This is something you should be looking out for. An example of this is the SkyPoint use case I talked about, where they actually had a lot of few-shot prompting for generating SQL statements: taking natural language and turning it into SQL. In particular, when the SQL statements required joins or group-bys across tables, this dynamic few-shot prompting drastically increased the accuracy of the LLM application. All right, the next workflow is prompt execution. It's not just about executing the LLM model; as you heard from the Weights & Biases folks, you also need to make sure that the answer is correct.
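The retrieve-and-write-back loop just described might look like the following sketch. `ExampleStore` is a hypothetical stand-in for a vector database (it ranks by string similarity rather than embeddings), and `call_llm` is a stub for the real model call:

```python
import difflib

class ExampleStore:
    """Stand-in for a vector database of worked examples; a real system
    would rank by embedding similarity, not string similarity."""
    def __init__(self):
        self.examples = []

    def add(self, question, answer):
        self.examples.append((question, answer))

    def most_similar(self, question, k=2):
        return sorted(
            self.examples,
            key=lambda ex: difflib.SequenceMatcher(None, question, ex[0]).ratio(),
            reverse=True,
        )[:k]

def answer_with_dynamic_few_shot(question, store, call_llm):
    """Build a prompt from the most similar stored examples, then call the model."""
    shots = store.most_similar(question)
    prompt = "".join(f"Q: {q}\nA: {a}\n\n" for q, a in shots) + f"Q: {question}\nA:"
    return prompt, call_llm(prompt)

def record_feedback(store, question, answer, thumbs_up):
    """Only user-confirmed answers get written back as future few-shot examples."""
    if thumbs_up:
        store.add(question, answer)

store = ExampleStore()
store.add("How many tables are in the sales schema?",
          "SELECT count(*) FROM information_schema.tables WHERE table_schema = 'sales'")
prompt, answer = answer_with_dynamic_few_shot(
    "How many tables are in the billing schema?", store,
    call_llm=lambda p: "stub answer")
record_feedback(store, "How many tables are in the billing schema?", answer,
                thumbs_up=True)
```

The feedback gate is the important part: it is what turns ordinary usage into a growing pool of examples, which is the bootstrap loop described above.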
You need to take user feedback, you've got to look for drift, et cetera. So I think LLM monitoring and LLMOps are going to become a very important part of this. The other important part about execution is that you need the history of the chat interactions, as well as caching, from a cost perspective. And this is where it starts to get interesting, because these are a little bit of an afterthought right now. The reason this is important for the Cassandra community is that one of the major use cases of Cassandra is caching predictions from AI models. I'm not just talking about generative AI; in general, a lot of caching is done with Cassandra. The reason this becomes very important for LLMs is that Cassandra's ability to write immediately and index immediately gives you cache coherence. That's super important for an end application, because if you don't have cache coherence, you're going to get inconsistent answers back from your LLM once you put a cache in front of it. The second area is chat history, or agent memory. The amount of data that's going to come out of these LLMs, all the thoughts these agents generate, is going to be crazy, and you need to be able to store all of it. When you're building your prompt, you can't just use the last 50 interactions, because the LLM will be forgetful of everything before that; but you also can't put in all the interactions, because then you use up the LLM's context window. So you need to be smart about using the query to populate the history, from, say, a vector database, into the prompt that gets sent to the LLM. And again, if you don't have coherence, that is, if the indexing is slow or inconsistent, then your chatbot starts seeming forgetful.
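Here is a toy sketch of both ideas: a coherent cache sitting in front of the LLM, and picking history turns by relevance to the query rather than pure recency. All the names are hypothetical; in production, both the cache and the history store would live in something like Cassandra, with vector similarity doing the history selection:

```python
class PromptCache:
    """Exact-match cache in front of the LLM. Coherence here means a write
    is visible to the very next read, so a repeated question never gets a
    stale or inconsistent answer."""
    def __init__(self):
        self._store = {}

    def get_or_call(self, prompt, call_llm):
        if prompt not in self._store:
            self._store[prompt] = call_llm(prompt)
        return self._store[prompt]

def relevant_history(history, query, budget=3):
    """Pick the history turns sharing the most words with the query, rather
    than blindly taking the last N turns. A real system would rank by vector
    similarity and respect a token budget, not a turn count."""
    q = set(query.lower().split())
    return sorted(history,
                  key=lambda turn: len(q & set(turn.lower().split())),
                  reverse=True)[:budget]

calls = []
def fake_llm(prompt):
    calls.append(prompt)
    return f"answer #{len(calls)}"

cache = PromptCache()
a1 = cache.get_or_call("how do I create a keyspace?", fake_llm)
a2 = cache.get_or_call("how do I create a keyspace?", fake_llm)  # served from cache

turns = ["we talked about Spark pipelines",
         "the weather is nice today",
         "keyspace replication settings"]
sel = relevant_history(turns, "how do I change keyspace replication", budget=1)
```

The second call never reaches `fake_llm`, which is the cost saving; the history selector pulls in the one past turn that actually matters to the new question.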
Again, this is another aspect of where we think the stack is going to change. All right, the last workflow is what I'll call multi-agent orchestration. We already talked a little about multi-model orchestration: you have things like LangChain talking to multiple models. But what about multiple agents talking together? An example of multi-agent is multiple agent personas. Typically, when you construct a chatbot, you give it a single directive: you are a support chatbot, or you are an artist, or you are a director. But there are cases where, just like in real life, you don't want to ask one person with one persona; you want to ask multiple people. A prompting technique called solo performance prompting lets you create multiple agents with different personas, just through the agent directive, ask a single question of all of them, and summarize the answers from multiple domain experts. This is very useful, especially if you're building a chatbot designed for more open-ended questions across multiple domains. So I think this is going to become a very interesting area. All right, I didn't want to pick winners here, but in 2024 there's going to be a battle, and the battle is for the orchestration layer. Really, the battle is LangChain and LlamaIndex versus the hyperscalers. Why is this the battle for 2024? Number one, as I showed you, orchestration sits at the top of the stack. Everybody wants to own the top of the stack: if you own the top of the stack, you can help dictate what's in the bottom of the stack; you have much stronger influence on it. Nobody wants to become a commodity.
So when you have something like Vertex AI, where literally billions of dollars have been spent (these are probably the biggest engineering projects in computer science in the last four or five years), you don't want to be commoditized; you want to have control. Vertex Extensions is Google's version of orchestration. AWS has Knowledge Bases, which is another mechanism for orchestration. OpenAI just released an Assistants API that also does orchestration. And then you have all the open source projects doing it too: it's literally LlamaIndex and LangChain; those are the two big projects. As a database vendor, we're trying to integrate against all of them, obviously. We also recently released RAGStack. RAGStack is a curated version of LangChain. People love LangChain, and people also hate LangChain. The reason is that it gives you choice: 500 software integrations, plus 20 research papers implemented. The problem is also choice, right? Most organizations don't need 500. So we're working with the LangChain team to figure out, for DataStax customers and Cassandra users, which are the best RAG techniques to use and the best software components to use with Cassandra for building generative AI applications. We've also thrown in CassIO, which is our data abstractions (that's why I talked about the system of context; it's our implementation of a lot of those abstractions), and LangStream, which is a real-time data pipeline for generating embeddings, things of that sort. But this battle is much bigger. This battle is actually about open versus proprietary. And I think as a community we have to really think about what the world will be like if the orchestration engines are owned by big tech or proprietary groups, versus the orchestration engines that get popularized being open.
My personal opinion is that open is better. And I think it's the responsibility of everybody in this room, if you believe that too, to support people like Jerry at LlamaIndex, who presented here; to support people like Harrison; and to support all the companies building these open source orchestration systems. If you like this, there are two more talks: one on CassIO tomorrow, and then SkyPoint is going to dive a little deeper into what these more complicated agent systems look like in production in the cloud. So, in conclusion: generative AI agents will come to resemble the human brain, and how humans interact, over the next few years. Generative AI has at least five major workflows and a whole new stack to support those workflows. And the battle for the gen AI stack has only just begun. Thank you. Just a little plug: if you want more of the references and the papers, I have a white paper here that's going to be published to the website, so feel free to grab a copy. I've got some time for questions. So you're asking about few-shot prompting, right? Okay, yeah, so this is a whole new area called dynamic chain-of-thought prompting, or dynamic few-shot prompting. I think what we're going to start seeing more broadly is orchestration engines calling out to other systems for planning and reasoning. And by the way, if there aren't enough copies here, come by our booth; we have more papers there. This is actually more popular than I thought it would be. One more thing: if you don't want to come by our booth, it would be a really big help if you take a picture of this QR code, and we'll send you an email with the paper as well. All right, so, to continue answering your question.
So there are multiple ways to do dynamic prompting. One mechanism I see is that people are going to use probabilistic reasoning: they'll pick and choose the examples leveraging things like Bayesian learning, combinatorial optimization, et cetera. And then there's a whole other set of ideas around using planning engines as well. When I talked about this model here, with a cost model and an actor model, those can actually invoke other types of AI systems, such as a planning engine, to come up with a set of actions. In fact, there are a lot of rumors about this thing called Q-Star that's been making the rounds on the internet, talking about exactly that, because we've actually hit a limit. This is a very interesting point. If you look at these numbers, there's a benchmark called BIG-Bench Hard; these are reasoning benchmarks. And despite having five times the number of parameters and maybe 20 times the number of tokens, the performance of Gemini is not much better than GPT-4. What that means is that the transformer architecture is likely a dead end for reasoning-type problems, and the hard problem is actually going to be picking the examples that you use in chain-of-thought prompting. So we're going to see a very interesting evolution in AI architectures. Quick question: you mentioned reasoning. Is ReAct one of those alternatives? So ReAct is one of the patterns for that. But even in the ReAct pattern, you have to choose which reasons to put in. And your architecture is very similar to the one that Yann LeCun proposed. The world model: is that a similar concept? That's correct. I'm standing on the shoulders of giants here.
By the way, this model over here is inspired by the human brain. Yann LeCun's model of how he sees autonomous AI agents is very much inspired by psychology and the neurosciences, and that's where it's coming from. So I'm wondering where DataStax fits among the models you've talked about today. Is it the short-term memory, only that part? Yeah, so DataStax is the short-term memory, and the orchestration is LlamaIndex and LangChain. So where does reasoning play a role in this picture? The reasoning portion is actually in the cost model and the actor: if the world model generates a future state, you need reasoning to figure out whether that state is good or not, and the actor coming up with plans also requires reasoning. I could talk to you more about this; it's a multi-day conversation, how we make AI agents reason. Okay, thank you very much. Rahul? Hey, thank you for a very succinct talk. I know it was 40 minutes, but I think it was the best explanation for people of what this is all about, and I really appreciate that. You mentioned Q-Star, which is basically a simplistic mathematical engine that may be a contributing factor to a human-level AGI. So I'm wondering what your thoughts are on the projected release of a human-level AGI in the next 12 months, and how that changes what you're talking about here. Yeah, so I wouldn't say 12 months, but I'm a personal believer that there's a real possibility we can get human-level AGI in the next five years. This was actually the last project I was working on at Google Research, so this is not new stuff, but we'll have to see. I think there is a real possibility. It was really interesting to see the idea of giving it examples.
And my first thought was: well, what if you had a vector database with lots of examples? You showed that diagram with the call out to a vector database for that. But then I thought I heard you say later that asking the vector database for examples and just using those maybe doesn't work as well. Well, it depends, right? There are pros and cons to this. What happens if the few-shot examples you have aren't related to the question, but you use them in your prompt anyway? It actually makes things worse. For example, if this particular question came back with an example that had nothing to do with math, it would actually confuse the LLM; it would make the answer even worse. And worse yet, it could create bias. For example, in that loop where you're using the answers: if the user says, "hey, yeah, that's a good answer," but it was actually a wrong answer, and you throw it back into your vector database, then you've biased the system. That's just like a human being, right? You bias the entire system through these kinds of reinforcing loops. And that's why I say fine-tuning is important. Just like a human being, you don't want to make your decisions based on one or two data points or a few examples in your life; you want to base them on a large set of examples. So we're going to start seeing the concept of fine-tuning being integrated with few-shot prompting from a vector database over time. Guys, don't ask physics majors about humanities questions, and vice versa. Yeah, exactly. Don't ask MongoDB developers how to do tunable consistency on Cassandra. All right, so, okay. What is it, web scale? Yeah, is it web scale? That's the question. Don't ask MongoDB. Oh yeah, go on.
Any other questions? All right, well, thank you very much and come by our booth. Thank you.