Let's just get started, here's the agenda. This is mostly an introduction to vector databases. How many of you have already used a vector database? Great, that's exactly what I expected. Those two that have used them, you can check your email. Everybody else, I expect eyes up. Oh, she's in trouble. Okay, good. Next is how they're different from a relational database, because that's what most of us are familiar with. Then where do you use them, what the use cases are, and what that means for you. And then I want to make you the life of the party, because now you can talk AI at the party, and vector databases are all about AI. I tell you, you repeat this stuff, people flock.

So this is how we were, say, a year, a year and a half ago. I'm also showing my age. How many people recognize this episode? If you haven't seen it, it's a classic: "The Trouble with Tribbles" from the original Star Trek, and you need to watch it. We started like this with vector databases, and lately it seems like there's a new vector database or vector capability coming out every single day. Everybody, you get a vector database! You get a vector database! Everybody's got vector databases now, and I want to get you level set on what they are.

So what is a vector database? A data store that works with vectors. I'm done. No, let's talk about what vectors are. You'll also sometimes hear these called embeddings, and I'll tell you why in a bit. The whole idea here is that we're turning things into numbers. Vector databases primarily work with unstructured data. You can use them with some structured data, but the main use case right now is unstructured data, which is challenging for computers. The reason is that computers can do numbers really easily. Is two less than three? Yes. Is this kitty less than that other kitty? The computer can't say anything, because it doesn't know. The only thing it can check is exact equality: is this image the exact same image? Anything other than an exact match is very hard for a computer to decide, because an image isn't a native data type it can reason about. So unstructured data here means pictures, long bodies of text (not single lines of text), audio files, basically anything big and unstructured.

So, neural networks to the rescue. Everybody knows roughly what a neural network is? It's the way you take data, do stuff with it, and stuff gets spit out the other side. It's like a regression model, except this one is usually more about prediction. What happens here is you take your kitty picture and feed it through the neural network, and these networks are optimized to find semantic meaning in whatever you feed into them. Like when you hear about however many billion parameters are in the OpenAI models. Does everybody remember regression? Yeah? No? Okay, so remember you had, like, three things, and that was three parameters? This has billions of those. So basically it's a very, very advanced curve-fitting algorithm, dividing up the space as best it can using billions of parameters. And what comes out the other end is your vector, a vector representation of that kitty. Then there's the length of the vector: the image we fed in is not that big, but still big, and we could end up with only a 512-length vector, meaning just 512 numbers to describe that kitty picture, packed to get the most information out of it. No cats were harmed, just letting you know.
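If you want to see what that looks like in code, here's a minimal sketch using the sentence-transformers library with a pretrained CLIP model. The model name is real, but the image path is hypothetical, and this is an illustration rather than the exact setup from the talk.

```python
# Minimal sketch: turn an image into an embedding vector.
# Assumes the sentence-transformers and Pillow packages are installed;
# the image file "kitty.jpg" is hypothetical.
from PIL import Image
from sentence_transformers import SentenceTransformer

# A pretrained CLIP model whose image embeddings are 512 numbers long.
model = SentenceTransformer("clip-ViT-B-32")

kitty = Image.open("kitty.jpg")
vector = model.encode(kitty)

print(vector.shape)  # (512,): 512 floats describing the kitty picture
```

The same model can embed text into that same 512-dimensional space, which is what makes searching images with a text query possible.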
I'm going to go into a little pre-step now. How many of you have heard of tokens and context length and thought, what is that? These matter for API costs and for context length. API cost: when you submit something to OpenAI, they charge you by the number of tokens you put in. Context length is the number of tokens the model can take in at once, which determines how much context it has for the words around each word. The longer the context, the more words it can relate together to build up meaning.

This is tokens, right here, from the OpenAI site. I put in "Data on Kubernetes, people love Transformers," and the colors below show how it broke that into tokens. So if you submitted that to OpenAI, it would charge you six tokens. Make sense? And when they talk about context length, they're talking about a length measured in tokens. Everybody good now on tokens?
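If you want to count tokens yourself, here's a tiny sketch using OpenAI's tiktoken library. The model name is just an example; different models use different tokenizers, so your count may differ from the website demo.

```python
# Count tokens the way OpenAI bills them, using the tiktoken library.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")  # example model
tokens = enc.encode("Data on Kubernetes, people love Transformers.")

print(tokens)       # the integer token ids the model actually sees
print(len(tokens))  # the number of tokens you'd be charged for
```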
The next thing is embeddings, that vector that came out. There are more and more embedding models available to use. You'll see press releases; I think the latest was Jina, with an 8K context embedding model or something like that, another one of the big ones. These are the models you hear everybody talking about. The ones we care about today are neural networks that have been pre-trained on large data sets. A model by default doesn't have meaningful weights on its parameters. The ones everybody is so excited about have been trained on a large corpus of text, or lots of images, so all the weights have already been calculated. If you feed in the picture of the kitty, it's going to say, oh, that's a cat, because it's been trained on tons of cats. Most of you really care about the ones that have already been pre-trained. Remember that.

There are several things to consider when you're picking an embedding model. One is appropriateness for the task. I'm not going to go deep into these, because remember, we're keeping this speedy, but come back and think about this. Another is the size of the input, which is preset by the model. For example, the CLIP model only takes a small fixed image size, something like 224 by 224 pixels, and you have to scale everything to that size to feed it in. Then there's the length of the output vector. Even within one model there can be multiple output lengths, because length influences accuracy, but it also influences speed of computation: the shorter the output vector, the faster it computes. That's why you can have ultra-small models that run on your phone.

And this is the place I just want to show you: the GitHub of the AI/ML community, Hugging Face. These are all models. The ones we're primarily interested in (oh yeah, this is the new one that just came out) are the feature-extraction models: we're extracting features out of the data. Some of these other ones I showed, like visual question answering, might just give you the answer, not the vector embedding. Those are models that say: you put in this text, I'm going to give you text out. We don't want those if you're going to use a vector database; we want the ones that give us a vector. Make sense? Yeah, okay. Perfectly clear, nobody has any questions, great. In case you couldn't tell, I'm a really informal speaker, so if you have a question, please raise your hand and ask, because if you don't understand something, somebody else in the audience doesn't understand it either, and I want everybody along on the journey.

So let's go back to the talk. Now that we have our embeddings, what do we do with them? We put them into vector space. That 512-length vector lives in, what do we call it, 512-D space: 512-dimensional space. You take that vector and you plot it in 512-dimensional space. Then you take the next picture of a kitty and plot it in the same space. Imagine, with your mind's eye, that you're looking at a 512-dimensional space, though here we're looking at two. What happens is the kitties end up next to the kitties in that vector space, if the model did its job, and the dogs end up far away. Similar things should cluster together in space. Does that make sense?

Right, so the question was: do they cluster along all dimensions, or along certain dimensions? We don't know, because I can't picture a 512-dimensional space in my head. The idea, though, is that when you plot those 512 coordinates, similar pictures end up close to each other, just like points on a map. Similar things end up together. If you think about clustering problems, that's definitely one of the use cases for vectors. Does everybody know what KNN is? K-nearest neighbors, a nearest-neighbor clustering algorithm. When I first came to this space I thought, oh, I've been doing this all along, this is principal components: I take a thing and turn it into a vector. But it's more complicated than that; there's a fancy model doing all of it, and it scales way better. Any other questions?

Okay, so now we've got our things plotted in space, and you say, great, I've got a database full of vectors. Now what do I do? How do I work with this stuff? Here, we're going to ask the database to show us similar pictures. All the vectors have been put in, and now there's a new cat. The first step is to turn that new cat into a vector in the same 512-dimensional space. You have to take that image, run it through the exact same model, and put it into the exact same coordinate system. Does it make sense why? Right, because otherwise you can't compare them. So you make that vector, send it to the vector database, and say, hey, find me things that are close by in space. The new picture is the green dot; can everybody see the green dot? The database says, okay, I know the things that are close, and I'll return them in distance order. So the closest kitty comes first, then the next cat, and then the dog, because the dog is really far away. Typically when you query a vector database you use limit statements a lot, because it could return a distance for everything; by default it's usually limited to something like 10, 15, or 20, and it's up to you how many you want back. Does that make sense how the query works?
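Under the hood, the core of that query is a distance computation plus a sort. Here's a minimal sketch in plain NumPy with made-up vectors; a real vector database uses an index instead of this brute-force scan, which is the next topic.

```python
# Brute-force similarity search sketch. The stored vectors and the
# query are random stand-ins for real embeddings.
import numpy as np

database = np.random.rand(1000, 512)  # 1,000 stored 512-d vectors
query = np.random.rand(512)           # the new kitty, vectorized

# Cosine similarity between the query and every stored vector.
sims = (database @ query) / (
    np.linalg.norm(database, axis=1) * np.linalg.norm(query)
)

# "Find me the closest things, in distance order," with a limit of 10.
top10 = np.argsort(-sims)[:10]
print(top10)        # ids of the ten nearest neighbors, closest first
print(sims[top10])  # their similarity scores, highest first
```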
Okay, and this is why small models matter a lot. If you're vectorizing something on your phone, it doesn't have as much compute as a supercomputer, so you want a small model on your phone, and that same small model creating the vectors in your database, so you can compare them.

Can someone repeat the question, because I couldn't hear? What a great question! The question was: how is the data stored in the database, and how is it indexed so it can be queried efficiently? Great question. Let's talk about HNSW, which stands for Hierarchical Navigable Small World. Say that ten times fast. This is the predominant algorithm, or index type, used in most vector databases.

We're going to start at the top. Imagine our 512-dimensional space, but here we just have two dimensions, and I've put in that gray dot as the query point. The index is built as nodes in a graph: who's next to who, what's the relationship graph. Let me see if I can get the mouse moving; it starts here. The top layer is a very sparse network. The first step is: start at some place in that coordinate space, in that very sparse representation of the network, and find the node closest to my query point. Does that make sense? What it's trying to do is find the answer neighborhood very quickly, because if it walked the whole dense graph to get to the answer, it would take forever. So it starts with that sparse layer, finds the closest of the nodes there, and drops down to the next level, where the network is denser. There it does the same thing again: in this denser layer, what's my closest neighbor to the query point? Make sense? And when it gets down to the final, densest level, it says: here are the neighbors, rank them, send them back. There are way more than three levels in practice, but that's the general algorithm for finding where it should end up. It walks the sparse graph until it gets very close, and when it can't find anything closer, it stops.

Yeah, it's its own thing. I think Neo4j has added vectors; I don't know how they're implementing it under the hood, but there is that connectivity graph. And it's way more complicated than this; this is just to give you the idea. You can also filter this graph using attributes, metadata. You can say, I only want the male cats, and it'll only walk the graph toward male cats.

Yeah? The question was: in a 512-dimensional space, does it have to walk through all 512 dimensions to get there? No, there's no correspondence between the number of layers and the number of coordinates. Okay? Yes, question: is it clustering at different layers of granularity? The layers are generated on the fly, as the index is built.
Does that make sense? It picks a starting place every single time and then starts walking the graph, because you don't know where the answer is in this huge space; I'm handing you a point and saying, find the ones closest to it. You can think of it like, does everybody know gradient descent? Or hill climbing, or any of that stuff? Simulated annealing? It's somewhat similar to that idea: start somewhere, and use an algorithm that keeps moving you closer and closer to the point you want to end up at. But here it goes sparse network, less sparse, less sparse, less sparse. It keeps walking without ever searching the whole space. The other thing is, when it gets down to the dense network, it doesn't have to pull in the entire dense network. It has already walked itself into a much smaller neighborhood. If you said you only want 10 back, maybe it pulls the 100 nodes around that last point and asks, which are closest? Any other questions?

No fair. He's written a ton of blog posts on vector databases, just a heads up, and they're his favorite thing. So heads up if he asks me a question I can't answer. "No, you are amazing, and the way you're explaining it is fantastic. The question I have is: why is the top layer sparse and the next layer more dense?" The reason it starts sparse is that we want to move quickly. If we had a very dense network at the top, we'd have to keep walking, walking, walking. So you start really sparse to get into the neighborhood. You can think of it like a plane flight. Say I want to go to, what's a city in Europe, Stuttgart. The first thing I do is pick the airport to fly into; that's the coarse network. I'm flying to Stuttgart, got that part. Now that I'm in Stuttgart, I need to find which part of Stuttgart my hotel is in; that's the next layer down. Then I have to find where on the street my hotel actually is. So you get finer and finer coordinates as you get closer and closer to the thing you want. We good? Wait, hold on, let me check timing. Oh, I've got seven and a half minutes. One question.

"So these layers, do they correspond to the layers of the neural network that gave us the vector?" No, these layers have no correspondence to the neural network. These layers are just in coordinate space, okay? You basically take your entire graph and sample it to get the sparse graph. Good?

And this is one of the key things: this is an approximate nearest neighbor algorithm. For the person who was asking about k-nearest neighbors earlier: this is not k-nearest neighbors. It's not exact, it's approximate. Run the same query twice and you should hopefully get the exact same answer, but there's no guarantee you will, because we're getting an approximate answer. You can usually tell the database you want the exact answer, but be ready to wait, because it has to do a lot more work. And in most cases, approximately the right nearest neighbors are the answer you wanted anyway. There are other techniques for this, but HNSW is the one I know.
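If you want to play with an HNSW index directly, the hnswlib library implements exactly this structure. Here's a minimal sketch with random data; the parameter values are illustrative, not tuned recommendations.

```python
# Build and query an HNSW index with the hnswlib library.
# The vectors are random stand-ins; M, ef_construction, and ef are
# example values that trade accuracy against speed and memory.
import hnswlib
import numpy as np

dim, n = 512, 10_000
vectors = np.float32(np.random.rand(n, dim))

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(vectors, np.arange(n))

index.set_ef(50)  # higher ef = closer to exact answers, slower queries
query = np.float32(np.random.rand(dim))
labels, distances = index.knn_query(query, k=10)  # approximate top 10
print(labels, distances)
```

Run the same query twice and you'll usually, but not always, get the same ten ids back; that's the approximate part.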
So what are they good for, meaning vector databases? They're good for questions about similarity. When we think of relational databases, we think of date equals this, age in this range: very exact queries where we can calculate exact things. That is not a good use case for a vector database. Vector databases are for asking, what's similar? And it turns out a lot of questions revolve around similarity. Like I just said, they're specialized for a particular use case. I see some people asking, should I put all my data in the vector database? No. Use them when you want similarity-type search and you can compute vectors over your data. They supplement your data infrastructure.

They also provide memory for your AI models. When you build those big models and want to do similarity search over their vectors, without a vector database you'd have to keep the model running all the time and hold all the points in memory alongside it, or recompute them every single time you query. It's way cheaper to keep the vectors on disk, or in the database's RAM. So they reduce the cost of running your AI infrastructure.

This is my favorite part. I got this one from my therapist. I'm a divorced man, and one of the things I learned from her, which I think you should use: forget everything else I told you, and if you remember just one thing, remember this. Clear boundaries, infinite possibilities. If you look at where most fighting happens, in an organization, between you and your partner, between you and your kids, it's over who owns the thing you're arguing about and who's responsible for it. If you set that up clearly, then it's not the other party's concern. They can say stuff, and you can say, that's great, but this is what we're doing, because it's my concern and I get to decide.

And this is where vector databases shine. How many of you are application developers? Yeah. And how many of you want to learn all that neural network stuff, versus just asking for a similarity search? A couple, right? (They're probably looking for a new job.) Most don't want to go that deep into neural network models; they just want better search. And data scientists: how many of you want your app developers building those models and creating those vectors? None, because that's how things go wrong. So the idea is this. The data science team, or whatever you call that group (remember the very first talk, where we saw that whole architecture with data feeding in and out of places), defines the outputs, and one of those outputs should be the vectors going into the vector database. They control everything before the data is put into the vector database, and they decide on the indexing. The developer doesn't have to care how the vectors were made. All they have to learn is the API to the vector database, make their queries, and get the results.
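To make that boundary concrete, here's roughly what the developer's side can look like, sketched with the qdrant-client library as one example. The host, collection name, and query vector are hypothetical; in real life the query vector would come from the embedding model the data team picked.

```python
# The developer's side of the boundary: just a client API call.
# The collection name is hypothetical, and the random query vector
# stands in for an embedding made with the data team's model.
import numpy as np
from qdrant_client import QdrantClient

client = QdrantClient("localhost", port=6333)

query_vector = np.random.rand(512).tolist()  # stand-in embedding

hits = client.search(
    collection_name="cat-pictures",  # hypothetical collection
    query_vector=query_vector,
    limit=10,                        # the usual limit statement
)
for hit in hits:
    print(hit.score, hit.payload)    # similarity score plus metadata
```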
So it's neutral ground, and they don't have to argue about how to implement things. The data science team can say, this is the model we used, use it in your runtime to make the embedding, or, we'll give you an API to do it on our end. So, clear boundaries, infinite possibilities. Everybody got it? Good.

Example use cases. Search: search is similarity, find me things like this. Clustering, like we talked about before: find other things in this area. Recommendations: you can feed in an item as a vector and get similar items back. Anomaly detection: if something isn't close enough to anything, you can call it an anomaly, and this is actually a common use case. Diversity measurement: how spread apart is everything? Classification, which is very common with images: what's in this image? Those are some of the use cases.

How many of you have heard or seen all this talk about retrieval augmented generation, RAG? Anybody? Okay, well, if you start reading the literature, you'll see it everywhere. It's the new hotness, even hotter than vector databases. The idea is that you have some sort of generative text model, like OpenAI's or any of the others: something you feed a question or a bunch of text into, and it spits text back out. And lies, probably. The assumption, and we'll use OpenAI in this example, is that OpenAI trained its generative model on a large amount of data. So it's very broad: it's the internet, even if some of that was copyrighted. And then you have your own vectors, for your own documentation, in your database. What you're trying to build is a better answer for people asking questions about your documentation. OpenAI alone isn't going to give a great answer, because it was trained on the whole broad space; your documentation has much better information, and you have vectors for it, but you don't want to retrain an entire model.

So what you do is this. The user gives you a query. You turn that into an embedding, same as before. You search your documentation with that embedding, finding the documents closest to the user's query, and get back however many of the closest documents you want. Then you add those documents as context, as augmentation, to the original query. Say someone asks, what is a VM? I'm at VMware, and our documentation has a lot more on that than the general internet. You'd pull back relevant passages and send OpenAI something like: "What is a VM? Here's some information for context," followed by our documentation, along with the original query. What this helps you avoid is fine-tuning. Some people will say, I'll take the OpenAI model and fine-tune it on our stuff; with RAG, you mostly don't have to. It's a competing approach: remember this, fine-tuning versus retrieval augmented generation. Too many acronyms. Any questions? Just let this one flow over you. I'll give you the URL for the slides at the end, but next time you see RAG, you'll go, oh, I know what that is, and I know where to look it up.
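Here's what that loop can look like in code, as a minimal sketch. It assumes the OpenAI Python client in its v1 style, the model names are examples, and search_docs() is a hypothetical stand-in for whatever query API your vector database exposes.

```python
# Retrieval augmented generation, end to end, as a sketch.
# Assumes the openai package (v1 client) and OPENAI_API_KEY set in
# the environment; model names are examples.
from openai import OpenAI

client = OpenAI()

def search_docs(vector, limit=3):
    """Hypothetical stand-in: query your vector database, return text."""
    raise NotImplementedError("wire this to your vector database")

question = "What is a VM?"

# 1. Turn the user's query into an embedding (same model as the docs).
emb = client.embeddings.create(
    model="text-embedding-3-small", input=question
).data[0].embedding

# 2. Find the documentation chunks closest to the query.
docs = search_docs(emb, limit=3)

# 3. Augment the original query with that context and generate.
prompt = (
    "Answer using this context:\n" + "\n".join(docs)
    + f"\n\nQuestion: {question}"
)
answer = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
print(answer.choices[0].message.content)
```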
There are really two types of architectures for vector databases right now. One is add-ons to existing databases, and in that case it's usually a new data type with new indices and functions. The other is single-purpose vector databases, and there are quite a few in that space. Those new single-purpose ones are generally not transactional; they're built on that whole consensus-based, eventually consistent, horizontally scaling foundation. These systems tend to scale the same way their base system scales: single-purpose ones are built to go horizontal, while some existing databases are built to go vertical, and that's how you'll scale them, even in your Kubernetes architecture. It might be easier for some of us to run a single-purpose one that knows how to scale horizontally, because you can spin up another pod and it'll automatically join the cluster.

So what does this mean for you? We're almost done. They tend to be horizontally sharded or distributed, so plan accordingly. When you're running out of memory for your index, it's much easier to add another instance and let it take over some of the index. There's a lot, and I mean a lot, of random reads: walking that graph and pulling back data that's scattered all over the place. So IOPS are crucial here. You are not doing network-attached storage. If you're getting instances, use locally attached NVMe, and set up high availability so that if one dies you can replicate it somewhere else. This is definitely where you want NVMe, or at least SSD. The indices are big, and you want them in RAM, because you want to walk that graph as fast as possible. So if I haven't said it before: you need fast disks and lots of RAM. What else is new for a database, right? But in this case it's really, really sensitive.

Your streaming ingestion pipeline is probably what's going to handle your embeddings. Most of the databases do not create the embeddings for you; you need some pipeline that pipes the embeddings in. There are libraries like LlamaIndex and LangChain that make it easier to create embeddings if you're newer to this. Take a look at them.

And on the contrary side (nice squeak, I'm going through puberty again), they reduce the overall data stored in the database, because turning something into a vector is actually a compression technique. I took that whole image and made it into 512 floating-point numbers, which is much smaller, so the data actually stored on disk is less than the original.
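You can put rough numbers on that compression claim, and on the RAM question that comes up in a minute, with back-of-envelope math. The figures below are illustrative; 1536 is the length of OpenAI's ada-002 embeddings.

```python
# Back-of-envelope sizing for vectors and indices. All inputs are
# illustrative; a real index adds graph links and metadata on top.
dim = 512
bytes_per_float = 4                    # float32
vector_bytes = dim * bytes_per_float   # 2,048 bytes, about 2 KB
print(f"one 512-d vector: {vector_bytes} bytes, vs a multi-MB image")

# RAM for the raw vectors of a million OpenAI embeddings.
n = 1_000_000
openai_dim = 1536                      # ada-002 embedding length
raw_bytes = n * openai_dim * bytes_per_float
print(f"{raw_bytes / 2**30:.1f} GiB of raw vectors")  # about 5.7 GiB
```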
And given the big new AI/ML push, these are definitely going to be part of your data infrastructure. There's no avoiding it. My hope for this talk is that when you go read about this stuff and start to learn more, you'll have some grounding: oh, that's what that means, that's what's happening here, now I get it. So to sum it up: in AI/ML, "vectors" refers to generated numerical representations of unstructured data. The vector encodes meaning into a multi-dimensional space. Vector databases let you store and query vectors, and they handle questions related to similarity. They're usually distributed. Hang on, it should be an interesting ride. Thanks, and enjoy the vectors. That is the URL for the talk.

Any more questions now? I don't know if we have time. How much time do we have? We're going into a break? You guys don't need a break. Okay, any questions? Or you can come talk to me afterwards. Yeah, question?

"Vector databases, when we query them, return a set of nearest vectors, right? So say we query with some picture of a cat. It gets converted into a vector, we query, and the output is, like, ten vectors. How do we get back from that vector space to the actual cat picture? Do we store references?" So the question is: once you've done that query, and you got back the ranks and the order, the cats that were close and the dog that was far away, how do you get the actual pictures back? You have to put something in the metadata that links the two together, like a foreign-key kind of idea. Usually, if I'm doing stuff with images, I'll put a URL to the image in the metadata. I won't index on it, because I'm not going to search on it, but I'll have it returned and then pull the image. For most of the vector databases I've seen, it works like this: if you use something like pgvector in Greenplum or Postgres, you probably already have nice structured data, and the vectors augment that structured data. The newer single-purpose ones usually take unstructured JSON as the metadata, and the vector is attached to that. That's usually called the payload and the vector: the payload is the metadata, it's unstructured JSON, and you can have at it.

"Can you go back to the slide where you said indices are big and should be in RAM? How big do these indexes get, and how much memory are we talking about?" You know what I'm going to say to that: it depends. This is another place where vector length matters. It depends on how many vectors you have and how long they are; the larger the vectors, the bigger the index. That's why, if you can get away with a 512-length vector and it's accurate enough, you pick that over a 4K one. There's always a period of experimentation with these, to find out what's good enough. Again, what did I say neural networks were? Fancy regression models. Do regression models make mistakes? Yes. By their very nature, statistical models make mistakes, so you're not going to get 100% accuracy. You have to play with the trade-offs: accuracy versus speed versus cost, there's a whole bunch of them. But from what I've seen, it's a lot of memory. I think I saw something like a million OpenAI embeddings taking about six gigs of RAM to start with, just for the index. So once you get into billions of vectors, or lots of databases, it adds up quickly.

Any other questions? I think we're out of time, and I want you all to get up and get drinks. Thanks, everyone. It was fun.