The Databases for Machine Learning and Machine Learning for Databases Seminar Series at Carnegie Mellon University is recorded in front of a live studio audience. Funding for this program is made possible by Google and by contributions from viewers like you. Thank you. Alright, thanks everyone for coming to another episode of the Database Seminar Series here at Carnegie Mellon University on ML for Databases and Databases for ML. We're very happy to have Etienne Dilocker from Weaviate to talk about their system and what they're working on. The goal here is for this to be an interactive process; we don't want Etienne talking to just himself. So if you've got a question, feel free to unmute, let us know your name, tell us where you're calling from, and fire away. And if you're not able to unmute, feel free to just drop it in the chat and we will read it out loud as well. And with that, sorry, I forgot to mention that Etienne is the co-founder and CTO of Weaviate. Thank you for being here, and the floor is yours.

Awesome, thanks for having me. So I want to tell you a bit more about Weaviate the product, not Weaviate the company, but maybe let's also start quickly with the company. We are a startup that's been around for about four and a half years now, building the open-source vector database Weaviate. The team is now around 50 people, so it has grown quite a bit recently, and we're doing all kinds of cool stuff that hopefully I can share with you today. Most of this is going to be a tech deep dive, so I've tried to make sure we don't have any marketing slides in there and really go into what I think are the cool concepts. I do have a small higher-level introduction in the beginning just to set the scene, because depending on your background you may not even know what a vector database is, for example. So I want to give you the motivation for why the world needs a vector database, but then dive right into the architecture, where we'll spend quite some time on HNSW, which is a specific vector index type. Then I'll also talk about product quantization, which is a super interesting topic in my opinion, about compressing vectors, or reducing the memory footprint needed for vectors. And I'll also talk a bit about multi-tenancy, which is something Weaviate has a unique solution for. The idea there is: if your user space needs to be separated into different user spaces, how do you organize that in a way that scales? That's where Weaviate can be quite helpful. So those are the three deep-dive topics.

Let's first start with why you would want a vector database at all. For this slide, I may or may not have created it while flying on a Boeing 787, and that motivated me to just go for the Wikipedia article; I did pay for the Wi-Fi, so I could pull it up. This is the first paragraph of what comes up if you look for the Boeing 787, and then on the right you have a couple of keywords. As a human being, you can probably immediately tell that these keywords are relevant to this text. We have terms like aircraft, airplane, then a different spelling, the British spelling, aeroplane, and also more abstract terms like Dreamliner, which is specific to this model, or 787, where at least the number appears in the text.
And it's easy for you as a human being to see that these are all related. But for a traditional keyword-based search system this would be pretty hard, because I think the only word that would actually match, other than Dreamliner, is airliner, but not aircraft, airplane, and so on. One way to overcome this is to not index keywords, but rather to index the meaning behind the keywords. That's a bit fuzzy at this point, but we can make it more specific by placing the meaning behind these words and sentences in a space. Imagining a 1500-dimensional vector space is super difficult, but we can picture it as just a two-dimensional space, and an example I love using for this is a supermarket, because in a supermarket you can intuitively navigate using the concept of similarity. If you walked into this fictional supermarket, which I gave the creative name Whole Floats, you would see that on the left you have non-food items and on the right you have food items. And, I should probably say this as well, if your goal is to find carrots in that supermarket, then you immediately know you don't have to look in the non-food section, so you can immediately discard half of the supermarket. If you go further in, you find, in the bottom right of that illustration, the produce section. It starts with fruit, and maybe you see apples there, and on the other end you see potatoes over in the vegetable section. Then you know that the potato is closer to the carrot you're looking for, so you keep going in that direction until you eventually find the carrot. That's an oversimplified way of overcoming the limitations we just talked about: rather than the supermarket being organized alphabetically, it is organized by these concepts, and that's the idea of the vector space. If you use a vector embedding model, it will take anything, text or any other modality, and place it in that multi-dimensional space, which I've super-simplified here with a 2D map.

That was the whole starting idea four years ago when we started Weaviate, but since then stuff has happened, most notably ChatGPT came out, and with ChatGPT there was all of a sudden a new use case but also a new problem, namely that these large language models tend to hallucinate. If you asked ChatGPT right now, does Weaviate make use of Elasticsearch, it would actually say yes. I tried this on ChatGPT, and that is a hallucinated answer. It gives you a very confident answer saying, yes, Weaviate uses Elasticsearch as its storage backend, blah, blah, blah, and that answer is wrong. If there's just one takeaway from this session: Weaviate does not actually use Elasticsearch in its architecture.
One thing you can do to fix this, as you can see from the screenshot, is tell ChatGPT: please try again, but this time base your answer on the following snippet that I copy-pasted from the Weaviate documentation. The snippet says that Weaviate uses a custom storage engine and so on, ChatGPT apologizes for the confusion, and in the end, if you read the summary, it says: in summary, Weaviate has its own custom storage engine optimized for vector embeddings, and it does not rely on Elasticsearch for storing and indexing. The good part is that we could get ChatGPT to provide the right answer. The bad part is that we basically had to tell it the answer ourselves; we had to say, here is the knowledge you need to answer the question. To remove the step of me manually copy-pasting the documentation, what we can do is something called retrieval-augmented generation, which is basically a fancy way of saying: retrieve the documents that contain the truth from some kind of knowledge base and put them into the context window of the model, so that the model can create a better answer based on that context. Compared to the previous approach, it's a super simple addition: you just put Weaviate into the pipeline, and the vector embedding search that we talked about before really helps in understanding the question and mapping it to the right documents. And then ChatGPT can say, I can't actually read it because I have the Zoom overlay on here, let me move that for a second, it says: Weaviate has its own custom storage engine optimized for working with vector embeddings, which is taken from the previous slide. What's cool about this is, first of all, we didn't have to retrain ChatGPT. We didn't have to do any kind of model training. We could adapt it on the fly simply by looping in the right context, and we can also change it on the fly: a new document could appear tomorrow, or we could remove a document, for example one we don't want to show for whatever reason. Weaviate is a database, and in a database you can do CRUD operations, and that gives you a way to influence what the large language model produces on the fly. Now I can't switch slides anymore. Okay. So we have an example app called Verba, the Golden RAGtriever, which is a pun on RAG and the nice doggy in the image. It's meant as a take-your-own-data, put-it-into-Weaviate, retrieval-augmented-generation kind of example, and the standard version comes with our own documentation. So to prove my point, I put the same question into Verba and asked, does Weaviate make use of Elasticsearch? Verba luckily gave the right answer and said Weaviate itself does not make use of Elasticsearch, with a bit more context around it. So you kind of get the best of both worlds: you have the model creating that nice answer, the generative part, but you can also base it on facts that you control yourself and that you can change on the fly. Yeah, so that is, I would say, the main use case right now. There are plenty more use cases for vector similarity search, it could be recommendations, it could be all kinds of other ones, but these are the biggest reasons we see right now for people adopting Weaviate.
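(As a minimal sketch of the retrieval-augmented generation loop just described, the snippet below wires the three steps together. The embed, search, and generate callables are hypothetical stand-ins for an embedding model, a vector database query such as Weaviate's, and an LLM call; they are not the actual client APIs used in the demo.)

```python
from typing import Callable, List, Sequence

def answer_with_rag(
    question: str,
    embed: Callable[[str], Sequence[float]],               # embedding model call (assumption)
    search: Callable[[Sequence[float], int], List[str]],   # vector-DB nearest-neighbor query (assumption)
    generate: Callable[[str], str],                         # LLM completion call (assumption)
    top_k: int = 3,
) -> str:
    """Retrieve relevant chunks, then let the LLM answer grounded in them."""
    query_vector = embed(question)         # 1. map the question into the vector space
    chunks = search(query_vector, top_k)   # 2. fetch the closest document chunks from the knowledge base
    context = "\n\n".join(chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)                # 3. the answer is now grounded in retrieved facts
```

The point of the pattern is exactly what the talk describes: the model is never retrained, and swapping documents in or out of the knowledge base changes what it can answer on the fly.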
And yeah, with that I would like to dive a bit deeper into Weaviate's architecture. If anyone has any questions on the previous part before I switch topics, now would be your chance, but of course we can also talk about them later.

Weaviate uses collections and shards. A small disclaimer here: when I say collections, if you open up the Weaviate documentation right now you will come across the term classes. This is something we're currently renaming; we got feedback that collection is the more intuitive way of describing things in the NoSQL world. In a Weaviate setup, and I use setup here as a relatively broad term, it doesn't mean a specific single-node setup or a cluster, it's really just your Weaviate deployment however you have it set up, you could have, say, three collections: articles, authors, events. From a user perspective, that's your way of creating something similar to a table in the SQL world. Within a collection, we have shards. The reason for shards is that vector data, or vector search in general, can be quite resource intensive, and later we'll focus on how to reduce that cost through product quantization, but if you have a data set that's too large to fit on a single node, sharding is a way to split it up, and I have a slide on that in a second. So one collection can comprise multiple shards. Within a shard, we have essentially three parts. One is the vector index; I'm using HNSW here as if it were the only option, which isn't actually true, this is pluggable in Weaviate, but something like 99% of the use cases we see right now use HNSW, and in the future there will probably be more. There is also an object store. This is also very important in the Weaviate architecture: while it is vector native, it's still a pretty decent key-value store under the hood, so there are no limitations on the kind of objects you can store with your data. If you retrieve something, you don't have to use a secondary system that holds your original object and then resolve it after the vector search, where the search just gives you back an ID and you have to go to that other system; you can store the full object in Weaviate as well. And there is an inverted index that also uses an LSM store, and this inverted index is, to my knowledge, the only LSM store that natively understands Roaring bitmaps. That's an optimization we made for efficiently combining different filters, doing AND operations, OR operations, these kinds of things. All of this lives together in that self-contained unit, the shard, and works well together.

So, for example, Etienne, a quick clarification question, and you can defer it if you're going to talk about it next. If I have a collection of documents, let's say a whole bunch of PDFs that I have OCR'd and I'm going to store indices on, what can I put in the object store? Can I keep the OCR'd document in there too, or what's the scope of what you would suggest people put into the storage layer here?

Yes. So typically what we see is, let's say you have a PDF as the raw source, and let's say the PDF also has images, because I think images make it very interesting. Then what you could do is store the OCR'd text just as chunked text.
For example, for hybrid search, if you want to do both keyword matching or keyword scoring like BM25 in combination with vector search, then you actually need the text chunk. What you could even do is store the raw input, the blob that is your PDF. I'm not sure that's the best idea, because if that becomes a lot of data it probably makes more sense to put it into super cheap storage like cloud storage and just link to it from Weaviate, but technically you could do it. If you say, okay, I'm fine with using SSD storage to hold the raw blob, then you can store basically anything. But typically what we see is a sort of semi-structured JSON, where you would have maybe a title and then the paragraph text, or something like that. And the idea from our perspective is that we want to be able to serve a search request end to end, so that you don't have to query a second database system at search time just because you're missing some part of the data.

Got it. Thank you. This may be something you're going to get to shortly, but how do you do sharding over these vector spaces? Maybe you're going to get there, I'm just curious how you find a sharding key.

Let me go to the next slide, I hope that moves in the right direction. So for sharding, we typically shard on a specific key in a two-step approach, where the key maps to a virtual shard and the virtual shard maps to a physical shard, basically consistent hashing, where the idea is that if you ever change the number of shards you minimize the data movement. This is actually something we don't have a re-sharding feature for yet, because, especially with HNSW, the cost of building the index is relatively high, so re-sharding on the fly is not something we typically see people do, but we have the general architecture in place. It's similar to DynamoDB or Cassandra: if we wanted to change the number of shards on the fly, we already have the mapping there to make sure only a minimal amount of data would have to be moved.

What do you shard on? Do you just randomly partition the data, or do you shard in a more systematic way where you might be able to filter shards on a search query, or do you have to search all the shards on every search query?

Yeah, right now we just hash on the ID field, so that's basically random. But there is a second scenario, and this is why this slide says sharding in a single-tenancy situation. There is a second scenario that we'll get to later for multi-tenancy, where it's not using filters, but it's doing exactly what you're hinting at: if you already know that part of your data lives on a specific shard or a specific node, there's no need to search across all of them. In that case it's actually different. What I'm describing here is mainly for the case that's agnostic of specific filters, where you really have a multi-billion-scale data set that's just too large to fit on a single node and you don't have any good criteria to narrow it down further, so just by the nature of the query you know that you have to hit all shards anyway.
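(A small illustration of the two-step key-to-virtual-shard-to-physical-shard routing just described. The shard counts, names, and the use of MD5 here are illustrative assumptions, not Weaviate's actual implementation.)

```python
import hashlib

NUM_VIRTUAL_SHARDS = 128            # fixed, illustrative; many virtual shards map onto few physical ones
physical_shards = ["shard-a", "shard-b", "shard-c"]

def virtual_shard(object_id: str) -> int:
    # Hash the ID field so objects spread roughly uniformly over the virtual shards.
    digest = hashlib.md5(object_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_VIRTUAL_SHARDS

# Each virtual shard is pinned to a physical shard. Adding or removing a physical shard
# only requires reassigning some virtual shards, so only that fraction of the data has to
# move (the consistent-hashing idea mentioned above).
virtual_to_physical = {
    v: physical_shards[v % len(physical_shards)] for v in range(NUM_VIRTUAL_SHARDS)
}

def route(object_id: str) -> str:
    return virtual_to_physical[virtual_shard(object_id)]

print(route("doc-42"))   # e.g. "shard-b"
```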
So this is for that case, and later, in the multi-tenancy section, we'll talk a bit more about how to narrow the scope down based on the query. Yeah, so let me see. I mentioned HNSW, which stands for hierarchical navigable small world, quite a mouthful, I struggle with it every time, and it is a very common vector index; you see it a lot right now. Depending on the vendor's perspective it's either used as a positive signal, saying it's a super fast and performant index, or, if the vendor has something else, as a negative, saying it's an index with a very high build cost. It's just engineering: there are trade-offs, and HNSW makes one particular trade-off. On the one hand it's an approximate index, so it gives up a bit of accuracy, though you can still reach a very, very high level of accuracy, in order to gain a lot of performance. The other trade-off, compared to an exact index where you would have to do a brute-force search, and compared to other index types, is that HNSW tends to be very fast on the query side but relatively slow, or relatively compute-intensive, to build. HNSW is a graph-based index, and as I mentioned in the intro, I want to focus on this part because I think it's nice to visualize and nice to understand how you can use a graph to do an approximate nearest-neighbor search. The graph is a proximity graph, and the navigable small world part of the name is based on the idea that if you know someone who knows someone who knows someone, then with, I think, six or seven hops you can basically reach any person in the world. Even if the people are completely globally distributed, just by using those graph connections you can navigate basically anywhere. It's the same idea in this proximity graph. Don't think of this so much as a two-dimensional vector space; two dimensions is just the representation. Think of it this way: no matter what the dimensionality is, as long as you have a distance metric, cosine similarity or dot product or something like that, you can reduce the comparison of two points to a single scalar value that tells you whether they are closer to one another or further apart, and we can make use of this to do a similarity search. So the new thing we've introduced here is the concept of a query vector, that's the green dot on the right, and just visually you can immediately tell that if I want the top three nearest neighbors, that's basically just a circle around that dot, and those should be the three points on the left, sorry, on the right.
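(For reference, this is the exact, brute-force top-k baseline that the graph traversal described next is designed to avoid: score every stored vector against the query. A toy sketch assuming numpy and cosine similarity; the data here is random and purely illustrative.)

```python
import numpy as np

def brute_force_top_k(query: np.ndarray, vectors: np.ndarray, k: int = 3) -> np.ndarray:
    # Cosine similarity against every stored vector: exact, but O(N) distance
    # calculations per query, which is what a graph index tries to avoid.
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q
    return np.argsort(-scores)[:k]          # indices of the k nearest neighbors

vectors = np.random.rand(1000, 128).astype(np.float32)
query = np.random.rand(128).astype(np.float32)
print(brute_force_top_k(query, vectors, k=3))
```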
We've also introduced the concept of an entry point, which, just for the sake of this demonstration, is assumed to be a randomly chosen point that we use every time to enter the graph. Now what we have to do is, as efficiently as possible, get from our entry point to the query point, to those three points highlighted in that circle. For this, in slightly oversimplified terms, HNSW has the following algorithm. From wherever you are, which in the beginning is your entry point, you evaluate the outgoing edges and jump onto those new points. For every new point, since it's just a similarity comparison against the query vector again, you can calculate whether that point brings us closer to where we want to go or takes us farther away, and using those calculations you pick the best point of the newly discovered set and make that the current candidate. So you can see this is an almost recursive kind of algorithm. Another helper is: let's try not to score any point twice, so we keep some sort of list of the points we've already visited. And our exit condition is when we can't improve the scores anymore by evaluating our neighbors; that's when we're done. I'll walk you through an example in a second. Just to make the next slides easier to understand, the colors: we already saw the entry point, the query vector, and the blue points that were just there in the beginning. Now I'm introducing two more colors. Gray means a point we already scored and visited in the past but discarded, and light green highlights our current best candidates. So this is the same graph as before, the only difference being that I've highlighted the edges available to us from that entry point. If we follow them, the graph now looks like this, so a lot has changed with just one iteration. Some of those points, like the point on the left, are marked gray: we've discarded them because they didn't actually bring us any closer to where we want to go. The other three points are all green, but one is also red because it's the closest; you can think of it as both green and red at the same time. What that means is those three are currently our closest matches, which are already better than the entry point, and out of those three, the one highlighted in red is our new candidate, our new entry point. So we can do the same thing again from that perspective, which gives us two new edges. Why just two, why is one edge still blue? Because it would point to a point we've already visited, so we're not going to follow that edge again; it only gives us two new options. If we repeat that, we move to that point, and just trust me that this point is actually closer than the one at the bottom, that's very hard to see on the slide. Same thing from here: we've actually only discovered one new edge, because we're not following the top-left edge, which would just lead us to a point we've already discarded. And if we do that again, we can move to our current candidate here.
Similar situation: one of the edges points to something we've already visited, but one edge gets us to a new point, and we make that our new entry point. And then finally, if we do the same thing again, we're in the exit condition, where we can't move any closer anymore, because nothing connected to our three possible candidates would get us closer to where we want to go. If we took the top green one and evaluated the edge to the bottom left, that would be further away, and similarly, from the bottom green one we could only go left, which would also take us farther away from our query. So what we've done now is navigate this graph toward our query vector, and in this case we've actually identified the exact top three neighbors, so that is definitely possible. But you can see that roughly half of the graph is still blue, and that's our efficiency saving: if we had done a brute-force search, we would not have had a graph at all, we would have just scored every single point in the data set against our query vector, but because we were able to follow this graph, we managed not to score half of them. This is a super small example for demonstration purposes, but as the graph grows, there is more and more that we don't have to score, and this is where the performance boost in HNSW comes from. Just because it's nice, let me go back and do this again at slightly higher speed, so you can see it almost like an animation of how we're traversing the graph.

You mentioned these are approximate indexes. Are there parameters you tune here to adjust that, and how bad can it get, I guess, with these approximate structures?

Yeah. One thing that's super important for HNSW, and something we actually ran into quite a bit in the beginning, is that you need a data set with a useful distribution. If you randomly generated vector embeddings, just with a random function, all distances would on average be pretty much the same, and that would make for a horrible graph. But real-life data sets, for example text, tend to have these natural clusters, which makes for a much better distribution, so that is something to keep in mind for this graph to be efficient. But then you also have parameters you can tune. One thing that's implicit in this graph is the number of connections; I think the limit is four here, because no point has more than four outgoing connections. Another parameter you can tune, which I oversimplified here, is the candidate scope. I've pretended that we're looking for three results and that we're only ever tracking three candidates; in real life those are actually two different numbers. You have the result count, which in Weaviate is called the limit, or typically it's just called the top k, and then you have a parameter in HNSW called ef, which is the candidate window. It's quite common to search for your top 10 results with an ef of 128, which means you keep roughly 10x or 12x as many candidates as results, and that gives you an immediate query-time trade-off between performance and accuracy: the larger you make the temporary result set, the more points you have to evaluate, but also the more points you discover, so the more likely it is that the search gets better.
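(A condensed sketch of the single-layer search loop just walked through, with the visited set, the ef-sized candidate window, and the exit condition once no neighbor can improve the results. It follows the published HNSW search-layer algorithm in simplified form; the graph layout and distance function are left abstract, and nothing here is Weaviate's actual code.)

```python
import heapq
from typing import Callable, Dict, List

def search_layer(
    entry_point: int,
    graph: Dict[int, List[int]],        # node id -> ids of its connected neighbors
    distance: Callable[[int], float],   # distance from a stored node to the query vector
    ef: int = 128,                      # candidate-list size (kept larger than the final top-k)
) -> List[int]:
    visited = {entry_point}
    candidates = [(distance(entry_point), entry_point)]   # min-heap: best node to expand next
    results = [(-distance(entry_point), entry_point)]     # max-heap of the ef best nodes seen so far

    while candidates:
        dist, node = heapq.heappop(candidates)
        if dist > -results[0][0]:
            break                        # exit condition: nothing left can improve the results
        for neighbor in graph.get(node, []):
            if neighbor in visited:
                continue                 # never score the same point twice
            visited.add(neighbor)
            d = distance(neighbor)
            if len(results) < ef or d < -results[0][0]:
                heapq.heappush(candidates, (d, neighbor))
                heapq.heappush(results, (-d, neighbor))
                if len(results) > ef:
                    heapq.heappop(results)   # drop the current worst of the ef candidates
    # closest first; the caller would cut this down to the requested top-k (the "limit")
    return [n for _, n in sorted((-neg_d, n) for neg_d, n in results)]
```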
And then there's another parameter you can tune that I didn't really talk about. Here I've pretended the graph was already there, but of course there's also a build phase, and during the build phase you actually do a search on the graph to find out where to place a new node. There you use the same kind of ef parameter, called ef construction, where if you do a better search at build time you're more likely to place the node in the correct spot, which in turn builds a better graph. So that's another query-time versus build-time trade-off you can make. On our website you can see all kinds of graphs for this, or if you look at ANN Benchmarks from Erik Bernhardsson, he has this independent scoring of different vector libraries, and you will always find these plots that show the trade-off. I think he uses throughput, but it's single-threaded, so throughput is really just latency, basically, versus accuracy, and you get these nice curves where the top-left or top-right corner would be the ideal scenario, which nicely visualizes that trade-off.

One thing though, so far I've actually only described part of it. Let me just drink some water for one sec. So, everything I've described so far, other than the parameters, isn't actually specific to HNSW; it's really NSW, which was the predecessor of HNSW, and we haven't talked about the hierarchy at all yet. This is on the next slide. Look at this flattened 3D view of the 2D graph, where I've again highlighted a couple of points. The colors have different meanings now; really all they mean is: this is the yellow point, the green point, and the red point, just so you can find them again. They no longer mean entry point or query point or anything, because now what we can do is introduce a second layer. In this second layer, and this is the reason for those colors, some of the points that exist on the lower layer also exist on the higher layer, but some of the other points don't. What that means is that since we have fewer points, our edges are longer on average, which also means that if we navigate the top graph, we're much more likely to cover a lot of distance with a single hop. The query retrieval goes top to bottom: we start on the top layer, which has very, very few points, and we go down layer by layer as soon as we've exhausted one layer, and this is why you need points that are present on both layers. On the top graph you could say, let's move to the yellow point, and then you can't move any further to the left because there's no better result, so the yellow point now becomes your entry point on the lower layer. In a sense, in my brain, I think of this kind of like binary search, because you eliminate a portion of the search space on a higher level, and then you go down to a lower level and do a more granular search to actually retrieve the points, because if the higher layer is missing some points, then of course you can never discover those for retrieval there.

So it's sort of similar to a skip-list index? Yes, exactly, I think a skip list is probably a better example than binary search. Okay.
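(Continuing the sketch from before, the layer-by-layer descent just described might look like this, reusing the hypothetical search_layer function from the previous snippet. On the sparse upper layers a single candidate is enough to find a good entry point; the full ef-wide search only happens on the bottom layer, where every point exists.)

```python
def search_hnsw(top_entry_point, layers, distance, ef=128, top_k=10):
    """layers[0] is the bottom (complete) graph; layers[-1] is the sparsest top layer.
    Each layer is a dict mapping node id -> neighbor ids, as in search_layer above."""
    entry = top_entry_point
    # Walk down from the top layer: on the sparse layers one candidate (ef=1) is enough,
    # and the long edges there quickly move us into the right region of the space.
    for layer in reversed(layers[1:]):
        entry = search_layer(entry, layer, distance, ef=1)[0]
    # On the bottom layer every point exists, so do the full ef-wide search there.
    return search_layer(entry, layers[0], distance, ef=ef)[:top_k]
```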
One other thing that is super interesting in HNSW, and the HNSW paper is older than Weaviate, so when we started this was something we had to learn and it was actually quite difficult, although it's a super simple concept, is: how do you know which are the right edges? There's a heuristic in HNSW that decides that. If we look at this slide, this is another graph, and the view is now centered on that red point, so this is not meant to be the whole graph; the idea is: I am the red point and I could be connected to all of these points. Intuitively you would probably think this may not be the ideal graph, because if every single one of those points has a pattern like this, then everything is connected to everything, which is great for connectivity but doesn't actually give us any performance boost anymore, because evaluating the neighbors of a single point means evaluating the entire graph; then it's just a brute-force search. So there is probably a better way. One way around this is to limit the number of connections, but if you limit them, you want to keep the most valuable connections, and that's what the heuristic does. In simplified terms, at some point you cross the threshold on your connection count; there's another tunable parameter in HNSW called max connections, or M for short. If during build time you keep identifying perfect neighbors and keep adding connections, at some point a node just runs out of space for connections, say your M is 64, or in this case I don't know, it's like 17 or something. Once you cross that threshold, you don't want to say never accept new connections again, because then you could never connect to that new neighbor. Instead, what you want to do is prune. The way the pruning algorithm works is that it temporarily removes all of those connections, starts from scratch, and starts connecting points again. The four points highlighted here are, from the perspective of the red point, pretty obvious connections, because there is no way to reach them through a different point: they are the closest neighbors to the red point, and they are not closer to any other point we have already kept. You can kind of think of this in a clockwise pattern. And now the interesting thing is that every other point in that graphic has some point already connected to the red point that is closer to it than the red point is. In other words, that was a very complex way of saying: from now on, we're only going to do second-degree connections. And all of a sudden we have something that resembles a graph much more. We've taken all these points that were previously connected to the red point and pruned them, in the sense that if it makes more sense to reach them through an existing neighbor of the red point, we connect them through that neighbor rather than to the red point directly. This helps a lot in reducing unnecessary connections, because you can rely on your neighbor to discover a point rather than having to be connected to everyone directly. It also means that on average it leads to longer connections, because if the red point can have, say, 17 connections and right now it only has four, there is lots of space left, and it's a multi-dimensional space, not as crowded as this two-dimensional illustration, so there's way more room to make a connection to the red point, which over time leads to a better graph.
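(A compact sketch of the neighbor-selection heuristic just described: a candidate edge is kept only if no already-kept neighbor is closer to that candidate than the node itself is, since otherwise the candidate can be reached through that neighbor. The dist function is a hypothetical pairwise distance; this mirrors the heuristic in the HNSW paper rather than Weaviate's implementation.)

```python
def select_neighbors(node, candidates, dist, max_connections=64):
    """Prune a node's candidate edges down to at most max_connections (M)."""
    kept = []
    for cand in sorted(candidates, key=lambda c: dist(node, c)):       # closest candidates first
        # Keep the edge only if the node itself is closer to the candidate than
        # every neighbor we have already kept; otherwise rely on that neighbor.
        if all(dist(node, cand) < dist(kept_n, cand) for kept_n in kept):
            kept.append(cand)
        if len(kept) >= max_connections:
            break
    return kept        # fewer, longer, more useful edges than simply taking the M closest
```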
Okay, let me see, my slides are not loading. Okay, that was the HNSW part, which is also kind of the prerequisite for talking about product quantization. Before I switch topics, any more questions on that?

Can you go back to the previous slide for a moment? So every single data point I'm looking at here is an object in the database, is that correct? Like each one would be one document? Yeah, well, not necessarily: you could have documents that don't have a vector attached to them, but this is every object that has a vector, basically.

Would you ever consider constructing an artificial point just to kind of create a center? Let's say the red point you have there right now didn't exist; constructing an artificial point so that you have a very convenient neighbor, is that something you've considered, or that people do? Super interesting point. I don't think we've considered that directly. What you can do, if you think of filtering for example, is this: filtering removes some points, which means your connectivity drops, and then you can create those edges again. But typically, rather than creating artificial points, you would probably just create artificial edges, edges that wouldn't be part of the algorithm, where you just say, okay, we have these two distinct clusters and all that's missing is one edge to go from one cluster to the next, and now they're connected again. Cool, thank you.

How frequently do you rebuild these indices as data gets added and deleted? I don't know if you're going to talk about that. Oh yeah, very, very good question. My slides are a bit slow to load whenever there's a lot of color, let me go back to that one. So adding data happens on the fly: HNSW is built in a way where you never have a build phase and then a query phase, because the build phase is just searching and inserting. If you purely add data, there's no maintenance you would ever have to do; your graph would never degrade over time. This is a bit different once you add deletes. For deletes you have two options. One option is to temporarily mark a point as deleted, basically a lookup list, so you know: if this point is marked as deleted, I can no longer include it in the result set, but I still need the point, because if I just removed it right now, I would risk my graph no longer being connected. So one option is: we wait for those tombstones to pile up, and at some point we say, we have 30% tombstones now, it's inefficient because 30% of the edges don't help anymore, let's rebuild the graph. This makes a ton of sense if you have bulk deletes or any other event where all of a sudden you have a lot of deletes in a short period of time. The other approach, and this is actually the one we typically use in Weaviate, is to try to repair the graph. What we do is iterate over the points that are marked as deleted, so basically we do both: we use the temporary tombstone list to make sure that the moment the deletes come in, you still serve the right queries by not including something that was marked as deleted, but in the background we also iterate over that list.
For each point in that list we check: if we removed this point right now, we would basically be removing connections; if we take the red point on this slide as an example, removing it right now would mean those other four clusters are no longer connected. So we then start rebuilding and reassigning them to one another, and that is a very costly process. That's why the bulk rebuild would be the more efficient route. The on-the-fly repair works well if you don't have too many deletes: if, from a user perspective, say 80% of the uploads are inserts, maybe 10% are updates and 10% are deletes, and that happens over time, then you can use your compute to spread the work out and repair on the fly, which is also nice because you never have this big build phase where you'd have to either work on a copy in the meantime or mark the index as read-only; you can do it as part of the regular process. Thank you.

And do you ever see applications or scenarios where, if I just look at the vector database, forget about the actual data, just the vector component of it, the number of edges you have across all the layers is way more than the number of nodes, or do they tend to be roughly the same order? Yeah, so this is controlled by the algorithm, because it has that parameter to keep the edges in check. The point that was inserted last probably has the fewest edges, because it hasn't been discovered yet by points inserted after it, but overall the algorithm, through this pruning, and by starting the pruning whenever the edge count gets too high, keeps that in check. Thank you.

I noticed that we're going a bit slower than anticipated, so we're only starting the new section right now. Are we still good for time? Yeah, we've still got, you know, 17 minutes or so. Great, then we can tackle product quantization, because I think that's the next most interesting one. Cool.

Okay, so some motivation for what product quantization is and why you would need it. Here are three of the most common embedding models, from our perspective, that we see right now. The biggest one is OpenAI's text-embedding-ada-002, which has a dimensionality of 1,536. We assume these embeddings are float32s, so that's four bytes per dimension, which means a single embedding is already about six kilobytes, and a million of them would be about six gigabytes. It's even more for Cohere, which has updated its default embedding model to 4,096 dimensions, so that's roughly 16 kilobytes and 16 gigabytes respectively. And then there's Sentence-BERT: the open-source SBERT model that's ranked best on the SBERT website, which is a site from Nils Reimers, who works for Cohere but before that trained open-source Sentence-BERT models, and who also publishes accuracy benchmarks, which are really great for figuring out which model works well in which domain. That one has 768 dimensions. And these numbers are just for a million embeddings; if you have a billion, those gigabytes turn into terabytes, and that's the point where you start thinking: if those vector embeddings need to be in memory, which by the way is also something that should probably be challenged, but I only have 17 minutes left, so for now
let's assume they live in memory; what can we do to reduce their memory footprint? One thing you can do is quantization, and product quantization is one of those techniques where, for us, it was hard to get into. There's been research on it for years, but to me it felt like it was never accessible, so what I'm going to try to do in the next couple of minutes is make that knowledge accessible, because the idea behind it is actually not that complex. What we have here is basically just clustering. We take our data set, or rather the distribution of our data set, and try to figure out which clusters would roughly represent the distribution of our vectors. This could be a k-means algorithm, for example, that you run on your data, and you would come up with these white points that are the centroids of the clusters; the colorful areas show which centroid a point would be assigned to, since anything within one of those areas belongs to the centroid of that particular cell. What you can do with this is assign numbers to the clusters, just arbitrary ones for now; I went in a spiral pattern here, so 0, 1, 2, 3, and at some point it goes all the way up to 18, and since we started with 0, that means we currently have 19 of those clusters. In reality, you probably want a number that makes more sense than 19: if you use one byte, an unsigned integer with 8 bits, that gives you 256 options, and if we think about the next step of what we want to do with those clusters, that makes a lot of sense. So in real life you wouldn't have 19 but, for example, 2 to the 8th power options, which is 256. Now, how does that help us? If we take every single vector and assign it to one of those clusters, then we only have 256 options in total, and for a data set of millions or hundreds of millions of objects, that's not a lot of diversity. We need more than 256 options, and what the PQ algorithm does, which I think is super smart, let me see, I already clicked next but the next slide is not loading, maybe try again, okay, what it does is split our full-length vector into individual segments, and then instead of assigning the whole vector to one of those clusters, we assign each segment to a cluster. With this fictional four-dimensional vector here, it could be way longer, but it's easier to fit on a slide if it's just four dimensions, we would say: for each of those segments, and here we have just one dimension per segment, which basically means every segment is just one float, we do a similarity comparison using the same similarity metric we would normally use, so this could be cosine similarity or dot product or something, and we use that to determine the closest of those clusters, which gives us some number. Now you see why it made sense to encode this as something like a byte, because we now have one of those 256 options for each segment, which means that our four-dimensional vector, which was four floats previously, can now be represented by just four bytes. So even just using one dimension per segment, we can reduce this from 32 bits to 8 bits per dimension while keeping the same dimensionality. Obviously this is a lossy algorithm: we're now using a byte to represent information that was previously represented in a float, so we lose some information, but we also just gained 4x compression. And we can
take that one step further by not saying a segment is just one dimension of our vector; we can say a segment is multiple dimensions of our vector. For my slide it was easiest to do just two; in reality, something like six or eight works really well on, for example, OpenAI embeddings. And now we do the exact same thing, except instead of comparing a single number, we compare that mini sub-vector: we compare positions zero and one of our vector against positions zero and one of our 256 centroids, and in this example that means centroid 83 is the best match for that segment, and then the same again for positions two and three. So what have we done now? We got the 4x compression from using bytes instead of floats, but we also used two dimensions per segment, so our final code is only half the length of the input vector, and all of a sudden we have 8x compression. And again, the same caveat as before: because we're putting more information into fewer bits, we lose some accuracy. If this is all you do, you now see one of those graphs where, sorry, the x-axis is query time and the y-axis is accuracy, so you'd want as little query time as possible and as high recall as possible, and the top-left corner would be the ideal scenario. And you can see, if this is all you do, and I need to move my Zoom overlay again because it's covering the legend, the legend shows the number of dimensions per segment: if you just do the 4x compression, just float to byte, it's still fairly accurate, but with two dimensions per segment, or four, or eight, accuracy starts dropping very, very rapidly. And this is only on, I think, the SIFT data set, which is just 128-dimensional; it gets worse with higher dimensions. From this alone you could argue that this is not really usable, because 80% accuracy or so is too low; typically users tend to be happy with the high nineties, 97, 98 or so. So what you can do, see the new graphic, is simply fix this by over-fetching a bit, and it's a great thing that we talked about tunable parameters in HNSW, because what we already do in HNSW to overcome HNSW's own inaccuracy is over-fetch. We talked about this ef parameter before, where you keep 128 candidates even if you're only looking for the top 10 results. If you do that with PQ, you take your candidate list, which is still a tiny fraction of the entire data set, but for those 128 candidates you now load the original, uncompressed vectors from disk, and just within that top 128 you re-rank, or re-score, based on the actual distances, and then you can actually get quite close to what we had before. So what you can see here, sorry, I took these two graphics from two different blog posts and they're saying the same thing but in a slightly different format: in this one we actually have accuracy on the x-axis, so further to the right is more accuracy, and on the y-axis we have throughput, which I guess you can read in a similar way since it's tied to latency, so now the top-right corner is the ideal scenario. And what you can see is, sorry, is that a question? No? Okay.
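(A toy end-to-end sketch of the scheme just described: fit 256 centroids per segment, store one byte per segment, then over-fetch with the compressed representation and re-score the small candidate list against the original float32 vectors. It uses a minimal k-means and random data, and it decodes the codes for the coarse comparison; a real implementation would use per-segment distance lookup tables against the codes instead, and would read the uncompressed vectors back from disk only for the final re-scoring step.)

```python
import numpy as np

def fit_codebooks(train, dims_per_segment=2, n_centroids=256, iters=10, seed=0):
    """One codebook (n_centroids x dims_per_segment) per segment, via a tiny k-means."""
    rng = np.random.default_rng(seed)
    d = train.shape[1]
    assert d % dims_per_segment == 0
    codebooks = []
    for s in range(d // dims_per_segment):
        seg = train[:, s * dims_per_segment:(s + 1) * dims_per_segment]
        centroids = seg[rng.choice(len(seg), n_centroids, replace=False)]
        for _ in range(iters):                           # a few Lloyd iterations
            assign = np.argmin(((seg[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
            for c in range(n_centroids):
                members = seg[assign == c]
                if len(members):
                    centroids[c] = members.mean(axis=0)
        codebooks.append(centroids)
    return codebooks

def encode(vectors, codebooks, dims_per_segment=2):
    """Each segment becomes one byte: the id of its nearest centroid."""
    codes = []
    for s, centroids in enumerate(codebooks):
        seg = vectors[:, s * dims_per_segment:(s + 1) * dims_per_segment]
        codes.append(np.argmin(((seg[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1))
    return np.stack(codes, axis=1).astype(np.uint8)      # shape: (n, d / dims_per_segment)

def decode(codes, codebooks):
    """Approximate reconstruction, used here for the coarse (compressed) comparison."""
    return np.hstack([codebooks[s][codes[:, s]] for s in range(codes.shape[1])])

# float32 vectors cost d * 4 bytes each; the PQ codes cost d / dims_per_segment bytes (8x smaller here).
train = np.random.rand(2000, 8).astype(np.float32)
codebooks = fit_codebooks(train)
codes = encode(train, codebooks)

# Over-fetch with the compressed representation, then re-score the small candidate list
# against the original float32 vectors to fix the ranking.
query = np.random.rand(8).astype(np.float32)
approx = decode(codes, codebooks)
candidates = np.argsort(((approx - query) ** 2).sum(axis=1))[:128]     # e.g. ef = 128
top10 = candidates[np.argsort(((train[candidates] - query) ** 2).sum(axis=1))][:10]
```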
What you can see is that for every value we have on the blue line, which is our uncompressed control, compared to the red line, ignore the orange and the green for now, we can actually find a point on that red line where we get the same accuracy at only slightly lower throughput. If we take, for example, the 0.95, because I can use my mouse, the 0.95 is 95% accuracy, we can say: if you want 95% accuracy, you could either do that uncompressed with, I don't know, 320 or so QPS, queries per second, so a throughput measurement, or you could use the compressed one, and this was done with six dimensions per segment, so that's a 24x compression of the vector space. You would get all those memory savings, and all you'd have to give up is dropping to maybe 270 or so queries per second. That's a pretty good trade-off: you lose 20% or so of throughput, but all of a sudden you gain a lot of memory savings, a much-reduced memory footprint.

Yes, question. So you said with the quantization the accuracy gets lower, but I kind of lost you on what exactly over-fetching and re-scoring do to mitigate that. Yeah, so the idea is that because of this compression you add some distortion to the actual distances, so you're not comparing the real distances anymore, you're comparing approximate vectors, which makes the distance comparison approximate as well. If you combine that with HNSW for navigating the graph, it's actually still quite good: you can still navigate the graph with compressed vectors, because a good-enough signal is great for getting you in the right direction. But if you just stopped there, then because of that distortion our true top 10 might actually sit at positions 14, 18, 19 or so; they're still within that slightly larger window that we retrieve, but if we just cut the list off after 10, we'd actually be returning what are really positions 17, 24, and so on. So what we do is take that whole list, which in our case is 128 candidates, which is still a tiny, tiny fraction of, let's say, a million vectors, which is what this graphic was actually done on. Having to load 128 vector embeddings from disk, which of course is also something you can optimize for, and re-scoring them, basically brute-forcing the true nearest neighbors out of those 128, is still a fraction of the computation you'd have to do if you had brute-forced the entire data set. And it's not just a fraction of the calculations: to even be able to brute-force the entire data set, you would either have to keep the entire data set in memory, which means you lose the memory savings, or you would have to stream the entire data set from disk, which would make it way slower, because now you're talking about, I don't know, 16 gigabytes or so; even with a fast SSD, just streaming that, you'd probably be disk-bound before you were CPU-bound. Does that answer it? Thank you.

Martin, are you able to unmute? Yeah, I can. Okay, yeah, there we go. Okay, two questions, but they're both related, so I'll ask them at the same time. First, under the assumption that quantization is generally going to be data-set dependent in some shape, so that if you add or remove elements the optimal quantization boundaries are going to change.
This is also going to impact how much you want to over-fetch. So, two questions. In your view, or in Weaviate's in general, how do you feel about creating a huge over-fetch boundary, so let's say you expect the top 10 to always be within the top 17, 25, etc., and you just shoot for the top 50 so you never have to rebuild the index? That's question one. Question two: if you do start noticing degradation, how would you track this deterioration of quality, where you say, okay, the data has changed too much, we now have to recalculate our entire index? How do you balance all of these things with the constant evolution of the data?

Yeah, great question. Maybe to start with the first part: it is data-set dependent, that is correct, and you do need some kind of sample to come up with a good quantization model. What we see in practice is that if someone has, let's say, a billion-scale data set, they would import a small fraction of it, maybe a million or so objects, and use that to fit the model for the entire data set, and that is surprisingly often already enough. That doesn't yet take data shift over time into account, but it means you don't actually need your full data set, as long as you have some diversity in the sample. The worst thing you could do is this: say your data is nicely, evenly distributed across this 2D space, but somehow you import it in an order where you first import all the points close to the bottom-left corner, then fit your PQ model based on the bottom-left corner, and then import the rest; that would be a pretty bad model. But if you have sampled points that represent the entire space, then the only drift you need to worry about is the whole space drifting, which is less likely. So in practice, just over-fetching a bit more, or over-fetching in general, can go a pretty long way. That said, and this is I think your second question, how do you measure it, that is a pretty difficult one to answer, because with these sample data sets it's super easy: you can brute-force the true nearest neighbors once and compute your recall by comparing your approximate results with the real ones. With real data, that's a lot more difficult. One way to do it would be to occasionally run sampled brute-force comparisons and see whether your recall drifts over time, and that could be a good way to identify that it has drifted too much and you now need to do something about it. All right, that was fantastic, thank you for the insight. Awesome, thank you.

I'm also noticing that we're approaching time. I think the PQ section is done; I would have another section on multi-tenancy, but I don't want to go even more over, so that could be a follow-up next semester, if everyone is happy with that. Yeah, I mean, we've probably got time for one or two more questions from the audience, if there's anyone else who wants to unmute.

Yeah, I have a question, Etienne. You guys are at the bleeding edge of seeing some of these emerging applications. The locus of application categories that tends to dominate what you're seeing, is it like text search reimagined through the lens of vector search, or is it all over the place? I feel like both, like it's all over the place in the sense that we see these wild ideas. For example, I got a question in a panel discussion the other day about agents, where I said I'm super bullish about autonomous
agents, not because I believe that we're there yet, but because the potential is so great. At first I thought, oh, what is this agent thing, a new thing that popped up out of nowhere that I now have to look into, but once I got the general idea that agents are just any kind of automation that can replace more complex human tasks, I all of a sudden see use cases for agents everywhere. I don't know, your flight gets canceled, you get off the plane, the help desk has a long line because 300 people from the plane all need to go to the help desk, and I think, oh, that's probably an agent in the future. So that's what I mean when I say they're all over the place. At the same time, when we look at what's moving into production, we do see a pattern: things like aiding support are super common in the RAG use case, so search through my knowledge base and have either a chatbot, or, if it's not supposed to be conversational, just a better search over whatever the knowledge base is, documentation, these kinds of things. Document understanding, I would say, goes in the same direction: we just have too much data for a human to browse through, and we need to summarize it, but also identify what the interesting parts are and then retrieve things related to those interesting parts, which quickly gets us into that vector search category. Out of the use cases that are moving into production, I would say these are the most common.

I'm going to ask one question we've asked throughout the seminar series this semester, which is: what is, from your perspective, your biggest unsolved problem? If you could wave a magic wand and just solve one thing that's really challenging your team, do you have an example or an answer for that? Yeah, so one thing we see across the category right now, you can even see it in those graphs right here, is that we're always talking about throughput and accuracy trade-offs. I think no one right now has a good solution for people who say: I don't actually have high throughput requirements, I have a massive data set but I only send one query a day; I'm not going to pay the compute for supporting tens of thousands of queries per second if I send one query a day, but I still want these insights into the data set. Knowing databases and knowing what's happening in the space, I think the natural progression would be that we need to talk about a separation-of-storage-and-compute kind of architecture, to potentially go in a direction where you don't pay for idle infrastructure. That's harder with vector search, because you have these massive monolithic indexes, but to be able to really serve all use cases, I think we need to challenge that idea and rethink things there. We're doing some cool research right now to look into those use cases as well. In a sense you could argue that it's just making vector search cheaper, but I don't think it's just that, because compression, for example, makes it cheaper, but it's still an architecture that prefers low latency and high throughput. I think there are so many interesting cases that simply have massive data set sizes and are more analytical in nature, and analytical vector search, to put it that way, is a super exciting topic that is not solved yet.