The Databases for Machine Learning and Machine Learning for Databases Seminar Series at Carnegie Mellon University is recorded in front of a live studio audience. Funding for this program is made possible by Google and from contributions from viewers like you. Thank you.

Awesome, thanks everyone for coming. It's a new semester, so we're excited to kick off the new seminar series. This semester we're doing Machine Learning for Databases and Databases for Machine Learning. In retrospect, maybe I should have called it AI for Databases because that's more buzzwordy. But nevertheless, the goal is to discuss various systems that are facilitating machine learning and AI workloads, and then using AI and machine learning to improve database systems. So we're excited today to have Andrey Vasnetsov, who is the CTO and co-founder of Qdrant, one of the hot vector search engines, or vector databases, in this space. He's here today to talk about the internals of their system. For people joining us for the first time, the way we want to do this is that we don't want Andrey to talk to himself for an hour in Zoom, because that's not fun. So if you have questions as we go along, please unmute yourself, say who you are, and ask a question at any time. That way it'll be interactive for him, not just speaking to the void for an hour. And again, we'd like to thank Google for sponsoring us again this semester. If you're not able to unmute yourself, you can send a message in chat and I'll try to stop and ask it. Okay? All right, Andrey, thank you so much for kicking us off in the new semester. It's always a good turnout. The floor is yours. Please go for it.

Yeah, thank you, Andy. And thank you for having me here. I'm Andrey Vasnetsov. I'm co-founder and CTO at Qdrant. I started Qdrant as a pet project about three years ago, basically to solve my own problems at my day job, and it later turned into a dedicated company. That's what I'm doing full-time now. All code in Qdrant goes through my review, so you can assume I know it pretty well. I have worked in search-related fields for almost 10 years already. I started my career as a backend engineer at a web-scale search project, similar to what Google does. I was working on very low-level components of the search engine, like indexing and storage. Later I switched to more high-level components like query parsing and ranking. I used a lot of machine learning in my job, and at some point we started to use vector search in our projects. That's basically where I got the idea to build a search engine specialized in vector similarity.

So what is Qdrant, actually? Qdrant is a vector similarity search engine or, as our marketing department makes me also call it, a vector database. Personally, I don't really like the term vector database and prefer search engine as the more accurate one, and I hope in this presentation I will be able to explain why exactly I think so. Qdrant is fully open source, distributed under the Apache 2.0 license. We have more than 60 external contributors on GitHub, and you're welcome to contribute to Qdrant as well. Qdrant is written in Rust, as I think system-level software should be. But that's a topic for another talk. For this talk, I propose to cover the following things. First, a short but mandatory introduction into what vector search actually is and which role Qdrant plays in this industry.
Second, an architectural overview of Qdrant: how it works on a general level, the general idea. I also want to clear up some terms before we go deeper into vector specifics. And the third topic is exactly these vector specifics: what makes vector databases, or vector search engines, special? What challenges did we face, how did we solve them, and which technical decisions did we make in the process?

But let's start with the first overview of what vector search is. As an approach, vector search is not new. It has been used by many big companies for quite a long time, and the principle hasn't changed much since then. The basic idea is that you have some model, or as we also call it, an encoder. Typically it's a neural network. It can convert some input data into a dense vector representation, and those vector representations are also called embeddings. They have a very interesting property: a pair of vectors which are close to each other in vector space usually corresponds to objects which are also similar in some sense. In the case of texts, it could be similarity of meaning; in the case of images, it can be visual similarity, and so on. It's actually defined by the model which type of similarity we want to capture with those embeddings. The distance function between vectors is also defined by the model, but in the majority of use cases it's just a simple dot product between vectors. So although this is not new, what is new is how available those ready-to-use models and the tools around them have become to a wide audience. Now anybody can go and download a pre-trained model and use it without any knowledge of machine learning. So the next logical step is to bring this technology to production level, so that not just scientists but also engineers and software developers can work with it. The next step is to provide the level of convenience and reliability that exists in familiar tools, like text search engines and traditional databases, and bring this convenience into the vector search world. That's what we do in Qdrant.

So let's take a look at how Qdrant actually achieves this. Here's the top-level overview of Qdrant's components. Each level in this hierarchy represents a certain kind of isolation. On the top level is the collection. A collection logically isolates types of data from each other. It's similar to a table in relational databases, or to collections in document databases like MongoDB, for example. After the collection comes the next level, which is shards. Shards isolate subsets of data from each other. It is guaranteed that shards contain only non-overlapping subsets of records. Shards can be moved between nodes, shards can be replicated for higher availability, and so on. And finally, the lowest level of isolation is the segment. A segment isolates index and data storage, so each segment is capable of performing all the same operations as a whole collection on its own, just on a smaller subset of data. Later I'll explain why exactly we need segments and how they work, but before that, a bit more about the top level. You may notice that at the top level, Qdrant's architecture is pretty standard. We have collections which contain shards; shards can be replicated and moved around nodes. We use the Raft consensus protocol to keep track of this metadata: where a collection is located, what the node status is, what the collection configuration is, and so on. So all of this metadata is stored in a distributed, consensus-based way.
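Returning to the collection/shard/segment hierarchy just described, here is a minimal sketch of how the three isolation levels might nest. These are simplified, hypothetical types for intuition only, not Qdrant's actual internals:

```rust
// Minimal sketch of the three isolation levels described above.
// Illustrative types, not Qdrant's actual code.

use std::collections::HashMap;

struct Segment {
    // Each segment owns its own index and storage for a subset of points.
    point_ids: Vec<u64>,
    appendable: bool, // at least one segment per shard accepts new writes
}

struct Shard {
    // A shard holds a non-overlapping subset of the collection's records
    // and is the unit of replication and movement between nodes.
    segments: Vec<Segment>,
}

struct Collection {
    // A collection logically isolates one type of data,
    // similar to a table in a relational database.
    name: String,
    shards: HashMap<u32, Shard>, // shard id -> shard
}

fn main() {
    let collection = Collection {
        name: "products".into(),
        shards: HashMap::from([(0, Shard {
            segments: vec![Segment { point_ids: vec![], appendable: true }],
        })]),
    };
    println!(
        "collection '{}': {} shard(s), {} segment(s) in shard 0",
        collection.name,
        collection.shards.len(),
        collection.shards[&0].segments.len()
    );
}
```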
Storing metadata through distributed consensus is also pretty standard for many distributed systems, and this is because Qdrant is a typical example of a system built on BASE principles. BASE here stands for Basically Available, Soft state, Eventually consistent. It is typically contrasted with the ACID principles, which are more common for relational databases. If you want an intuitive understanding of what BASE and ACID are, you can think about Postgres and Elasticsearch. Postgres is a typical example of an ACID database. It has very strict transactional guarantees and cares a lot about consistency of the data, but its scalability is actually very limited, basically up to the size of a single machine. On the other hand, Elasticsearch is a typical example of a BASE system: it is highly scalable, but its consistency guarantees are much weaker. And that's actually why I prefer the term vector search engine rather than vector database. For systems like Qdrant, scalability and performance are, in my opinion, much more important than transactional consistency, so it should be treated as a search engine rather than a database. Ideally, I think it should not even be used as the primary storage of data, especially considering that a full update of the vectors, due to, for example, a new version of the model, is a common operation in the vector database world. You need to wipe the whole database and create a new one just because your encoder changed, right? And that's something which never happens in traditional databases.

So what happens inside each shard is a bit more interesting for this talk, because it actually has specifics related to vector search. At the top level, we see a pretty standard component, a write-ahead log, which is responsible for making sure Qdrant doesn't lose data once it's committed. Again, it's a pretty standard component for any database. What is not standard is the second level of data segregation inside the shard, and this second level is segments. So why do we need many segments in the first place? Why can't we just put all the data into one segment and be done? There are actually reasons for this. The first reason is mutability. In Qdrant, we actually love immutable data structures. The simple assumption that a structure is only built once and then never extended opens room for many different optimizations. The data structure can become more compact. We do not need to jump between different locations in memory, therefore fewer cache misses. All data statistics are known in advance, and based on that, we can perform various optimizations as well: for example, we can pre-compute histograms, we can pre-compute data distributions, and so on. Of course, in this case we can also allocate the exact amount of memory we need, so we don't need to worry about memory fragmentation either. Loading immutable data structures is also much faster, because you don't need to perform any kind of deserialization: you just copy a raw chunk of memory from disk and that's it, or you can even do memory mapping, which is faster still. On top of that, we can further compress the data using techniques such as delta encoding, variable-byte encoding, and so on. The total combined effect of these optimizations can make immutable data structures an order of magnitude more efficient than mutable ones.
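To make those last two techniques concrete, here is a toy sketch of delta encoding followed by variable-byte (varint) encoding over a sorted, immutable list of ids. It is a generic illustration of the techniques named above, not Qdrant's actual on-disk format:

```rust
// Toy illustration of delta encoding + variable-byte encoding,
// two of the compression techniques enabled by immutability.

fn varint_encode(mut value: u64, out: &mut Vec<u8>) {
    // Emit 7 bits per byte; the high bit marks "more bytes follow".
    loop {
        let byte = (value & 0x7F) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte);
            return;
        }
        out.push(byte | 0x80);
    }
}

fn compress_sorted_ids(ids: &[u64]) -> Vec<u8> {
    // Because the structure is immutable and sorted, we can store
    // small gaps between neighbors instead of full 8-byte ids.
    let mut out = Vec::new();
    let mut prev = 0;
    for &id in ids {
        varint_encode(id - prev, &mut out); // delta encoding
        prev = id;
    }
    out
}

fn main() {
    let ids = vec![1000, 1003, 1004, 1100, 1105];
    let compressed = compress_sorted_ids(&ids);
    // 5 ids * 8 bytes = 40 bytes raw; the deltas fit in 6 bytes.
    println!("{} bytes raw -> {} bytes compressed", ids.len() * 8, compressed.len());
}
```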
The second reason for having multiple segments is the trade-off between latency and throughput. The reason for this is that concurrency within a single request is only efficient up to a certain point, and the closer we get to the low-level index, the less efficient concurrency becomes. That makes the segment a natural unit of parallelism. For example, if we are dealing with an application which requires very low latency for a single request, we can optimize CPU utilization by assigning each CPU core its own segment. In this way, a single request will utilize as much CPU as possible. On the other hand, if we have an application dealing with high throughput, which makes a lot of parallel requests, we can have a single large segment. In this case, we maximize the throughput of the whole system by serving each request on a dedicated core, using the whole segment in read mode.

We have a question from Rohan asking: what is the segment size? The segment size is not fixed. What is the default then, for you guys? Default is zero. If you don't have any data, it's zero. If you put some data in, it grows. What we prefer to configure is not the size of a segment, but how many segments you want to have on your machine. And a segment can go as big as you want, if you have enough resources for it. So according to this logic, suppose you have a number of segments; are you saying that's the max, and all storage could be utilized by one segment itself? That can happen? It is possible, yes. If your application is very read-heavy, if you have a very low amount of write requests and you prefer to maximize your throughput, yes, it's possible to just join all data into one segment and work with that. Okay, thank you.

Next question: is there any advantage to having segments of different sizes within the same shard or collection? Right, there is no advantage in this, because we prefer to have an even distribution. But in practice, it may happen, because segments are merged during the insertion of data, and you always need to have at least one segment which is not immutable, where we can put new information. So in practice it does happen, but we prefer to keep them as even as possible. Any other questions? Is the segment sizing configurable at runtime? The size, no. The number of segments, yes. Let's keep going, so we can get to the actual vector part.

All right, well, there is one slide about segment management though. The thing is, we have a lot of segments within the shard, and some of them are immutable, while others are used to insert new data. But how do we maintain the illusion for the user that the whole collection is fully mutable? In Qdrant, the user can insert, delete, or update any data at any time. Ideally, users should not even know that segments exist; it's a purely internal thing. In order to solve this, we actually need to solve two problems. The first is how to update data in an immutable data structure. The second is how to even obtain the immutable data structure in the first place. The first problem is solved by simply employing a copy-on-write mechanism: whenever a user inserts new data into, or changes data in, an immutable segment, we just copy this piece of data into the mutable segment, mark it as deleted in the old segment, and everything works.
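A minimal sketch of that copy-on-write update path, with deliberately simplified types (tombstone sets and an in-memory map are my assumptions for illustration, not Qdrant's actual storage):

```rust
// Sketch of the copy-on-write update described above: updating a point
// that lives in an immutable segment copies it into the appendable
// segment and tombstones the old copy. Illustrative only.

use std::collections::{HashMap, HashSet};

struct Segment {
    points: HashMap<u64, Vec<f32>>, // point id -> vector
    deleted: HashSet<u64>,          // tombstones for logically removed points
    appendable: bool,
}

fn update_point(segments: &mut [Segment], id: u64, new_vector: Vec<f32>) {
    // Mark the old copy as deleted wherever it currently lives.
    for seg in segments.iter_mut() {
        if seg.points.contains_key(&id) {
            seg.deleted.insert(id);
        }
    }
    // Write the new version into the appendable segment.
    let target = segments
        .iter_mut()
        .find(|s| s.appendable)
        .expect("a shard always keeps at least one appendable segment");
    target.deleted.remove(&id);
    target.points.insert(id, new_vector);
}

fn main() {
    let mut segments = vec![
        Segment {
            points: HashMap::from([(42, vec![0.1, 0.2])]),
            deleted: HashSet::new(),
            appendable: false, // immutable segment holding the old version
        },
        Segment { points: HashMap::new(), deleted: HashSet::new(), appendable: true },
    ];
    update_point(&mut segments, 42, vec![0.3, 0.4]);
    assert!(segments[0].deleted.contains(&42)); // old copy tombstoned
    assert!(segments[1].points.contains_key(&42)); // new copy written
}
```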
The second problem is a bit more complicated, because we need to perform long-running optimizations on a segment: index building takes quite a long time. That's why we need to keep the segment available for reads and updates from the user while, at the same time, building an index from it. To do this, we use a so-called proxy segment, a special type of segment that wraps, under one interface, the segment which is currently being optimized. It also holds a list of modifications it needs to apply to resolve the conflicts which happen when you copy data from the old segment into the new one; there's a special data structure which manages all these insertions. When the optimization is done, it converts back into regular segments, a pair of segments actually: the optimized one and a small copy-on-write segment.

Inside a segment, there is a bunch of abstract components. I'm intentionally not going to describe how exactly each of them works at a low level. Instead, I will focus mainly on the vector index. The reason is that the choice of concrete implementation just depends on configuration. For example, we have at least three different implementations of vector storage in Qdrant, and it's quite possible that in the near future we will add a fourth one. And it's not really important: Qdrant is able to work with any abstraction. It can be just files, it can be in-memory storage, it doesn't really matter. What does matter is the vector index.

So let's finally talk about the main component of vector search: the vector index. There are two main traits that distinguish a vector index from traditional indexes like an inverted index or a B-tree. The first is its approximate nature: it does not guarantee that the result will be exact, or even that the result will be the same for the same underlying data if you build the index multiple times, because the result actually depends on the order in which you insert data into the index. It's quite fundamental. The second is that any vector can be the result of any search request. In other words, any document in your collection is somewhat similar to any other document in your collection, so it's impossible to draw a clear line between relevant and irrelevant documents based solely on the vector similarity score. Of course, there are a lot of different approaches to implementing a vector index, but no matter which approach you choose, all of them have to deal with the properties I just described. This actually breaks many assumptions made in a traditional database, and therefore I believe it requires special treatment and a special, dedicated architecture around it.

In Qdrant, we use the so-called HNSW index, where HNSW stands for Hierarchical Navigable Small World. It's a pretty complicated name, but I will try to explain a very simplified version of it, just to provide some intuition about how it works. Internally, HNSW is a proximity graph. That means each vector is represented as a node in a graph, and those nodes are connected to some number of their closest neighbors, that is, to some number of other vectors in the database. Search in the proximity graph is performed in a greedy manner, meaning that on each step, we choose the node closest to the target and then repeat the search step from this newly selected node. The process is repeated until we can no longer improve the distance between the node and the target.
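Here is a minimal sketch of that greedy traversal, using dot product as the similarity, as mentioned earlier. It is a single-layer simplification for intuition (real HNSW keeps a candidate heap of beam-size entries and multiple layers), not Qdrant's implementation:

```rust
// Greedy traversal of a proximity graph, as just described.
// Simplified: single layer, beam size 1.

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn greedy_search(
    vectors: &[Vec<f32>],      // node id -> vector
    neighbors: &[Vec<usize>],  // node id -> ids of its closest neighbors
    entry: usize,
    target: &[f32],
) -> usize {
    let mut current = entry;
    let mut best = dot(&vectors[current], target);
    loop {
        // Move to the neighbor closest to the target, if any improves on us.
        let mut improved = false;
        for &n in &neighbors[current] {
            let score = dot(&vectors[n], target);
            if score > best {
                best = score;
                current = n;
                improved = true;
            }
        }
        if !improved {
            return current; // local optimum: the approximate nearest neighbor
        }
    }
}

fn main() {
    let vectors = vec![vec![1.0, 0.0], vec![0.7, 0.7], vec![0.0, 1.0]];
    let neighbors = vec![vec![1], vec![0, 2], vec![1]];
    let found = greedy_search(&vectors, &neighbors, 0, &[0.1, 0.9]);
    println!("closest node: {found}"); // expect node 2
}
```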
And of course, there are no guarantees that this search will find the absolute closest vector; that's why it's called approximate. But we can control the precision, trading precision against speed, by changing the beam size parameter of the search. Of course, the HNSW index brings its own challenges. First of all, building time. Insertion of a new vector into the index is approximately two times more expensive than just searching the index, and it's very CPU-intensive on its own. Therefore, if we want to index without affecting our other processes, that is, if we index and search at the same time, it's necessary to have a dedicated thread pool just for building the index in the background. Ideally, we might even want to move this index-building process to another machine completely.

The HNSW index is not only CPU-intensive, it also has a random data-access pattern. That means it's very sensitive to the latency of the underlying storage, and techniques like prefetching and block reading are not really efficient with HNSW. That's why it usually requires a lot of RAM and doesn't really work well from disk. Moreover, the access pattern is not only random, it's also sequential: remember, we go from one node in the graph to another. That means we cannot do efficient parallelization of the search, and its performance is mostly limited by the latency of the storage rather than by its throughput.

To overcome these challenges, in Qdrant we do the following. The solution is to use a compressed in-memory representation of the vectors and use it to generate a selection of candidates. For instance, one of the latest additions to the Qdrant engine is binary quantization. It allows us to compress a vector to the level where a single dimension is represented by just a single bit, which gives a total of 32x compression for the vector. On top of that, it basically allows us to compare two vectors in just two CPU instructions: bitwise XOR and popcount. It works especially efficiently with large vectors, like the ones produced by OpenAI models, for example, where we have around 1,500 dimensions in a single vector. After obtaining the list of candidates, we can rescore them using the original vectors and return the final result to the user. And it's important to note that this rescoring process, unlike the traversal of the HNSW graph, can be efficiently parallelized, since we already know all the offsets, all the IDs of the candidates. So we can leverage asynchronous I/O here and take advantage even of slow, network-mounted disks with huge latency.

So you said it's rescoring; you take the quantized vectors and then what do you do against the original vectors? Right, so we do the usual search with HNSW using the quantized representation. It's very fast, very memory-efficient. Then, after we obtain, let's say, the top 100 results, we do rescoring using the original representation. We do need to fetch those 100 vectors from disk, but we know all the offsets, so we just make parallel requests to disk with asynchronous I/O or something like this, and get all the original vectors pretty fast.
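A small sketch of the binary quantization comparison described above: one sign bit per dimension, compared with XOR plus popcount. This illustrates the technique, not Qdrant's code; using zero as the sign threshold is my simplifying assumption:

```rust
// Binary quantization scoring sketch: each dimension becomes one bit,
// and two quantized vectors are compared with XOR + popcount.

fn binarize(vector: &[f32]) -> Vec<u64> {
    // Pack the sign bit of each dimension into 64-bit words:
    // 32x smaller than the f32 components.
    let mut words = vec![0u64; (vector.len() + 63) / 64];
    for (i, &x) in vector.iter().enumerate() {
        if x > 0.0 {
            words[i / 64] |= 1 << (i % 64);
        }
    }
    words
}

fn hamming_distance(a: &[u64], b: &[u64]) -> u32 {
    // Two instructions per word on modern CPUs: XOR, then POPCNT.
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}

fn main() {
    let a: Vec<f32> = (0..1536).map(|i| if i % 3 == 0 { 1.0 } else { -1.0 }).collect();
    let b: Vec<f32> = (0..1536).map(|i| if i % 5 == 0 { 1.0 } else { -1.0 }).collect();
    let (qa, qb) = (binarize(&a), binarize(&b));
    // Lower Hamming distance means higher similarity; this cheap score
    // picks candidates that are then rescored with the original vectors.
    println!("hamming distance: {}", hamming_distance(&qa, &qb));
}
```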
Yeah, so you're compressing the vectors into the quantized form, which makes the index search faster, but you still have to do the final comparison to see what the true ranking should be? All right, there's a bunch of questions in the chat. The first question: is this equivalent to what DiskANN does? I don't know what that is. Yeah, DiskANN is a bit different; I don't think DiskANN does any kind of quantization. There are other implementations which do. So I don't think this approach is somehow revolutionary, but it comes packed into the box, where you can use it without any additional configuration. You don't need to pre-quantize your vectors; you just upload everything into Qdrant and it works out of the box. Qdrant takes care of the quantization itself, it takes care of the rescoring, and you get the result. The next question we'll hold off on until the end; they're asking, basically, how many vectors you can put on a single box. That's an arbitrary question; we'll come back to it later. Keep going.

Yeah, all right. Another challenge associated with HNSW is the requirement to combine vector search with additional filtering conditions. For example, you might want to search for some kind of item in an e-commerce store, and the price of this item should be less than $100. Or we want to search at a specific location, like near you: find something near me, stuff like this. Those additional conditions are really necessary for real-world applications, but a vanilla implementation of HNSW, or of any other ANN algorithm, just doesn't have this. In some publications, you might find that there are two ways to solve this problem: either post-filtering or pre-filtering. In post-filtering, it is suggested that we perform a regular vector search and then apply the filtering conditions on top of the result, excluding those results that do not match the filtering criteria. We might need to repeat this process several times until we obtain the required number of results. This approach is quite simple to implement, but unfortunately it's very inefficient, especially if the filtering condition is very strict or if it's correlated with the vector similarity score itself. This approach basically risks either turning the whole search into a linear scan or eventually returning incomplete results. The other approach is the other way around: pre-filtering. It suggests generating a list of candidates first, and performing the vector search based on this list of candidates. The problem with this is that generating the list of candidates might itself be a very expensive operation. In the worst case, it might require checking the condition for half of the vectors in the collection, and that can significantly increase the search latency on its own.

So the approach we use in Qdrant is what we call in-place filtering. That means the filtering condition is checked during the graph traversal. It did require us to make some custom adjustments to HNSW, so we don't use a vanilla implementation anymore; we use a custom implementation in Qdrant. But in this way, we can ensure that we only check the filtering condition as many times as is actually needed to perform the search.
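Building on the greedy-search sketch from earlier, here is the same traversal with an in-place filter: the condition is evaluated as the graph is walked, so filtered-out nodes are never entered. Again a simplified single-layer illustration, not Qdrant's implementation:

```rust
// In-place filtering sketch: the payload filter is checked during
// graph traversal, never as a separate pre- or post-filtering pass.

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn filtered_greedy_search(
    vectors: &[Vec<f32>],
    neighbors: &[Vec<usize>],
    entry: usize,
    target: &[f32],
    allowed: impl Fn(usize) -> bool, // payload filter, e.g. price < 100
) -> Option<usize> {
    if !allowed(entry) {
        return None; // a real implementation would pick another entry point
    }
    let mut current = entry;
    let mut best = dot(&vectors[current], target);
    loop {
        let mut improved = false;
        for &n in &neighbors[current] {
            // The filter is checked here, during traversal: we simply
            // do not step into nodes that fail the condition.
            if !allowed(n) {
                continue;
            }
            let score = dot(&vectors[n], target);
            if score > best {
                best = score;
                current = n;
                improved = true;
            }
        }
        if !improved {
            return Some(current);
        }
    }
}

fn main() {
    let vectors = vec![vec![1.0, 0.0], vec![0.7, 0.7], vec![0.0, 1.0]];
    let neighbors = vec![vec![1], vec![0, 2], vec![1]];
    let prices = [50.0, 150.0, 80.0];
    // Node 1 fails the filter (price >= 100), so the search cannot pass
    // through it: exactly the connectivity risk discussed next.
    let found =
        filtered_greedy_search(&vectors, &neighbors, 0, &[0.1, 0.9], |i| prices[i] < 100.0);
    println!("{found:?}"); // Some(0): node 2 is unreachable without node 1
}
```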
And it looks like the problem is solved, but unfortunately not. The problem arises when the filtering condition is so strict that the graph becomes disconnected. That means we can no longer find a path between the entry point of the graph and the region which actually contains the desired results. In mathematics, there is a whole field of study called percolation theory, which is dedicated to exactly this type of problem. For large random graphs, this theory gives us a surprisingly simple equation for how many nodes can be removed before the graph becomes disconnected: the critical fraction of remaining nodes is simply one over K, where K is the average number of connections per node in the graph. In other words, if we have 10 connections per node, then after removing 90% of the nodes, the graph will become disconnected. Let's see how it looks in practice. This plot shows the precision of the search in relation to the fraction of vectors being filtered out, and you can see that after a certain point, the accuracy drops almost to zero. There are several experiments which demonstrate that this effect indeed depends on the number of connections per node; it's very close to the theory.

So what can we do about it? To address this problem, we can leverage the fact that the filters we want to apply to the search are not actually random. In most cases, these filters are based on some metadata, or payload, associated with the vectors. As you remember, we have a price next to an item, and we know the price in advance. We not only know the price in advance, we also know which price is associated with which vector at index-building time, because we use immutable data structures. So what we can do is build additional links, generated from the existing payload values and the possible filtering conditions. For example, if we have a payload with a keyword field, we can build a subgraph for each value of this field and then merge these subgraphs into the main one. That's how we create the additional links, and these links ensure that when we apply a filtering condition on this keyword, our graph will always stay connected. No matter how strict the filter is, we can guarantee this.

A quick clarifying question: on the non-vector component, that metadata, is the search on that guaranteed to be exact, or is that also approximate? Yeah, if you are talking about this type of query, a combination of vector search and filtering: you are performing an approximate search within the condition. So the search still stays approximate, but we can guarantee that the conditions are satisfied. Got it. So at a high level, you're probably vectorizing those additional attributes themselves and just creating a gigantic linked index? No, the vector stays the same. We keep a payload next to the vector; it's stored in Qdrant. We perform the search, and at each step during the search, we check the condition, so we don't go into those nodes which are filtered out. So it sounds a little bit like post-filtering, right? You still do approximate nearest-neighbor search following the HNSW graph on the vector, and only on the additional metadata do you do an exact check. Am I getting that right? No, not exactly. The difference from post-filtering is that we apply the condition not after we generate the candidates, but during the search. So in this case, we can guarantee that we will return exactly the number of results requested, and all of these results will satisfy the filtering condition, because we constrain our search procedure with the condition and do not let the search process go into areas where the condition is not satisfied. Okay, thank you.
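As a sketch of the subgraph merging described above, here is how per-payload-value links might be folded into the main graph with deduplication. The types and the toy graph are my illustrative assumptions, not Qdrant's internals:

```rust
// Merging per-payload-value subgraph links into the main graph.
// Illustrative only.

use std::collections::{HashMap, HashSet};

type Graph = Vec<HashSet<usize>>; // node id -> neighbor set

fn merge_subgraph(main: &mut Graph, subgraph: &HashMap<usize, Vec<usize>>) {
    // Links already present in the main graph are deduplicated by the
    // HashSet, so only a fraction of extra memory is actually needed.
    for (&node, neighbors) in subgraph {
        for &n in neighbors {
            main[node].insert(n);
            main[n].insert(node);
        }
    }
}

fn main() {
    // Main graph built over all points.
    let mut graph: Graph = (0..4).map(|_| HashSet::new()).collect();
    graph[0].insert(1); graph[1].insert(0);
    graph[2].insert(3); graph[3].insert(2);

    // Points 0 and 2 share the payload value category = "shoes"; a
    // subgraph built only over those points yields links that keep the
    // graph connected when everything else is filtered out.
    let shoes_subgraph = HashMap::from([(0, vec![2])]);
    merge_subgraph(&mut graph, &shoes_subgraph);

    assert!(graph[0].contains(&2)); // extra link guarantees connectivity
}
```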
Sorry, a quick follow-up on that, I guess. Would quantization affect the filtering? Would we accidentally filter out something that could have been satisfied if we had the uncompressed vector? No, quantization only affects precision. It may affect the precision of the vector search itself; filtering is completely untouched. Okay, cool. Thank you. Any other questions so far?

So, we were talking about building the additional links in the graph. The good side of this approach is that it doesn't actually increase search complexity. Even though we increase the size of the graph, we can still perform the search using only the original links where possible, and utilize the extra links only when the original links are being filtered out. So the total complexity of the search is not affected, and we can index as many additional fields as we want. That's another point, actually: this approach is compatible with multiple fields being used for filtering at once, because when we do that, we merge all the subgraphs into the main one and can deduplicate links at this stage. So only a fraction of additional memory will actually be required for this, and search speed won't be affected.

Sid asks: does Qdrant do any cardinality estimation to determine query plans? Yes, it does. Unfortunately, I don't have slides for this, but I'm happy to answer those questions after the main topic. Awesome, thanks.

I had a quick question. If I had something like a customer ID, where I know those things are always going to be separated out, would you recommend using different collections, or is there something else you'd recommend, like a filter? Right, it's the most frequently asked question in our Discord community; by the way, feel free to join it, I answer questions there as well. What Qdrant can do is build the HNSW graph based only on the payload. So we can skip building the whole main graph over all points and only build the subgraphs for the specific user IDs. In this case, we still have good search performance within a user, right? It's all very fast. We still have the full-scan ability to look through the whole collection. And at the same time, we don't have to spend as many resources on building a vector index over all points together. So it's a solution which allows you to do a kind of semi-segregation of data: everything is still stored in one collection, so there's no overhead from creating many collections, and you are able to perform searches on subsets of your data.

Andrey, feel free to defer this question if you're going to talk about it later. Can you describe the complexity of creating this HNSW index, both as the number of records grows and as the dimensionality of each vector grows? Right, I will start from the last part. The dimensionality of the vector affects how fast you can compare a pair of vectors, and it's basically linear in this case: if you have, let's say, 1,000 dimensions, it will be two times slower than 500 dimensions. For the complexity of the graph search itself, and of building the graph, there is no exact estimation, but you can say it's approximately logarithmic. And the process of indexing actually involves search: in order to insert new data, you first need to search for its neighbors and then perform changes in the graph. So overall, yes, it's quite expensive to build large indexes. That's why there are some optimizations we are working on for this as well, and that's why you might want to have multiple shards and multiple segments, to limit the overall size of each segment. Yeah, that's a good point.
And what you just said: can these links cross segments or shards? It sounds like not across shards, but across segments? No, these links are isolated to one segment. One segment. Okay, thank you. So when we do a search, we basically ask each segment individually and then merge the results of the search. It's a similar approach to the one used in all the other systems, even text search engines; it's a pretty standard thing. But thank you for asking; maybe it's not clear for somebody who's not deep into this topic.

Sorry, can I ask a question about searching across segments? I was wondering, are the segments just random, or are they based on some sort of spatial partition, so that points in a segment are likely to be close to each other? How is that approached? Right, in our implementation, segments are completely random. Each time we insert a point, we just toss a coin and put it into the first segment which is appendable. We do not do any kind of clustering inside, mostly because the clustering actually depends on the type of vectors, it depends on the model, and we cannot make these assumptions in advance. It might be a good approach if you're building a dedicated system for some application where you know which encoding model you're going to use, but for a general-purpose system, it's unfortunately not really workable.

Hi, sorry. How are these expected filtering conditions determined? Are they predetermined, in a sort of prepared-query fashion? Yeah, so what we have in advance is the payload the user uploaded into Qdrant along with the vector. Usually it's in the form of just a JSON document, and the user can specify which fields should be indexed. Let's say we have a product in an e-commerce store, and we have a long text description, a name, a price, and a category tag, something like this. Usually only the price and the category tag are going to be used inside search filtering conditions, and that's why the user can specify that those two fields should be indexed. And once they are indexed, we know that they should be included in this process of building additional links in HNSW. Cool, thank you. Anything else? All right.

So, one important thing about this additional payload and the associated payload indexes is that the data type of the payload does matter, but in most cases it is possible to come up with a strategy to cover the filtering conditions with a subgraph. It's especially interesting when we're talking about, let's say, a numeric field, where we don't have a keyword which defines a strict subset of points. In this case, what we do in Qdrant is build subgraphs for overlapping intervals. We know which interval covers how many points, and we know the minimal threshold of how many points should be in a subgraph, so we build overlapping intervals and build these subgraphs with additional links for those intervals. We can also do the same with, let's say, location data, where we have geo-coordinates: we encode latitude and longitude into geohashes and basically build these additional subgraphs for overlapping geohash regions. The result of this is that we can close the gap in filtering precision from two sides. From one side, we close the gap by introducing these additional links, and from the other side, we increase the connectivity of the whole graph. In most practical use cases, the drop in precision is not noticeable next to the fundamental approximate nature of HNSW itself; it's just noise, not signal.
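One way such overlapping intervals could be derived is sketched below. The half-overlap step and the `min_points` parameter are my assumptions for illustration; the talk does not specify exactly how Qdrant picks the intervals:

```rust
// Covering a numeric payload field with overlapping intervals, each of
// which would get its own subgraph links. Illustrative sketch only.
// `min_points` is the minimal number of points a subgraph needs.

fn build_overlapping_intervals(mut values: Vec<f64>, min_points: usize) -> Vec<(f64, f64)> {
    values.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let mut intervals = Vec::new();
    // Each interval holds about min_points values, and consecutive
    // intervals overlap by half, so any reasonable range filter is
    // covered by some interval whose subgraph stays connected.
    let step = (min_points / 2).max(1);
    let mut start = 0;
    while start < values.len() {
        let end = (start + min_points).min(values.len()) - 1;
        intervals.push((values[start], values[end]));
        if end + 1 == values.len() {
            break;
        }
        start += step;
    }
    intervals
}

fn main() {
    let prices: Vec<f64> = (1..=20).map(|i| i as f64 * 10.0).collect();
    for (lo, hi) in build_overlapping_intervals(prices, 8) {
        println!("subgraph for price in [{lo}, {hi}]");
    }
}
```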
Well, actually, that's all the topics I have in the presentation. For those who just joined, here's a short summary of what I tried to cover. First of all, vector search is only getting started; I believe there are many more interesting use cases beyond just text search or memory for chatbots. The second point is that Qdrant is a search engine and should be treated as such. The architecture of search engines is fundamentally different from the architecture of databases, and that should be taken into consideration when you are designing your own application. And finally, the vector search index is a pretty specific component, and it requires special treatment even for such usual operations as filtering. What is out of scope of this presentation, but I'm happy to answer questions about it, even beyond this talk, is query planning and payload indexes: how we organize them inside Qdrant. Another interesting topic is the dynamic search limit per segment. As I mentioned, we have multiple segments, and in order to get the whole search result, we need to go to each segment, ask for some number of results in each of them, and then combine. But if we search with a very large limit, that might be expensive to do, and we have some optimizations for this. And there's more on quantization: we currently support three kinds of quantization: binary quantization, the most recent one, scalar quantization, and product quantization. Yeah, that would be all. Thank you, and we are happy to answer questions.

Awesome. I will applaud on behalf of everyone else. If you have any questions, again, unmute yourself. We have about nine minutes left. Go for it. Maybe I'll ask a question. Andrey, that was a really nice talk. There are two different routes people are taking to put vector search into data-rich applications. One is where you start with something relational, like pgvector, and you put it in Postgres. The advantage there is that if your application needs to also search on these additional attributes, regular columns, you can get very deterministic semantics on the search for that type of metadata. And of course, the disadvantage is that on the vector search side, you can't build these quote-unquote global indexes like HNSW. Do you have some thoughts on which kinds of applications, from your perspective, fit which style of combining embedding vectors with relational search? Or do you think one approach will emerge that basically captures everything?

So, regarding Postgres and relational databases, it's what I was talking about: the difference between BASE and ACID types of systems. Postgres, as I mentioned, is a typical ACID database. It has very strict transactional guarantees, it has very strict consistency, but the scalability of Postgres is very limited at the same time. So in your application, you should decide what is the most important thing you want to do with your vectors. A vector search engine or vector database is probably not a good choice if you're building some kind of banking system where you have money transfers, right? At the same time, maybe Elasticsearch... Underneath that, I have a question. What if I have an application in which I need to do both relational search and vector search in the same application? Well, by relational search, do you mean some kind of joins? Not even joins. Imagine I have a big table.
I want to search on different columns in different ways, depending on the user query, and I've got a vector column that is part of that search. So my search may involve finding everything where the student ID is in this range and the age is in this range, and then there is some vector representation of maybe some descriptive text about them that I also want to search on. I want to mix exact search and approximate search together in my end application. Well, yeah, that's exactly what these filterable queries in Qdrant are doing. And the answer for this is: it depends on whether you are satisfied with the scalability limits which exist in Postgres. In most cases, I think Postgres will just do a full scan if you try to combine these relational filters with the semantic component. So if you are satisfied with the scalability of Postgres, yeah, you can do that. But I also mentioned that search engines are rarely the source of truth in your system. So what usually happens is that you have your Postgres, and you also have some additional indexing engines. You can have Elasticsearch for text, you can have Qdrant for vector search; they may not even hold very up-to-date data, but they can perform specific types of queries very, very fast and scalably. And you can still have Postgres in your system as well. So they're not mutually exclusive. Thank you.

Avery, you want to ask a question? Yeah. I was curious about this addition-of-new-links thing. How do you decide how many new links to add? And also, in practice, how much does this blow up the graph? I was kind of curious because if you have a predicate on any attribute with even a relatively high number of values, like 100 values, let's say, you would end up adding a lot of new edges in order to make it work. Yeah. [inaudible aside] All right. So, we allow users to configure this value, the number of additional links, in the configuration of the collection. But by default, we just use the same amount as the original graph has. And overall, it's not usually a big problem, because, as I mentioned, those links are being deduplicated. So if some link already exists in the original graph, it won't be explicitly added on top of this after the merge process. That's the first thing. Second thing: overall, HNSW has a special heuristic which allows it to exclude redundant links. So even if you try to configure HNSW with, let's say, 1,000 neighbors for each node, it won't actually build a graph with 1,000 neighbors, because this heuristic will just prune the redundancy. It's not included in my slides, but it exists there; just for simplicity, I tried to avoid these very deep descriptions of HNSW. Thanks.
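The redundancy-pruning heuristic mentioned here is a standard part of the HNSW algorithm from the original paper: a candidate neighbor is skipped if it is closer to an already-selected neighbor than to the point being inserted. A minimal sketch under that assumption, not Qdrant's implementation:

```rust
// HNSW neighbor-selection heuristic sketch: prune candidates that are
// closer to an already-chosen neighbor than to the new point, so even
// a very large connection limit does not produce redundant links.

fn dist(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f32>().sqrt()
}

fn select_neighbors(
    point: &[f32],
    mut candidates: Vec<Vec<f32>>, // nearest candidates, to be pruned
    m: usize,                      // configured max connections
) -> Vec<Vec<f32>> {
    // Consider candidates from closest to farthest.
    candidates.sort_by(|a, b| dist(point, a).partial_cmp(&dist(point, b)).unwrap());
    let mut selected: Vec<Vec<f32>> = Vec::new();
    for cand in candidates {
        // Skip a candidate dominated by an already-selected neighbor.
        let redundant = selected.iter().any(|s| dist(&cand, s) < dist(&cand, point));
        if !redundant {
            selected.push(cand);
            if selected.len() == m {
                break;
            }
        }
    }
    selected
}

fn main() {
    let point = vec![0.0, 0.0];
    let candidates = vec![vec![1.0, 0.0], vec![1.1, 0.1], vec![0.0, 1.0]];
    // [1.1, 0.1] is closer to [1.0, 0.0] than to the point itself,
    // so it is pruned as redundant even though m would allow it.
    let selected = select_neighbors(&point, candidates, 8);
    println!("{selected:?}"); // keeps [1.0, 0.0] and [0.0, 1.0]
}
```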
Avery, you want to go next? Yeah, sure. So I want to ask how you think HNSW plus quantization compares with DiskANN, because, as we all know, DiskANN also has some variants that involve filtering. Sorry, I didn't get the question. She's asking how your approach with HNSW plus the quantization compares with Microsoft's DiskANN. Yeah, all right, thanks. So, as far as I understand the DiskANN implementation, it doesn't use quantization in the first place. The idea of DiskANN is that you have a data structure similar to HNSW, but without any hierarchy. So it's a plain graph, much like the one I have on my slides here. And the difference is not exactly in how you search in this graph, but rather in how you build it. HNSW assumes that when you insert a new point, you just create new links for this point in the existing graph. The DiskANN implementation, on the other hand, works in the other direction: it builds a fully connected graph and then prunes it. I may be wrong here, but as I understood it, DiskANN is something like this. So DiskANN doesn't use any kind of quantization, and it doesn't have in-memory storage. And actually, if you think about it, there is no limitation on which kind of indexing algorithm we can use here: we could use this quantization-plus-oversampling approach with the DiskANN approach as well. It's something on top of the vector index; it's agnostic to the vector index type.

All right, Alessandro has a question. Yes, first of all, thank you for your talk. My question is from the point of view of a possible user of your application, so not too much about the technical details of the engine. I can get it for images and maybe text, but do you have experience with time-series data? Maybe to check similarity within time series, and what are the challenges in this field, if there are any? Yeah. So, well, in Qdrant we do not exactly work with producing embeddings; Qdrant assumes that the embeddings are already made by some external model. So the short answer to your question is: if you have a model which can translate time-series data into a vector, then it's going to work with Qdrant as well. A bit longer answer to this question is probably going to depend on what type of time-series data you have and how you can work with it. The most common thing which I have experience with, and which is a bit closer to time series than just text, is user events, or user behavior patterns. For example, if you have a history of user transactions in, like, a banking system or something like this, you can build a model which tries to predict the next transaction, and it can do this prediction through the vector embedding. So basically, you will have a model which can represent a series of actions as a vector, and an event in this system will also be a vector. It's very close to, for example, how Word2Vec works: Word2Vec tries to predict the next word in a sentence, and the same way, we can train a model which can predict the next event in a sequence. So we can use the vectors from this model in Qdrant, in vector search. Okay, thank you.

One last question. This is the question I'm going to ask all the speakers this semester. In your opinion, what's the biggest unsolved problem you face in your system? Like, if you had a magic wand to fix one thing, what would that be? In our system: making it cheap to serve a billion vectors. It's the one milestone for our engineering team, I guess, for the next year. So right now it's quite expensive. Even with this binary quantization and stuff like this, you still need to have a lot of memory, and there are a lot of problems with this random access and so on. So our main challenge is to make it cheap.