The Databases for Machine Learning and Machine Learning for Databases seminar series at Carnegie Mellon University is recorded in front of a live studio audience. Funding for this program is made possible by Google and by contributions from viewers like you. Thank you.

Welcome, everyone, we're here for another talk in the series. We're excited today to have Lee Liu. He's a principal software engineer at Zilliz, where he works on and helps build Milvus, a vector database system. As always, if you have a question for Lee as he's giving his talk, please unmute yourself, say who you are, and fire your question off at any time. We appreciate him calling in from China to give this talk, where it's currently 5:30 a.m. It's not the record; I think the latest we've ever had was somebody in India at, I think, 3:30 a.m. But still, it's very early for him, and we appreciate him getting up for us. So again, Lee, thank you so much for being here. The floor is yours. Go for it.

Okay. So, hello. First I want to express my sincere...

Yep, actually, wait, this is terrible, I forgot to mention this: Lee is also a distinguished senior alum. So we appreciate him coming back to talk to us.

Thank you. Thank you. So, yeah. Okay, I'll just start. First, I want to express my sincere gratitude to Andy for inviting me to revisit CMU. I spent quite a happy time there years before, so I'm glad that I can share Milvus's technology as well as our industry insights with all of you today. Milvus is one of the world's most famous vector databases, I'll say, and the most popular open-source vector database, with a super active community. It is actually the world's first vector database, with a five-year history. Yeah, five years; it's quite a long time. Well, it probably doesn't sound like a very long time compared with the history of MySQL and HBase, that kind of thing. This is because the vector database is a brand-new field, so many designs and concepts in the vector database area have not been thoroughly tested yet. Always keep this in mind; on the other hand, it implies that there is still a lot of work to be done and plenty of opportunities. So in today's talk, I would like to discuss the design principles within Milvus's architecture, and I hope it will serve as both informative and inspiring. And something about myself: I'm a principal engineer from Zilliz.

Okay, before digging into the specific content, I'd like to leave you with a question. What is the relationship between traditional search and vector search, and what is the relationship between a traditional database and a vector database? I'll put a video here as a hint, and we will revisit these two questions toward the conclusion of this talk.

Sorry, is this just... is there supposed to be audio? No, it's just a video, I'm sorry, I didn't put any audio in. So this time it gets to the pool and tries to get the glasses, because it is pretty cool. And it gets the glasses. All right, that is just a small video to start everything up.

Well, we'll cover three aspects in this presentation. First, I will give a brief introduction of Milvus's general architecture, where you may notice the uniqueness of Milvus's architecture compared to other similar products. And then I will dig into the design of an important component of Milvus: the write path.
This shows how Milvus is designed with a focus on machine learning. Last but not least, I want to discuss our challenges and solutions in the broader context of the machine learning boom: not only from the perspective of database design for machine learning, but also how the machine learning boom can promote the vector database.

All right, let's start with the introduction to Milvus. Before talking about the vector database, let me first introduce the concept of vector search. Previous talks have already covered this, so, basically: vector search is about finding similar vectors, given a query vector. For example, the k-NN algorithm is one of the most famous vector search algorithms. And now let's get more practical and think about scenarios of how vector search is applied in our daily life. Image search is a classical example. Just like one day I came across a picture of this memorable landmark, Walking to the Sky. I would like to find more pictures of it from different perspectives, but the name "Walking to the Sky" was on the tip of my tongue. So I simply dragged this picture into Google, searched it, and got so many pictures of Walking to the Sky. Behind the scenes, it functions as follows. First of all, a deep learning model extracts the embedding of the first Walking to the Sky image as its representation vector. Then we search for similar vectors in the database. Vector similarity often represents the similarity of the corresponding unstructured data behind it; in this case, the similarity of the images. That's how we got the similar Walking to the Sky pictures. And actually, besides images, vector search is applied to various types of unstructured data like audio, documents, videos, and more.

All right, so when we talk about vector search, we cannot do so without mentioning Faiss. This is a library that Facebook open-sourced in 2017, with many sorts of vector search algorithms inside it. So here's a big question: why on earth do we need a vector database like Milvus that is also focused on search, when we have already got something like Faiss? They seem to do the same thing. So let me try to break this down for you. First of all, Faiss doesn't support delete and update, this kind of thing, which I believe are very common operations for any database. And when things go wrong, when the system fails, Faiss will be left hanging: it doesn't have the persistence to pick up the pieces. And Faiss, acting purely as a library, is not built to handle a huge pile of data, since it doesn't support complex distributed deployment. So besides this, we still have so many other things we need for a production database or its use cases, like resource management, monitoring, data backup, and so on.
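(To make the basic operation concrete before moving on: this is what the brute-force k-NN primitive that libraries like Faiss optimize looks like, as a minimal numpy sketch. The data and function name are illustrative; this is not Milvus or Faiss code.)

```python
import numpy as np

def knn_search(vectors, query, k=5):
    # brute-force k-NN: compute the L2 distance from the query to every
    # stored vector, then keep the k closest ones
    dists = np.linalg.norm(vectors - query, axis=1)
    idx = np.argsort(dists)[:k]
    return idx, dists[idx]

# toy usage: 10,000 random 128-dim "embeddings" and one query
vectors = np.random.rand(10_000, 128).astype(np.float32)
query = np.random.rand(128).astype(np.float32)
print(knn_search(vectors, query, k=3))
```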
Okay, you may have noticed that a vector database actually consists of vector search algorithms plus database features; Faiss is to Milvus something like, you know, InnoDB is to MySQL. So how does Milvus combine the algorithms and the database features, and what makes it special? First, Milvus's distributed architecture allows it to handle a huge amount of data. Then, Milvus has a multi-level insertion structure that provides the capability to balance data freshness and efficiency at the same time; I will introduce more about this later. Also, Milvus is equipped with many data-driven optimizations that tune performance based on how the data is spread out. And finally, batch processing capabilities are supported to enable us to operate on massive data and do some global optimization.

Quick question: what's the big vision? Because you have fast ingestion and you do deletes and updates, do you see Milvus as being the database of record? Or do you see it as, like, an Elasticsearch kind of thing, where you have the primary database, then you stream updates into Milvus, and then you do all your vector searches on that?

We have both kinds of usage. I saw a lot of usage where you have a database and you're streaming your data into Milvus for search, but Milvus sometimes also gets used to support this kind of search by itself, without a backing database. So it's not only a search-engine-style database.

Okay, awesome. Thanks.

All right. Okay. So Milvus's latest version is 2.3.2. Back in Milvus 1.0, it was actually a single-node architecture; it simply added some basic database features to a vector search algorithm like Faiss. Milvus 1.0 mainly included four main modules: proxy, storage, index, and query. It could handle tens of millions of vectors very easily, which perfectly satisfied the requirements we received at that time. But luckily, we were able to foresee the huge amount of data we've got today and got prepared earlier. Two years ago, we began the shift to a distributed architecture. Each of the four modules was pulled out and transformed into an individual distributed module with a master-slave pattern. Then we introduced a message queue into Milvus, like Kafka or Pulsar, to decouple the modules. And now we've got Milvus 2.0, our current architecture.

When we insert data, it first goes through the proxy and quickly gets into the message queue. A data node picks it up from there, does some segmentation and storage work and so on, and then puts it into the object storage. Next, an index node takes the data generated by the data node, starts building an index, a vector search index, and puts it back into the object storage. Meanwhile, the query node pulls the latest data from the message queue to provide real-time query support. When we do a search, the index built by the index node is loaded by the query node from the object storage to local disk or memory, depending on what kind of index we are using, along with some real-time data pulled from the message queue mentioned before, to provide the query service.

In this design, I want to highlight some key points. First, we separate storage and computation to boost flexibility. Then, the whole cluster is microservices, managed by Kubernetes for automatic deployment and management. Finally, the message queue helps decouple all the stateless components. And I'd like to go a bit deeper into the separation of storage and computation. This is a very classical topic inside database theory, actually, but due to its complexity, most systems, including most vector databases, haven't implemented it very well.
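(For a sense of what the client side of this write path looks like, here is a minimal sketch using the pymilvus Python SDK. The host, port, collection name, and dimension are placeholders; a sketch under those assumptions, not the canonical setup.)

```python
from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType
import numpy as np

# connect to the proxy, which is the front door of the write path described above
connections.connect(host="localhost", port="19530")

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768),
]
collection = Collection("demo", CollectionSchema(fields))

# insert() returns quickly once the proxy hands the rows to the message queue;
# data nodes and index nodes pick them up asynchronously from there
collection.insert([np.random.rand(1000, 768).tolist()])
collection.flush()  # ask Milvus to seal and persist the pending segments
```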
It took several years of hard work to finally go distributed in the official release of 2.0, but it pays off, since we've got plenty of secret weapons for handling huge data in machine learning areas. First, I would say, for example, query, index building, and storage can be scaled independently to meet different needs in different scenarios. The second one is, you may notice that the different roles need different types of resources. In our case, the proxy and data nodes are IO-bound, and the query node is CPU- and memory-bound because of the heavy vector distance calculation. The index node is also a heavy CPU usage scenario. Allocating different resources to different roles can significantly improve cluster efficiency and reduce cost. Also, an upgrade, failure recovery, or heavy usage of a specific role will not affect the normal operation of other components, which improves the maintainability and robustness of the whole cluster. Finally, independent and stateless index nodes and data nodes can be pooled. This can exploit the different usage hotspots of different users to improve resource utilization and index building speed. This is very suitable for cloud services, and it is currently adopted by Zilliz Cloud, which is based on Milvus.

This is a high-level overview of Milvus's architecture and main pathways. In the next session, I would like to dive into the write pathway to show you more details about Milvus. So yeah, this is the write pathway. When we come to the design details of the vector database, vector search algorithms are an inevitable topic, because many design decisions are made regarding the attributes of the algorithms. Vector search algorithms are the heart of a vector database; in Milvus, they consume over 80% of the CPU usage. Unlike a traditional database that performs deterministic search, which means it has to be 100% accurate, the main feature of a vector database, vector search, is probabilistic. I will repeat these two words so many times within this talk. This means that most of the time a vector database doesn't require the absolute top-K nearest results. Instead, we can trade precision for higher performance.

In this picture, the left side shows Knowhere, Milvus's vector search engine. It is a pluggable adapter that supports various algorithms, including the Faiss series, ScaNN from Faiss, the GPU index from NVIDIA RAFT, and other libraries like these. From the algorithm perspective, they can be roughly divided into three different categories. The first one is brute-force search. I didn't put it on the left graph, but it's very important in scenarios that require very high precision, or for some real-time data search, because we don't need any build time for it. The second category is IVF. The main idea here is to split the vectors into buckets and speed up searches by ignoring less likely buckets. And the third type is the graph-based algorithms. They are the best choice if you need both high precision and speed. We'll cover the basics of these algorithms next.

Just to be clear, so Milvus can support all of these categories. So when I start using the system and I load in, say, a table, do I specify what indexes I want to build, or do you guys build all of them? How do I decide?

You have an API called create index, and you can specify what kind of index you want to create and what the config parameters should be.
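(The create index API just mentioned looks roughly like this in pymilvus, continuing the sketch from earlier; the HNSW parameter values are illustrative, not a recommendation.)

```python
# assumes the "demo" collection from the earlier sketch; index type and
# parameters are up to the user -- here an HNSW graph index
collection.create_index(
    field_name="embedding",
    index_params={
        "index_type": "HNSW",
        "metric_type": "L2",
        "params": {"M": 16, "efConstruction": 200},
    },
)
collection.load()  # load the indexed segments into query nodes for search
```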
And then in the hosted version of Milvus, the cloud version, since you see people creating the indexes, can you say what the distribution is? Like, are most people picking HNSW, or just choosing the default? And if HNSW is the default, what's the most common index?

I'll answer in two aspects. First, from the cloud perspective, we support something called AutoIndex, because we want to cover up the complexity of the decision-making of algorithm picking. Behind AutoIndex is a self-developed algorithm, a graph-based algorithm, which we don't open source. In Milvus we have so many different kinds of algorithms and let people pick, and in our experience from the open source community, HNSW and the IVF series are the most popular ones.

Okay. And then, but in the cloud version, you guys have something that looks like HNSW but is proprietary?

Yes.

Okay. All right.

Okay. First, let's start with the classical IVF algorithm. During the index building stage, we sample and cluster the data set to create some buckets. We usually do this with a k-means algorithm. Next, we assign each vector to its closest bucket. When it comes to search, we first identify the bucket nearest to the query vector, then we search within that bucket for the results we need. Yeah, that is IVF. Another category is the graph-based algorithms, which are very popular and widely used nowadays. In these algorithms, each vector is treated as a node, and the nodes are connected with edges. The process of building an index mainly involves connecting these nodes. There are various graph algorithms, and each one has its unique solutions to this building process, so we will skip it for now. During the search phase, we start from an entry point and add its neighbors to the candidate set. Then we find the closest point to the query vector from the candidates and repeat the process again and again. The right side of the screen shows an example of a search process.

After introducing the various algorithms, we have to talk about how we use them in actual applications. There are various aspects to evaluating an index, such as build time, accuracy, performance, resource usage, and so on. Here I mainly focus on two basic metrics: build time, and performance as represented by QPS. So here's a table. This table includes FLAT, which is brute-force search; IVF_FLAT, which we just introduced; ScaNN, which is an IVF-flat-style algorithm with some compression and SIMD acceleration; and HNSW, the most commonly used graph-based index. You can see that we need to spend more time building an index to get better query efficiency. I like to call this the trade-off between data freshness and efficiency: it is difficult to ensure both within a single algorithm. So currently, the majority of vector databases are using HNSW as the main index, sacrificing data freshness in exchange for efficiency.
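(To tie the IVF description above to something concrete, here is a toy build-and-search sketch in numpy. Random centroid seeding and squared-L2 distances are used for brevity; real IVF implementations such as Faiss's are far more careful.)

```python
import numpy as np

def build_ivf(data, n_buckets=16, n_iters=10):
    # k-means clustering of the data set to create bucket centroids
    centroids = data[np.random.choice(len(data), n_buckets, replace=False)].copy()
    for _ in range(n_iters):
        assign = np.argmin(((data[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
        for b in range(n_buckets):
            members = data[assign == b]
            if len(members) > 0:
                centroids[b] = members.mean(axis=0)
    # final assignment of every vector to its closest bucket
    assign = np.argmin(((data[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
    buckets = [np.where(assign == b)[0] for b in range(n_buckets)]
    return centroids, buckets

def ivf_search(data, centroids, buckets, query, k=5, nprobe=2):
    # probe only the nprobe closest buckets, then brute-force inside them
    nearest = np.argsort(((centroids - query) ** 2).sum(-1))[:nprobe]
    cand = np.concatenate([buckets[b] for b in nearest])
    dists = ((data[cand] - query) ** 2).sum(-1)
    return cand[np.argsort(dists)[:k]]

data = np.random.rand(5000, 64).astype(np.float32)
centroids, buckets = build_ivf(data)
print(ivf_search(data, centroids, buckets, np.random.rand(64), k=3))
```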
So here's a question. Given that a single algorithm cannot achieve both, as mentioned, the data freshness and the efficiency, is there any chance to do it within a more complicated system? Yeah, this is what we're doing. Let's take a look at how Milvus tries to solve this problem. First, I need to introduce the data structure types inside Milvus. Under each collection, a table in some other databases' concepts, we have a layer called the shard. Each shard can read data from the message queue at the same time to speed up the data inflow. Next is the segment. The segment is the smallest data structure unit inside Milvus; we build one of the indexes we mentioned above for each single segment. There are two types of segments. The first is the growing segment. The query node reads data directly from the message queue and generates this growing segment. We usually use the FLAT index here to ensure the speed of insertion; the growing segment is there to provide real-time query capability and ensure data freshness. The other is the sealed segment. When data in a segment grows to a certain extent, the data node will seal the segment to make it immutable. Then it gets handed off to the index node, as mentioned before, to create an index that provides more efficient queries. The image on the right shows the general structure of what I talked about. After the sealed segment is indexed by the index node, it gets loaded into the query node and replaces the growing segment to provide service. At the same time, a new growing segment is generated for this shard to continually support data freshness.

This complex structure brings benefits, but it also brings many challenges. For example, how should we define the size of a segment? Should it be very large? Larger segments introduce challenges in distributed scheduling and failure recovery, because any segment transfer between nodes will be super expensive; it takes forever. More importantly, they can make queries incredibly slow. Let me introduce how Milvus does a query with the multiple layers of data structures mentioned before. Milvus queries require three layers of reduce operations. The first reduce is at the query node level: multiple top-K results obtained from different segments on the same query node need to be merged into one result here. The second reduce happens at the shard level: since segments within a shard can be distributed across different query nodes, the results produced by the query nodes need to be transferred to the shard leader, which is also one of the query nodes, for another round of reduction. The final reduce takes place at the proxy level: results from multiple shards are combined here before returning to the client.

The graph on the right shows the relationship between index building time and segment size for HNSW, the most commonly used index. Larger segments result in longer index building time, and this will lead to an accumulation of the growing segments mentioned before. So remember, HNSW search speed is about 500 times faster than brute-force search, so the growing segments could slow everything down and affect the entire process significantly. So how do we solve this? Should we make the segments smaller? Okay, let's make a guess: which is faster in the search operation, a larger segment or many smaller segments? The chart below shows the search speed of HNSW at different sizes, ranging from 0.25 million to 1 million vectors with 768 dimensions. Maybe it's counterintuitive: there is almost no visible change in per-segment performance. It means that with a constant total amount of data, each additional segment makes our search slower. In addition, since each segment requires its own metadata, having too many small segments can greatly increase the pressure on the metadata storage, which is etcd in the Milvus world. Therefore, it seems that very small segments won't work either. So what should we do then?
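(The three reduce layers are all doing the same basic merge. Here is a minimal sketch of that merge step, with toy data; the actual Milvus reducers of course also handle metric direction, offsets, and so on.)

```python
import heapq

def reduce_topk(partial_results, k):
    """Merge several per-segment top-k lists of (distance, id) pairs into one
    global top-k. The same merge is applied at the query node level, at the
    shard leader, and finally at the proxy."""
    return heapq.nsmallest(k, (hit for part in partial_results for hit in part))

# toy usage: three segments each return their local top-3
seg_a = [(0.12, 7), (0.30, 2), (0.55, 9)]
seg_b = [(0.08, 4), (0.40, 1), (0.41, 3)]
seg_c = [(0.25, 8), (0.26, 5), (0.90, 6)]
print(reduce_topk([seg_a, seg_b, seg_c], k=3))  # [(0.08, 4), (0.12, 7), (0.25, 8)]
```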
Wait, just to be clear, the growing segment is a flat index, right? Not HNSW?

Yes, because HNSW building is super slow, and the growing segment needs to serve data freshness, so it's FLAT.

Okay, so your diagram here, your table, looks like the graph structure of HNSW.

Oh, the graph is trying to explain that no matter how big your graph is, the search time remains almost the same.

Sorry, for HNSW or for FLAT?

For HNSW; it's a graph.

Okay, okay. Just to be clear, the small growing segment is the flat index, though.

Yeah. And there's one more thing behind it: small growing segments lead to small sealed segments.

Got it. Okay, all right, thank you.

Okay, so we prefer small segments when inserting data, and larger segments once the index is built, as mentioned before, because the brute-force part needs to be as small as possible, and the HNSW part is the kind of thing we need to make as large as possible. So why not just build small indexes first, and then merge them together? Basically, this is what we're doing. This is the compaction mechanism inside Milvus. It seals segments before they get too big during the insertion process to ensure short build times, and the data node actively merges smaller segments together, which it then passes to the index node to build an index. Finally, the query node loads this bigger index and replaces the indexes of the small segments that were merged. Now we have larger segments to support more efficient search.

It seems like we have solved this problem, but what if the index node is super busy? For example, when resources are limited, or when inserts occur rapidly. So this is about the growing index. Still remember this table? Our choices are not only limited to brute-force search and graph algorithms. What about IVF and ScaNN as a middle solution for growing segments? The indexing time for ScaNN is less than one-fifth of HNSW's, but its performance is 200 times that of brute-force search. To adapt this idea, Milvus supports using the initial part of the data inside the growing segment as a sample for clustering, as mentioned in the algorithm part before, to build the buckets; then all subsequently inserted data in this growing segment is simply placed into its corresponding bucket, so it is super fast.

You might still be concerned about whether the build speed is fast enough or not. This chart shows the relationship between the amount of inserted data and time, with three lines representing brute-force search, IVF, and ScaNN. The bump in the middle of the lines indicates the start of the clustering. Due to system overheads, such as the message queue, writes and reads, and network communication, there's actually no significant difference between the three different solutions. This means no insert speed degradation, but 200 times faster search in growing segments. This is a good way to go. In addition, this design ensures that the final index in Milvus is immutable. Therefore, we can choose the most suitable optimization strategies based on the data distribution during index building time, such as some compression and pruning strategies. And this holds even if the sealed index is IVF or ScaNN, the same type as the growing one: because in the growing segment we only use the initial vectors as the clustering sample, the quality of the index gets affected, so re-indexing the immutable segment is still very necessary.
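(Here is a toy sketch of that growing-index trick: cluster only the initial sample, then route every later insert straight into its nearest bucket so that insert cost stays flat. It uses random centroid seeding where the talk describes k-means on the initial data; an illustration, not Milvus code.)

```python
import numpy as np

class GrowingIVFSegment:
    """Toy growing segment with an incremental IVF-style index."""

    def __init__(self, initial_sample, n_buckets=16):
        # the talk describes k-means over the initial sample; we cheat and take
        # random sample points as centroids to keep the sketch short
        idx = np.random.choice(len(initial_sample), n_buckets, replace=False)
        self.centroids = initial_sample[idx].copy()
        self.buckets = [[] for _ in range(n_buckets)]
        for v in initial_sample:
            self.insert(v)

    def insert(self, v):
        # no re-clustering on insert: one distance scan over the centroids,
        # then drop the vector into its closest bucket
        b = int(np.argmin(((self.centroids - v) ** 2).sum(-1)))
        self.buckets[b].append(v)

seg = GrowingIVFSegment(np.random.rand(1000, 64).astype(np.float32))
seg.insert(np.random.rand(64).astype(np.float32))  # constant, predictable cost
```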
Well, it sounds like we have solved the issues of both data freshness and efficiency, balanced the segment size, and accelerated the growing segment. So, anything else? In a typical vector database application scenario, unstructured data is transformed into embeddings through a model, and then inserted into Milvus to provide search capability. However, as models are frequently iterated, vectors need to be regenerated, resulting in batches of offline vectors that need to be re-imported into Milvus. So how does Milvus deal with this offline import scenario? In addition to supporting online scenarios with the streaming insertion mentioned before, Milvus also supports offline scenarios through batch insertion, mainly through three pathways. The first is that Milvus allows a direct transfer of raw data to the object storage. This helps skip the complex insertion process: from proxy to message queue, then being read out by the data node, getting partitioned, and then written into the object storage, all this kind of stuff. Also, this approach can bypass the issues associated with compaction and growing segments. The index node can directly read the data to build the index, and it can greatly improve the efficiency of iteration.

Is this saying that you could have some Spark job that is ingesting data from something else, and then you have it write out to S3 or whatever object store you're using, and it's writing out into a Milvus-specific file format? Or is it just, like, Parquet files or something?

It's a Parquet file. So you put a Parquet file there, and the index node will do the rest for you.

The second pathway is Spark: data can be imported into Milvus with Spark, so Spark users can define some data preprocessing tasks, such as extracting embeddings from unstructured data, data filtering, and so on. In the third pathway, data can be batch exported from Milvus to Spark, processed, and then imported back into Milvus. We can perform some optimization based on the global distribution of the data. For example, Zilliz Cloud will periodically export all vectors from Milvus, and, remember the IVF index we mentioned multiple times before, in Spark we will perform a global IVF indexing and make each segment a bucket, and then import the data back into Milvus. During search, we can skip most of the segments based on their distance from the query point, so this achieves a performance improvement.
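(For the first pathway, here is a hedged sketch of what the client side can look like through pymilvus's bulk-insert utility. The bucket path and collection name are placeholders, and the Parquet file is assumed to already sit in the object storage bucket Milvus uses.)

```python
from pymilvus import connections, utility

connections.connect(host="localhost", port="19530")

# the Parquet file must already be in Milvus's object storage bucket;
# the data/index nodes read it directly, skipping proxy and message queue
task_id = utility.do_bulk_insert(
    collection_name="demo",
    files=["demo/batch_0.parquet"],  # path is relative to the bucket
)
print(utility.get_bulk_insert_state(task_id))  # poll until the import completes
```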
Now we have finished the introduction of Milvus's data writing path, and I would like to briefly review the features we've discussed that support machine learning. A simple machine learning pipeline can be roughly divided into offline data preparation, model training, and online model deployment and inference. From the offline perspective, a typical use case is data mining through similarity search. Let's take autonomous driving as an example. Imagine an autonomous driving vehicle encounters a bear on the road, and it is unable to recognize what it is and unable to take the proper actions. So it might approach closely and get the bear pissed off, which is quite dangerous. To make sure we don't mess up next time we see a bear on the road, we need to first improve the capability of our perception models in fine-grained classification to recognize the bear. To achieve this, we need to add images of bears crossing the street to our training data set. As you know, we don't see bears crossing the street every day; it's a little bit rare. So rare data like this requires specific data mining. A commonly used approach is to extract the embedding of the bear image, then search for images with similar embeddings in a database like Milvus, which usually contains thousands of hours of driving records. Data mining and model training are always accomplished with very large data sets; for example, Tesla's training data set is in the tens of billions. Milvus's flexible distributed architecture can handle large-scale vectors very well, and the batch processing pathways provide faster insertion speed and a more convenient ETL pipeline. These two factors together give Milvus the capability to handle scenarios at the billion-scale level.

On the online side, in addition to classic search and recommendation, vector databases are also important in the current hottest domain, large language models. In this area, the agent is a typical scenario. An agent is an AI system based on an LLM that can auto-complete complex tasks through multiple rounds of dialogue with the LLM and third-party interface calls. In AI agent systems, the LLM is a forgetful brain, while the vector database is the hippocampus. The memory loss makes every interaction with the LLM like starting over again from scratch in a closed-book exam. The presence of the vector database turns this process into an open-book exam, I would say. On the other hand, the agent system can browse domain knowledge and private data and provide them to the LLM to make answers more accurate. And also, the agent can record its own operation history, better understand the user's needs, and achieve better personalization. This scenario has high demands on the real-time insertion speed, performance, and data visibility of the vector database, and these are issues that Milvus's streaming insertion scenario actively seeks to address. Yeah, because we don't want to have a robot here that we talk with and it reacts super slowly.
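(A minimal sketch of this "vector database as hippocampus" pattern, assuming the "demo"-style collection from earlier also has a "text" scalar field, and assuming an HNSW index so the `ef` parameter applies. `embed()` is a hypothetical stand-in for whatever embedding model the agent uses.)

```python
import numpy as np

def embed(text):
    # hypothetical stand-in for a real embedding model: a deterministic toy vector
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.random(768).astype(np.float32).tolist()

def remember(collection, text):
    # store one dialogue turn: its embedding plus the raw text
    collection.insert([[embed(text)], [text]])

def recall(collection, question, k=3):
    # fetch the k most relevant past turns to prepend to the LLM prompt,
    # turning the "closed-book exam" into an open-book one
    hits = collection.search(
        data=[embed(question)],
        anns_field="embedding",
        param={"metric_type": "L2", "params": {"ef": 64}},
        limit=k,
        output_fields=["text"],
    )
    return [hit.entity.get("text") for hit in hits[0]]
```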
Well, it seems like Milvus has converged on the pathway to supporting machine learning. We must feel very safe where we are, cheers? No. The vector database is a completely new field where all solutions are far from settled, and new challenges have just arrived in this year of AI. As machine learning develops faster and faster, we also need to constantly think and change to catch up with its development. So next I want to share with you some of my thoughts about DB for machine learning and machine learning for DB, and the directions we need to consider now and in the future. There are no clear answers and solutions for these questions yet, and I hope this can bring some food for thought.

With machine learning technologies evolving, models rapidly enhance their capability to understand more complex semantics. Meanwhile, search technologies boosted by machine learning also evolve. For example, once I found an incredibly comfortable pillow at one of my friends' houses, decided to buy a similar one, and tried to search for it online. Let's think about what would happen in different development stages of search. First, in the first stage, keyword search, I would just type "a pillow as comfortable as a cloud," and what is the result? Maybe I would get a cloud-shaped cookie cutter and some pillows. Or of course it can also be a pillow that looks like a cloud. So the next step is to browse through all the products one by one, and that takes forever. Moving to the next stage, now we have image search, and I just remembered that I happened to take a photo of my friend's pillow. Okay, it looks like I can do an image search with that. What do we get? Yeah, at least I get a lot of pillows this time, no cookie cutters. Okay, so let's browse. Next step: now that we have some multi-modal search, we can provide an image and add a description, "a pillow as comfortable as a cloud." What do I get this time? The search isn't just about simple image matching anymore; comments and descriptions on the softness of the pillow are also taken into consideration in this result. So finally, we found the target pillow. It looks similar to the one I saw, and it's super soft.

To keep up with fast-growing machine learning, the vector database also needs to be smarter. When searching for similar images, we would prefer the search results to be similar to the original image at a semantic level, not just a similar look. Let's take a look at how the vector database defines similarity. Currently, almost all vector databases define the similarity between two vectors based on the distance of the vectors in L2, IP, or cosine space. I will call this mathematical similarity. Correspondingly, semantic similarity represents the real-object similarity of two pieces of unstructured data. The current presumption in vector databases is that these two types of similarity are equivalent. So the question is: are they the same? I would say it depends, or mostly not. In fact, the academic world has a lot of research in this area trying to solve this problem. There are many solutions, for example, using user-defined distance calculations, which can even be machine learning models; or you can directly use a model instead of an algorithm as an index to do similarity search. The feasibility of these approaches and how to implement them, and whether there are other methods to address semantic understanding issues in vector databases, are all directions we need to explore and find out.

The next one is about the curse of dimensionality. As models grow larger and more complex, the dimensions of the embeddings they extract are also increasing. At the same time, to represent more complex semantic info, vector dimensions are also getting larger. This brings a significant challenge to the storage and computation of the vector database. Also, high dimensionality can cause the distribution of the data to become extremely sparse. This makes semantic info more difficult to measure with existing metrics like L2 or IP distance; this is one perspective on the curse of dimensionality. Therefore, how to reduce dimensionality and compress the data is a very important topic in the related fields. There are some dimensionality reduction algorithms on the model side trying to solve it, such as PCA and t-SNE, this kind of thing. In the field of vector search, there are also many different quantization algorithms, such as PQ, SQ, and AQ. These algorithms sacrifice accuracy for better storage usage and performance. However, the sacrificed accuracy is from the perspective of mathematical similarity. Therefore, a very important challenge is how to compress and reduce dimensions while retaining as much semantic information as possible.
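(To make the SQ idea just mentioned concrete: a minimal per-dimension scalar quantization sketch, compressing float32 vectors 4x into uint8 codes at the cost of some mathematical precision. The function names and scheme are illustrative, not Milvus's implementation.)

```python
import numpy as np

def sq8_encode(x):
    # per-dimension min-max scalar quantization to uint8: 4x smaller than float32
    lo, hi = x.min(axis=0), x.max(axis=0)
    codes = np.round((x - lo) / np.maximum(hi - lo, 1e-9) * 255).astype(np.uint8)
    return codes, lo, hi

def sq8_decode(codes, lo, hi):
    return codes.astype(np.float32) / 255.0 * (hi - lo) + lo

x = np.random.rand(1000, 768).astype(np.float32)
codes, lo, hi = sq8_encode(x)
print(np.abs(sq8_decode(codes, lo, hi) - x).max())  # small reconstruction error
```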
In addition to the database catching up to support machine learning, machine learning can also help enhance the database. Machine-learning-based auto-tuning is one of the most mature scenarios, just like auto-tuning in traditional databases. In fact, compared to the deterministic search method of traditional search, ML tuning can play a greater role in the vector database area, as its probabilistic search allows for more flexibility. Aside from the performance improvement, a vector database needs to maintain relatively stable accuracy under different searches to support business scenarios. This is another area where machine learning tuning is needed. Taking the IVF index mentioned earlier as an example, from the performance perspective, besides the high-level adjustable parameters similar to a traditional database, there is also a big space on the algorithm side. We can adjust the number of samples, the number of buckets, the number of buckets involved in the search, the type of compression used in each bucket, the extent of compression, and so on. From the usability perspective, when we try to search for the top-K nearest points, the different values of K, how many results you want to get, will affect both performance and accuracy. For example, a larger K obviously requires us to search more buckets to ensure accuracy. Therefore, we need to dynamically assign different strategies to each query: different parameters, like how many buckets are descended into for each query. That is, when users do a search, we have a model that takes the size of the segments, the number of segments, the type of algorithm, and the size of K all into consideration, and dynamically generates parameters to ensure the accuracy stays within a relatively stable range. (There is a toy sketch of this per-query idea after this exchange.)

But isn't the challenge, though, that because it's approximate, it's not like your auto-tuning example, where there's a hard objective measurement we can use to say whether the AI is making the system better: is the P99 latency going down? In your case, it's this fuzzy thing where it sounds like you're adjusting how many K items you look at, but the end user won't be able to say, "oh yeah, this is better than what I would have had before," because it's a subjective response anyway. The answer is subjective to the context of the person and what they care about. It seems like a fuzzy thing to measure and try to improve.

I think the most important necessity is from the usability perspective, as mentioned before. When you do a search, maybe the recall, the precision, the accuracy is subjective. But you will have some underlying expectation of what the results will be like. For example, if you have a very, very big K and you only browse two buckets, as mentioned here, the accuracy will be super low. And sometimes you search on Google and this time the results are relevant, this is good; and the next time you try to search more, you iterate through more pages, and nothing is relevant at all. That is not really acceptable in a real production scenario. So it's super important to maintain the accuracy in a really good range.

What's the feedback mechanism to know that you're doing well, that you're doing better than maybe you were before?

No complaining from customers. So if you do nothing, of course, you will get some, because the accuracy will vary with different cases.

Okay. All right. Thanks. Keep going.
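(Picking up the forward reference above: a toy heuristic in the spirit of per-query tuning, where a larger top-K automatically widens the number of probed buckets. This is an illustration only; the real tuner described in the talk is a model that also weighs segment sizes, segment counts, and index type.)

```python
def pick_nprobe(k, n_buckets, avg_bucket_size, floor=4, oversample=3):
    # probe enough buckets to scan roughly oversample*k candidates,
    # so accuracy stays stable as K grows
    needed = (oversample * k + avg_bucket_size - 1) // avg_bucket_size
    return max(floor, min(n_buckets, needed))

print(pick_nprobe(k=10, n_buckets=256, avg_bucket_size=40))   # small K -> few probes
print(pick_nprobe(k=500, n_buckets=256, avg_bucket_size=40))  # big K -> many probes
```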
Yeah. So finally, let's go back to the two questions posed at the beginning of this talk. Question one is: what is the relationship between traditional search and vector search? First of all, vector search is not a replacement for traditional search, but complementary. Traditional search focuses more on keyword matching, while vector search focuses more on contextual, semantic matching. So currently, many search systems contain both a keyword search system and a semantic search system. When they get results from both modules, there is a post-processing step that does result merging and re-ranking. This is the common perspective, but we have started to rethink the relationship between traditional search and vector search. We divide keyword search into two parts: sparse vector extraction and sparse vector similarity search. So we can simply use traditional statistics-based methods like BM25 or TF-IDF, this kind of thing, to extract a sparse vector, and use brute force on the sparse vectors to replace the classic keyword search. This structure can bring us more flexibility. For example, from the perspective of sparse vector extraction, learning-based methods can also be used to enhance the understanding of hidden semantic information, like the SPLADE models; they're very hot models these days. And for sparse vector search, in addition to exact search like brute force, we can also apply some approximate strategies, like what we do with dense vectors; this trades a little bit of accuracy to accelerate the search process. So I like to call this the process of vectorizing traditional search. (There is a small sketch of the sparse idea at the end of this section.)

Following this, let's think about the second question: what is the relationship between a traditional database and a vector database? Let me give an example. What can we do if we need to use a picture of a dog to find the three most similar kinds of dogs? A conventional vector search would return the three most similar dog images, which in our case are Snoopy 1, Snoopy 2, and Snoopy 3. You might notice that this is a classical group-by scenario, because you don't want three Snoopys; you want three different kinds. So it looks like we can group by the dog's name and then search for the most similar image in each group. Now we get Snoopy, Goofy, and Pluto. Let's take one step further: what if the database doesn't have the name column for us to group by? Then we get Snoopy 1, 2, and 3 again, because we don't have the name column. So it sounds like we need to group by the image directly. This means that besides deterministic string matching, we also need the capability to group vectors through probabilistic similarity. Now let's simply group by the image vectors themselves, and the vector database will group the similar Snoopys together and bring Goofy and Pluto back. Besides group-by, we can also have aggregations based on it, or join operations, and other functionality. So we can see that the features required in traditional databases also apply to vector databases; it is just that we need to reimplement them in a probabilistic way rather than the traditional deterministic way. So similarly, just like vectorizing traditional search, I like to call this the vectorization of the traditional database. Actually, in the era of machine learning, we vectorize text into tokens, we vectorize images, we vectorize everything. So I'd like to leave a question here: what else can we think of that is on the way to being vectorized in the future? Yeah, that is all of my talk, and thanks; I welcome any questions.

I will applaud on behalf of everyone else. Again, thank you, Lee, so much for doing that. We have time for one or two questions from the audience if anyone wants to go for it.
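(Picking up the forward reference above: a minimal sketch of "vectorized" keyword search, with TF-IDF as the sparse extractor and brute-force cosine similarity as the sparse search. scikit-learn here is just a convenient stand-in, not what Milvus uses.)

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "a pillow as comfortable as a cloud",
    "cloud shaped cookie cutter for baking",
    "memory foam pillow, soft and fluffy",
]
vectorizer = TfidfVectorizer()
sparse_vectors = vectorizer.fit_transform(docs)        # one sparse vector per doc

query = vectorizer.transform(["soft cloud pillow"])
scores = cosine_similarity(query, sparse_vectors)[0]   # brute-force sparse search
print(scores.argsort()[::-1])                          # docs ranked by similarity
```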
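(And for the group-by-vector idea: a toy sketch that treats results whose vectors fall within a distance threshold of an existing group as duplicates and keeps one representative per group. The threshold-based grouping is a simplification for illustration, not how a vector database would actually implement it.)

```python
import numpy as np

def group_by_vector(result_vectors, result_ids, threshold=0.5, k=3):
    # walk the results in ranked order; a result starts a new group only if its
    # vector is farther than `threshold` (L2) from every group representative
    reps = []  # (vector, id) of each group's representative
    for v, i in zip(result_vectors, result_ids):
        if all(np.linalg.norm(v - rv) > threshold for rv, _ in reps):
            reps.append((v, i))
        if len(reps) == k:
            break
    return [i for _, i in reps]  # e.g. one Snoopy, one Goofy, one Pluto
```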
So I'll ask my question; maybe you mentioned this and I just missed it. You have the ability to do searches on the non-embedding attributes. So if I want to say, show me all of the documents that have this sort of keyword, the semantic search piece goes to the transformer, that gets an embedding, and so you do the approximate, you know, the secondary search on the vectors with that. But then I also want to filter where the document is older than 10 days. How do you guys handle that? Are you embedding the additional metadata or the attributes about every entity or object in the index itself, or are you doing a separate search? How does that work for you guys?

Oh, so sorry. Do you mean if the data is pretty old?

No, no, no. I mean, just how do you handle combining semantic search based on the vectors, sorry, similarity search based on the vectors, plus additional scalar attributes to do additional filtering, like a WHERE clause: where age is greater than this, or where country equals Canada, something like that.

Okay, okay. So this is what we call filtered search. We have to do the same thing, because, I would say, the vector database is not a replacement for the traditional database; it's an extension, because we also support things the traditional database supports. We have these scalar columns, and we do a pre-filter inside: we generate a bitset. This bitset indicates which points in this vector segment get filtered out. Then we pass this bitset to the algorithm, and the algorithm will use it to do the search. And this is how we do filtered search.

So you filter on the scalar values first, then do the similarity search. Got it. Okay.

Actually, it's on our roadmap to make it more complicated, because we've noticed pre-filter and in-filter. Pre-filter means, as I mentioned before, we do the scalar calculation first, and then we do the vector search. In-filter means we do the vector search and, in the meanwhile, we do the scalar one. And there is also post-filter. They are suitable for different kinds of scenarios, and we need some kind of analysis here to say which one to go with at the very beginning. So yes, this is on our roadmap.

Okay. So then does that mean you have a notion of the selectivity, the cardinality of predicates, beforehand? For scalar columns, you know, there are well-known textbook techniques to do this, but for the indexes, the vector indexes, you know, that's hard.

Yeah, it's hard. And also, it's a little bit complicated on the vector side, because for different kinds of algorithms, the filtering has different kinds of problems to solve. For example, for IVF, the bucket-based ones, the obvious thing is that you need to enlarge the number of buckets you search to get very good accuracy. And for the graph-based ones, another problem is if the filtering rate is too high, which means the filter drops almost everything: you need to decide whether you want to go through the filtered-out points or not. If the answer is yes, it will be super slow. And if the answer is no, there will be islands in the graph that you cannot get out of, so the accuracy will be super low. We have some special solutions for this, and this is what we are doing right now.
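(The filtered search just described looks roughly like this from the pymilvus client side, continuing the earlier sketches and assuming the collection's schema also carries `country` and `age` scalar fields; the query vector, field names, and filter expression are all illustrative.)

```python
query_vec = [0.0] * 768  # placeholder query embedding

hits = collection.search(
    data=[query_vec],
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 16}},
    limit=10,
    # the scalar predicate: Milvus evaluates it first and hands the resulting
    # bitset to the vector index, as described above
    expr='country == "Canada" and age > 10',
)
for hit in hits[0]:
    print(hit.id, hit.distance)
```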