So we're excited today to have Cheng Chen. He's a staff engineer at SingleStore, where he's been for about five years now. He's here to talk about the data systems they're building at SingleStore. Again, the idea here is to put into context all the things we talked about this semester and show that when you build a real database system, this is what it actually looks like. Although things in BusTub are not going to be as efficient as what you would do in a real system, and Cheng will discuss some of those things. So in addition to being at SingleStore for five years, he also did his undergrad in computer science. So we're going to hand it over to him to talk about their data system, in addition to how they've added vector support. With that, Cheng, the floor is yours. Thank you so much for being here. And I'll say also, for the class, if you have any questions, raise your hand and then we'll call on you and you can ask your question. Feel free to do this at any time, so that Cheng's not just talking at us for an hour on Zoom, okay? All right, Cheng, the floor is yours, go for it.

Yeah, thank you, Andy. Hi everyone, my name is Cheng and I'm a software engineer at SingleStore working on the core execution team. As Andy said, if you have any questions during the talk, feel free to just interact with me. My talk today has a controversial title: do you need a specialized vector database? The last decade has seen a remarkable rise in specialized database systems. There are systems for transaction processing, data warehousing, time series analysis, full-text search, et cetera. And now the hype goes to vector databases. A belief has taken hold that a single general-purpose database is not capable of running varied workloads at a reasonable cost with strong performance, and at the level of scale and concurrency people demand today. There is value in specialization, but the complexity and the cost of using multiple specialized databases in a single application environment is becoming apparent. This realization is driving developers and IT decision makers to seek databases capable of powering a broader set of use cases. SingleStore is a distributed general-purpose SQL database designed to run a broad set of use cases with good performance. In the first half of my talk, I will provide an architecture overview of SingleStore, explaining how we combine a rowstore and a columnstore to support both operational and analytical workloads, and detailing our index and filtering design. In the second half of the talk, I will take a deep dive into the upcoming vector index feature and explain how we integrate it into that index and filtering design.

So let's start with the SingleStore overview. SingleStore is a distributed general-purpose SQL database. It is capable of handling both operational and analytical workloads. It can run both TPC-H and TPC-DS competitively with data warehouses, and it can also run TPC-C competitively with operational databases. It is cloud native: the architecture is designed to take advantage of elastic cloud infrastructure. It can scale up to efficiently utilize hundreds of hosts, thousands of cores, and tens of terabytes of RAM, while still providing a user experience similar to a single-host SQL database such as Oracle or SQL Server.
We have hundreds of customers with demanding production workloads in large finance, telecom, energy, and tech companies. To justify my claim that SingleStore can handle both operational and analytical workloads with good performance, here are TPC-C and TPC-H numbers taken from our VLDB 2022 paper. Our TPC-C numbers are competitive with a popular cloud operational database, and they scale well with cluster size. Our TPC-H numbers are competitive with two popular cloud data warehouses. The key differentiator here is that we can do both operational and analytical workloads very well. Even better, we can run both kinds of workloads concurrently with good performance. On the other hand, cloud data warehouses only support data warehousing and can't run TPC-C, and cloud operational databases can run both benchmarks but perform orders of magnitude worse on TPC-H.

So let's look at the product overview of SingleStore. We have a very fast data pipeline to ingest data into SingleStore. We have a three-tier architecture. The top tier is a very fast, in-memory, lock-free, skip-list-based rowstore. The second tier is our on-disk, LSM-based columnstore that resides on local SSD. The bottom tier is an optional shared storage for cold data. We use the same storage for all tables, and it works well for hybrid workloads. We are multi-model, meaning that we can handle a broad set of workloads with just one single database. We can handle operational and analytical workloads concurrently with good performance. We support structured relational data, unstructured JSON data, geospatial, time series, full text, et cetera. And our vector search support is coming. We are compatible with MySQL and also support the MongoDB API. In addition, we have strong extensibility support, allowing users to write Wasm functions or external functions. We use a shared-nothing architecture that scales out very well while providing a user experience similar to a single-host SQL database.

Cheng, are you gonna talk about skip lists a little bit or no? No, I'm not going to talk about skip lists. Okay, so for everyone in class, it's a probabilistic B+tree: you kind of flip a coin and decide whether you branch or not. I'll send something out afterwards. Another question. I have a question about this slide. How is it shared nothing if there's blob storage at the bottom? Did you hear that? Yeah, yeah. Good question. So basically each node has its own partitions, but the cloud storage provides unlimited storage; we just upload the blobs to the blob storage. So I think what you're saying is that every node basically manages its own buckets in the blob storage, and they're the only ones that can read and write to them. So it's like an infinite disk, but it's still shared-nothing partitioned. Yeah, basically each partition has a designated leaf node. So it's kind of shared nothing, but each leaf node has unlimited storage. Got it. Are you gonna talk about the hierarchical structure of the query execution with the trees or no? Yeah, I'm going to go deep into the query execution part. Okay, awesome. Thanks.

Yeah, so let's look at the cluster architecture of SingleStore. SingleStore is a horizontally partitioned, shared-nothing database with an optional shared storage for cold data. There are two types of nodes, aggregators and leaves.
So clients connect to aggregators, and aggregators are responsible for query compilation and planning and also coordinate distributed queries. Leaves hold the actual data partitions and perform most of the computation. Tables are hash partitioned by a user-configurable set of columns called the shard key. Each leaf holds several partitions of data. Each partition is either a master partition, which can serve both reads and writes, or a replica partition, which can only serve reads. This enables fast query execution for point reads and for query shapes that do not require moving data between leaves.

So let's look at an example. Here is a cluster with one aggregator and three leaves. The table is sharded into three partitions, and each leaf holds one master partition and one replica partition. Say the user queries the aggregator with SELECT AVG(a) FROM db.t. The aggregator will query each leaf with SELECT SUM(a), COUNT(a) against each table partition. It then computes the total sum divided by the total count and returns it to the user. Simple queries like this do not need any data movement. Certain query shapes do require moving data. For example, in a distributed join, in the simple case where the shard keys match, the aggregator can just push the execution down to individual partitions. But if the shard keys do not match, the leaves need to redistribute the data via broadcast or shuffle. The optimizer also needs to take these data movement costs into account, and certain queries also need to be transformed in order to be efficient to execute. SingleStore is able to run all TPC-H and TPC-DS queries with good performance. Yeah, any questions so far? I think we're good.

Yeah, so now let's focus on a single node and see how it can handle hybrid workloads. Let's start with what hybrid workloads look like. OLAP workloads involve complex data analysis; they require scanning hundreds of millions to trillions of rows per second. On the other hand, OLTP workloads involve transaction processing and real-time updates; they require writing or updating millions of rows per second. Real-time analytics workloads involve running both OLAP and OLTP workloads concurrently: running complex interactive queries over large, changing data sets, concurrently with high-throughput, low-latency read and write queries, with predictable response times. So we are looking for a unified table storage that is efficient for both analytical and transactional workloads. We know a columnstore has good scan performance and is good for analytical workloads. So the main idea for designing table storage that is good for hybrid workloads is to use an operationally optimized columnstore. It is basically a columnstore with modifications to better support selective reads and writes, in a manner that has very little impact on its compression and table scan performance.

So let's look at how this is implemented. Each SingleStore table has an in-memory rowstore segment to absorb small writes and avoid creating small sorted runs across many files. Rows are first written into the in-memory rowstore segment. A background flusher periodically flushes rows from the in-memory rowstore segment to a new columnstore segment. A background merger merges columnstore segments incrementally to maintain a logarithmic number of sorted runs.
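To make the shard-key and aggregation-pushdown discussion above concrete, here is a minimal SQL sketch. The table definition and the per-partition rewrite are illustrative only (the partition naming is hypothetical, not SingleStore's internal form), but the shape matches what the talk describes: the aggregator sends partial SUM/COUNT queries to each leaf and combines them.

```sql
-- Hypothetical table hash-partitioned across leaves on `id` (SHARD KEY clause,
-- as in SingleStore; exact syntax may differ slightly by version).
CREATE TABLE db.t (
  id BIGINT,
  a  INT,
  SHARD KEY (id)
);

-- What the user runs against the aggregator:
SELECT AVG(a) FROM db.t;

-- Roughly what the aggregator sends to each leaf partition (illustrative name):
SELECT SUM(a), COUNT(a) FROM db.t_partition_0;

-- Aggregator combines the partial results: AVG = total_sum / total_count.
```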
In the case of batch loading, rows are written directly to a new columnstore segment without touching the rowstore segment. All columnstore segments are immutable. Delete and update will just mark the rows deleted in the segment and move them to the in-memory rowstore segment. The deleted bits are stored in the in-memory metadata table. So this part is very similar to an existing LSM. Do we have any questions so far?

Yeah, I just want to put this in the context of the class. I think in lectures four and five we talked about log-structured storage. This is basically the same thing; the LSM part is just log-structured, and we ignored the in-memory part. So again, it's buffering all the writes, but instead of then writing out the disk page, the block, as rows the way log-structured SSTables do, they're writing it out as a columnstore. So the basic idea is they're synthesizing the rowstore and the log-structured store together: you get the faster writes of log structuring, and then you get the faster scans of the columnstore. Yeah, and we'll talk about the fast way to find and seek to any row in the columnstore later.

So it requires effort from all layers in order to support hybrid workloads with good performance. I will briefly mention the optimizations we do for the storage layer, but I'm not going to go deep into this. Our design is optimized for tiered storage. All blobs are immutable, so we don't need to change a blob in S3, which can be slow. Flush and merge run in the background, and there are no blob writes on commit; a commit only writes to the catalog. Our storage also allows for out-of-order replication, which allows small transactions to be committed without waiting for big transactions.

Yeah, so next let's look at the optimizations we do at the query layer to support hybrid workloads. Our columnstore is optimized for analytical workloads. Some optimizations we do are vectorized execution, where we execute on a batch of rows using SIMD instructions. We also support encoded execution, where we can execute on encoded data for some filter, group-by, and hash join operations. We also do late materialization, where we only decode the rows and columns that are needed. These optimizations are fairly common; they can be found in other OLAP systems. Because our columnstore also needs to be optimized for operational workloads, we do some optimizations that are less commonly found in other OLAP systems. The goal of these optimizations is to allow us to find and seek to individual rows efficiently. For example, all the column encodings we use are seekable: they allow us to efficiently read at a specific row offset without decoding all the rows. We store the segment metadata in memory. It includes the min/max values and the deleted bits for each segment, and it allows us to run efficient segment elimination to reduce the number of segments to scan. Are those the zone maps? Yeah, yeah. Okay. It's kind of like the zone maps in other vendors' terminology, yeah. Got it, awesome, thanks. A sort key can also be specified on each columnstore table. When specified, flush will sort the rows within a segment based on the sort key, and the merger will merge segments incrementally to maintain a logarithmic number of sorted runs. The sort key allows for more efficient segment elimination. Moreover, we support secondary hash indexes for efficient point access. I will discuss this in more detail next.
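As a quick illustration of the sort key, segment elimination, and secondary hash index features just described, here is a hedged DDL sketch. The column names are hypothetical and the clause syntax only approximates SingleStore's; check the documentation for the exact form.

```sql
-- Columnstore table with a SORT KEY (enables min/max segment elimination on ts)
-- and a secondary HASH index for point lookups on user_id. Illustrative only.
CREATE TABLE events (
  ts      DATETIME,
  user_id BIGINT,
  payload JSON,
  SORT KEY (ts),
  SHARD KEY (user_id),
  KEY (user_id) USING HASH    -- secondary hash index for efficient point access
);

-- A point query that can use the hash index plus segment elimination on ts:
SELECT payload
FROM events
WHERE user_id = 42
  AND ts >= '2023-11-01';
```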
The deleted rows in a segment are stored as a bit vector in the segment metadata. While this representation is optimized for vectorized access during analytical queries, it can also introduce a source of contention when modifying the segment metadata. So SingleStore implements a row-level locking mechanism to avoid blocking during transactional workloads. This is done by first moving the rows to be updated or deleted to the in-memory rowstore segment.

Another optimization we do is full query code generation. It is beneficial for both OLAP and OLTP workloads. The first time a query is run, SingleStore will optimize and compile the query into machine code. When the same query is run in the future, we will directly run the machine code. All queries are parameterized: the constants in the query are replaced by placeholders. For example, the constants, say 5 and 'ABC', will each be replaced by a parameter. This allows the same query plan to work for any constants. The compilation is done by first generating C++-like MPL code for the execution plan. The MPL code is then flattened into assembly-like MBC opcodes, which are then passed to LLVM to compile to machine code in the background. We start executing the query by interpreting the MBC opcodes and switch to machine code during query execution when compilation finishes. This optimization is very effective for evaluating complicated expressions. Yeah, any questions so far?

So let me just put this in context for the class, because we don't discuss query compilation or code generation in the intro class; we only look at this in the advanced class. Think of how in BusTub you have everything in the Next function. So when you run a query, you're just calling next, next, next, and that's going down the tree and pulling tuples up. But inside all that code, there are expression trees with giant switch statements that do different things based on the types. A lot of that is hidden away from you because it's all utility code. But at the end of the day it has to say, okay, let me go look up what the type of this is, and then there's a switch statement that says, okay, if it's an integer do this, if it's a float do that, and so forth. All of that adds additional overhead. So what they basically do is look at the query plan and then bake a program that executes only that query plan. You know exactly what the types are, you know exactly what the data looks like, and therefore there aren't any conditionals like "if it's an integer do this, if it's a float do that"; you say do exactly these things. You then compile that, and you're basically running a program as if it was handwritten exactly for that query. That's the basic idea here. Now, it doesn't come for free. Obviously there's cost in all these steps, in particular going from the opcodes to machine code; the LLVM step can take tens to hundreds of milliseconds. So if your query is going to run for five milliseconds, do you want to spend 300 milliseconds compiling it? There's a trade-off to doing this. And the parameterized query part is basically the way to say, okay, if I pull out the constants, like WHERE b = 'ABC', I pull all those things out, and now I can pass those parameters at runtime to a generic program and I don't have to recompile it over and over again.
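Here is a small sketch of what that parameterization looks like at the SQL level. The placeholder notation is illustrative (not SingleStore's internal representation); the point is simply that constants are lifted out so one compiled plan can be reused.

```sql
-- What the application sends (constants inline):
SELECT * FROM t WHERE a = 5 AND b = 'ABC';

-- Conceptually what gets compiled and cached: constants become parameters,
-- so the same machine code serves any constant values (placeholder names
-- are hypothetical):
SELECT * FROM t WHERE a = @p0 AND b = @p1;

-- A later query like this reuses the already-compiled plan with new bindings:
SELECT * FROM t WHERE a = 7 AND b = 'XYZ';
```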
So MemSQL, sorry, SingleStore, is the most aggressive at this of all the commercial systems. Postgres only compiles the WHERE clause, the expression trees. There are the Umbra and HyPer systems doing the same kind of stuff, but this is rather unique to SingleStore. They didn't invent it, though. Actually, as with many things in databases, this was invented by IBM in System R back in the 1970s, where they were generating assembly for every single query. But then it was a super pain in the ass to maintain, because every single time you changed something in one part of the system, you had to change the compiler piece. In this case, because they're generating C++ and these opcodes and they basically have their own VM, this makes their implementation a bit more robust to changes in the underlying system. So this is a really cool idea. It's very hard to do, and very few systems do this. Yeah, thank you Andy.

So next I would like to discuss our secondary index designs and how they allow us to support efficient and flexible table scans. There are two common indexing approaches for an LSM tree. The first approach is an external index, where we build an index structure outside of the LSM tree used for table storage, mapping the secondary index columns to the primary key. This approach can also be applied to non-LSM storage. However, this approach incurs an indirection: an index lookup for each matching row to find the row in the primary LSM tree storage. This overhead is particularly significant when there are many matching rows. We haven't talked about the vector index yet, but I would like to mention that we think this might be one of the reasons why some Postgres-based vector index implementations have slower vector search performance compared to specialized vector databases. Also, if you use a specialized vector database together with your main database, the vector index in the specialized vector database is essentially an external index.

Another common approach is to build an index for each segment. This way, the index is integrated into the primary LSM tree and therefore doesn't have the extra LSM tree lookup overhead. Compared to the external index, this approach incurs additional O(log n) write amplification due to merges, but we found it to be an acceptable trade-off for efficient index lookups. Another problem with per-segment indexes is that there can be a large number of segments, and we need to scan each segment's index. Because indexes generally have sublinear search complexity, searching one larger index is cheaper than searching several small ones. So we extend the per-segment index and have an index merger that builds an index LSM tree. The background index merger periodically builds cross-segment indexes over multiple segments. This again comes with additional O(log n) write amplification, but we've still found it to be favorable. Index merging happens independently from segment merging because different indexes have different characteristics and may benefit from different merge policies, and we don't want them to interact with each other. Yeah, any questions on this?

Sorry, maybe you said this in the beginning: is this just another skip list, for you guys? This is not a skip list. This is any secondary index. This includes the secondary hash index, the full-text index, and also the upcoming vector index. Ah, okay, so these are all your... okay. But if you wanted a...
If you wanted an order-preserving index as a secondary index, like on a non-primary key, would you get a skip list, or would you get something else? We don't have a range index yet besides the primary index. This is talking about indexes on the columnstore part, the columnstore LSM; we are describing different ways to build indexes for the LSM tree. Okay, awesome, thanks. Yeah, basically what we do is per-segment indexes plus index merging.

So let's look at our secondary hash index as an example. It has two levels of index. The per-segment index is a regular inverted index that maps a value to the list of row offsets in the segment with that value. The cross-segment index is another hash table that maps the hash of a value to a list of segment IDs and the starting locations of the corresponding posting lists. An index lookup involves first searching the cross-segment index to find the list of segment IDs and posting list offsets, then, for each selected segment, using the starting posting list offset to find the row offsets. Here's a diagram showing that. We have an indexed string column, and this segment contains a bunch of foos and bars. The per-segment index is an inverted index from the values foo and bar to their row offsets in the segment. The cross-segment index is a hash table from the hashes of foo and bar to segment and posting list positions. The search starts from the cross-segment index to find segment IDs and posting list offsets, then for each segment uses the posting list to find the row offsets. Compared to per-segment indexes alone, without any index merging, this improves index lookup performance from O(M) to O(log M), where M is the number of segments.

One nice thing about per-segment indexes and our index LSM is that they allow for a unified way to identify a row efficiently: by segment ID and row offset. This applies to the column storage, the deleted bits, and all kinds of segment indexes, and it allows us to correlate between them efficiently during a table scan. This unified way of identifying a row efficiently also allows us to implement an adaptive table scan, which is important for hybrid workloads because hybrid workloads need to combine different access methods and apply them in the optimal order. Static decisions made by the optimizer don't always work, because the cost depends heavily on query parameters and the encodings used. The adaptive table scan scans the table with a filter tree and outputs the rows passing the filter tree. Here is an example of a filter tree. The leaf nodes are filters that are connected by intermediate AND and OR nodes. Some leaf nodes can be executed using an index; some are just regular filters. A filter can be something like a range filter on an integer column or pattern matching on a string column, or it can involve multiple columns with an arbitrarily complicated expression.

The adaptive table scan involves three steps. The first step is, for each partition, to select the segments to scan. This is done by segment elimination using the cross-segment indexes and the segment min/max metadata. This step outputs a list of segment IDs to scan next. Oh, do we have any questions? Okay. Then, for each segment, we use the per-segment index together with the filters to select the rows to scan within that segment. Okay, everything good? Yeah, keep going. Sorry, yeah, the dog is barking. Oh, sorry.
Yeah, so the second step is, for each segment, we use the per-segment index together with the filters to select the rows within that segment. We support filter reordering to find the optimal execution order, which we will discuss in more detail on the next slide. This step outputs a selection vector. Then we do row projection for each block. A block is a batch of rows for query processing within a segment; the default size of a batch is 4K rows. We adaptively make decisions such as whether to seek or scan, whether to use a column group, and whether to decode the column or just send the encoded values to the next operator. Sorry, yeah, any questions? Yeah, you said a batch size is four kilo: 4,000 tuples or 4,000 kilobytes? The segment size is one million rows. A batch of rows is basically a batch for query processing; that's 4K. 4,000 tuples or four kilobytes? Yes, 4K rows, 4,000 tuples. Got it, okay, awesome, thanks. Yeah, so this is basically the adaptive table scan; everything fits in these three steps.

So now let's talk about filtering in more detail. The reason we need filter reordering is that there are different ways to evaluate filters, each with different trade-offs. A filter can be executed as a regular filter, where we selectively decode the column and then execute the filter. It can also be executed as an encoded filter, where the filter executes directly on compressed data; one good example would be dictionary encoding with a small number of distinct values. A group filter runs several filters as a group instead of running them separately; this is good if the selectivity of each individual filter is low. An index filter reads the index and finds the matching row offsets. It's usually better to run an index filter using the index than as a regular filter, but it can still be worse if the other filters have already narrowed the result down to a few rows. So SingleStore supports adaptive filter reordering for each 4K block of rows. The way it works is, for each segment, we estimate the cost of each strategy by timing it on a small number of rows. Then for each block we reorder the filters based on the cost estimates and the selectivity observed on previous blocks. Yeah, any questions so far? So this is basically the overview of the SingleStore architecture, with an emphasis on the query processing part.

Can you talk a little bit about when you actually run a query? You mentioned vectorized execution, SIMD stuff, but you're getting rows and you're getting columns or portions of columns. In your query engine, is it converting everything into a single form, like everything becomes a columnstore or everything becomes a rowstore, or do you have separate execution engine pieces for the rowstore and columnstore stuff? If you understand my question. Yeah, for the columnstore, we convert the rowstore segment into columnstore format. So it will always be a batch of rows, but we have a general interface that allows the internal representation to be either rowstore format or columnstore format. We can execute on both, and the interface is general enough that the operator doesn't need to worry about that. I haven't read the paper yet, but within scans you're doing vectorized execution, and I think you mentioned using SIMD for that as well, right? Yeah. Okay. Can you say publicly, do you go up to AVX-512 or just AVX2? I think it's only AVX2. Okay. Okay, awesome, thanks.
We also detect the CPU capability and decide which instruction set to use. Of course, yeah, of course. All right, any questions from the class? I have a question. Do you have specialized SIMD code, or are you just using basic GCC or Clang optimizations for it? We write the SIMD instructions ourselves for certain operations. It's intrinsics, right? Yes, we use intrinsics from GCC. Got it. Any other questions from the class?

And actually, can you briefly talk about the Wasm stuff? You say you're compiling UDFs written by your customers. And I think you guys still support MPSQL, the SingleStore programming language, or the MemSQL programming language; you had your own version of PL/pgSQL, but then you could parallelize it. Are you still doing that, but compiling down to Wasm and actually doing things as Wasm? So we support three different extensibility mechanisms: you can write UDFs in MPSQL, you can write them in Wasm, and you can write them as external functions. The only difference is that during query execution there will be some places where you make a function call to invoke the UDF. For MPSQL, you call the compiled MPSQL functions. In the Wasm case, you invoke the Wasm runtime to run the already compiled Wasm code. In the external function case, we just make a network call to the external service. So we have a Wasm runtime sitting next to the engine that compiles and runs the Wasm code for us. Got it. And, I mean, this is getting off topic: do you do any introspection? Obviously for the external functions you can't, but for the MPSQL thing, for the local UDFs, do you do any introspection to understand what the UDF is actually doing, to help you get cost model estimates about the output of the function call? Microsoft does something similar to this. Are you looking inside the UDF to make sense of what's going on? I'm not too sure about that. We definitely compile the MPSQL during compilation, and we probably do some optimization. Got it. But for Wasm and external functions, I don't think we do anything. I mean, you can't, it's external, you can't do anything. Okay.

I have a question. Yeah, go ahead. So from what I understand, the way you handle both analytical and transactional workloads is you can create a table to be either rowstore or columnstore, right? It is both a rowstore and a columnstore, but the rowstore sits on top of the columnstore as a cache. So every table you make is both row and column store. Yes. It's a single table type. Okay. And the rowstore is in memory and the columnstore is on disk, is that the distinction? Yeah, that's correct. The reason is we don't want to make the user choose which one to use, and it also doesn't make sense to make it a choice, because a hybrid workload might involve both. Okay, so we need to do both. If you run out of space for the rowstore in memory, then I guess you spill some of it to disk? Yeah, the rowstore segment is supposed to be small, and the background flusher periodically flushes the rowstore segment to an on-disk columnstore segment. Okay. Just making sure I understand. Thank you. It's the same thing as log-structured storage; it's just that what gets written out is columns, not rows.
Whereas when we talked about it in class, the rows got dumped out as rows. Yeah. Okay. Nice. I think the high-level architecture is similar. The additional work we do is making our columnstore work for operational workloads; basically, we allow fast seeks and fast lookups of any row. Again, when it was called MemSQL, the original MemSQL was just an in-memory skip list, right? An in-memory rowstore. Then they added a columnstore, but, as you said, you had to declare whether you wanted a rowstore table or a columnstore table. When they changed the name to SingleStore, obviously it was because, okay, now it's one table type; you don't care how it's being stored, it just handles everything for you. Okay. Right? In my opinion, it's the right way to do it. All right, keep going. Good stuff.

Yeah. So we've finished the SingleStore architecture. In the second half of the talk, I will talk about how we build vector search in SingleStore. Before we go into the details, let's start with a quick crash course on vector search. Vector search involves n vectors and a query vector, and the goal is to find the k nearest neighbors to the query vector. All the vectors are dense and in a d-dimensional space, and vector similarity is measured by some distance metric. The common distance metrics are Euclidean distance, dot product, and cosine similarity. Finding the exact k nearest neighbors can be slow, so people are looking at the approximate version called ANN, approximate nearest neighbor. The accuracy of ANN is measured by the fraction of the exact k nearest neighbors found, called recall.

Vector search is powered by representation learning. Representation learning learns to represent objects with vector embeddings, with a guarantee that semantically similar objects are closer to each other. For example, if a text describes the picture below it, their vector embeddings will be close, while the vector embedding of some random audio will be far away. Vector search gained a lot of popularity recently due to generative AI. In generative AI, you ask a question to a generative model like ChatGPT and it gives you an answer. But those generative models are trained on old and public data and are unaware of new and domain-specific knowledge. It is possible to retrain a new model or fine-tune an existing model to incorporate new and domain-specific knowledge, but that is inefficient and costly to do. On the other hand, retrieval-augmented generation, or RAG, is a cost-efficient approach to generative AI. You start by gathering all the documents you need and use an embedding model to generate the document embeddings. Now when a question comes, you use the same embedding model to generate a question embedding. Then you do a vector search over the document embeddings with the question embedding to find the relevant documents. Now you pass both the question and the relevant documents to the generative model and use in-context learning to produce a more relevant answer. RAG can also help reduce generative AI hallucination because it provides some sort of source citation to the relevant documents. Vector search is not the only way to retrieve text. In fact, SingleStore has had support for full-text search for several years. The difference between vector search and full-text search is that full-text search relies on keyword matching and doesn't capture semantics, while vector search can.
Moreover, vector search can be multi-modal: it can be used to search not only text but also images, audio, video, et cetera. However, vector search is more computationally costly than full-text search. We believe that both vector search and full-text search are important, and SingleStore provides a user-friendly interface to use them both together with other database features.

Vector index algorithms are a hot research topic. People have developed many good vector index algorithms, and new ones are being proposed every year. The existing ones can largely be categorized into four groups: tree-based, hash-based, quantization-based, and graph-based. Tree-based algorithms like KD-trees are most effective in low dimensions but degrade exponentially with high dimensions. Hash-based algorithms use locality-sensitive hashing to build hash tables for similarity instead of exact match. They have good worst-case theoretical bounds but need many hash tables to amplify their accuracy, at the cost of increased search time and index size, making them impractical in high dimensions. So the popular approaches are quantization-based and graph-based. Quantization-based algorithms partition vectors into clusters of nearby vectors so that search can be done by scanning a small set of clusters. Graph-based algorithms build a proximity graph where vectors are connected to their neighbors so that a neighborhood can be searched efficiently. Quantization-based and graph-based algorithms are so far the best and the most popular in practice. Most vector index algorithms are fully in memory, but there are a few that are optimized for on-disk use and have a small memory footprint while also offering good search performance. Some vector index algorithms can also be run on GPUs for better performance.

Let's look deeper into two popular vector index algorithms, IVF and HNSW. IVF, or inverted file, partitions vectors into clusters and uses a centroid to represent each cluster. It builds an inverted index from centroids to vectors. During search, it first finds the centroids nearest to the query vector and only searches within those nearby clusters. Next, let's look at the popular graph-based algorithm called Hierarchical Navigable Small World, or HNSW. I'm going through this algorithm very quickly; the details don't matter that much, I'm just giving you a feel for it. You can think of HNSW as a skip list over a proximity graph. Each node is only connected to a small number of neighbors, and the search is kind of greedy, starting from the coarse top layer and refining with the lower layers. Yeah, so the exact details of the IVF and HNSW algorithms are not important for this talk, but I want to give a quick comparison between the two to show that neither algorithm dominates the other. HNSW has higher recall than IVF. It's also faster to search: the complexity is about O(log n) versus O(sqrt(n)) for a common parameter setting. On the other hand, IVF is faster to build and has a smaller index size. I would also like to mention product quantization, or PQ. It is a vector compression technique that applies to various algorithms, for example IVF-PQ and HNSW-PQ. It not only saves space but also speeds up distance computation.
However, the compression is lossy, so you usually need to refine the results for better recall. Vector index algorithms can also be composed. Recall that searching for nearest neighbors in an IVF index involves first searching for the nearest centroids, which is yet another ANN problem, and it can be sped up by another vector index. For example, IVF+HNSW builds an HNSW index on the IVF centroids. This is particularly good because HNSW has a large index size but is now built on a smaller number of centroids, and HNSW has high recall, which is crucial for finding the nearest centroids in IVF.

So let's look at the vector search offerings in the market. Note that the data here is somewhat outdated. We have a large number of specialized vector databases and also existing database vendors adding vector search support. The point I want to highlight here is that many advanced offerings support multiple vector index algorithms, because they provide different but meaningful trade-offs. That being said, the main point I want to make here is that SingleStore's vector search is implemented in an algorithm-agnostic way that allows us to plug in any vector index algorithm, and we are committed to adding more vector index algorithms with different characteristics and improving the existing ones.

So now let's see how we build the vector index at SingleStore. Oh, by the way, do we have any questions so far, just about the overview? It was a very quick one. I think we're good; we had a lot of speakers this semester talk about their vector databases, so I think a lot of students have seen this stuff before. This is good. Okay, good, yeah. So now let's see how we build the vector index at SingleStore. Recall that a SingleStore table is an on-disk columnstore LSM plus an in-memory rowstore segment. Wait, sorry, we have a question from the students: how do you provide the same plugin interface for memory-based indexes and disk-based indexes? Yeah, I think it might be more apparent when I talk about how we build the index in this section. Okay, that's good. Basically the idea is we build an index for each segment and we have an index merger, so we are using the vector index as a black box where we don't care about its internal workings.

Yeah, so, because the in-memory rowstore segment is very small, we don't need to build a vector index there and can just rely on brute-force search. We only build vector indexes for the on-disk columnstore LSM. In particular, we start by building a vector index for each segment, then use a background vector index merger to merge several small per-segment vector indexes into one larger cross-segment vector index covering multiple segments. The way it works is that the background flusher and merger create a new vector index for each new segment they create. ALTER TABLE will create a new vector index for each existing segment. An update or delete will just mark the row deleted in the segment, and we rely on the deleted bits to filter those rows out during search. But if too many rows get deleted in a segment, we will rebuild its vector index. Similar to other segment indexes, vector indexes have sublinear search complexity, and therefore searching one larger index is cheaper than searching several small indexes. So we also have a vector index merger that builds a vector index LSM tree.
The background vector index merger periodically builds a cross-segment vector index for multiple segments. Unlike the hash index, there is no two-level index here; the cross-segment vector index is just a regular vector index that spans multiple segments. This does come with additional O(log n) write amplification from index merging, and it's more significant here because vector indexes are expensive to build. So we do merges less frequently, and we only do them on cold data. Yeah, so I hope this answers the question of how we can support both in-memory and on-disk vector index algorithms: we are just using them in a black-box way. One nice thing about this design is that, because we use the vector index as a black box, we can plug in any vector index algorithm. In 8.5, which is coming this January, we support many popular in-memory vector index algorithms.

Is vector search deterministic? It is not deterministic, because vector search is approximate. For example, even with the same vector index algorithm, for the same segment you will probably always get the same result, but depending on how the vector indexes get merged, the index will be built in different ways and the results can differ. So there's really no determinism guarantee. Yeah, so in 8.5 we support many popular in-memory vector index algorithms. This includes IVF flat, IVF-PQ, IVF-PQ fast scan, HNSW flat, and HNSW-PQ. So we support all the state-of-the-art algorithms. Post 8.5, we are planning to support on-disk vector index algorithms. Another nice thing is that the vector index can be built on an external device; we are actively exploring external vector index builds using GPUs. Also, it can be hard for average users to pick the best vector index algorithm for a given task. So there's one more question; yeah, I'm just trying to explain it right now, yeah. It can be hard for average users to pick the best vector index algorithm for a given task, and it is even harder to choose the various parameters a given vector index algorithm provides. Because our vector index is always built on immutable data, we can try different algorithms ourselves and pick the best one for the user. We can even try a machine learning model to do so. Moreover, we can use different vector index algorithms for different sets of segments. The user just needs to tell us what the requirements are, for example how much recall they want, whether there are any memory limitations, et cetera, and we can build an auto vector index that selects the vector index algorithm together with its parameters for the user. Yeah, so this is how we build the vector index at SingleStore. Do we have any questions so far?

I think you mentioned this before: you didn't implement all these indexes yourself. Are you just using Faiss or something off the shelf? Yeah, we just use existing vector index libraries, and we can plug in any library. We can use Faiss, we can use hnswlib, and so on and so forth. So what do you guys use? Can you say? Right now we use Faiss, but we plan to use hnswlib next because the HNSW implementation in Faiss is not that efficient. Got it. I should point out, another question from Avery in the chat. I should point out too, she also works on pgvecto.rs, the Rust-based vector index search for Postgres. That's why she has a lot of questions.
So the question is, how do you handle filtering with graph-based indexes like HNSW? Yeah, I'm going to talk about the vector search part in the next section; this section was only about the vector index. Any questions on the index build? Okay, great.

Yeah, so let's talk about vector search at SingleStore. Before we talk about any filters, let's start by looking at a regular ANN query where there are no filters. Let's start with the simple case. This kind of vector search can be represented in SQL using ORDER BY ... LIMIT. For example, we have a table t here with a vector column, and finding the k nearest neighbors to a query vector can naturally be written as a SQL SELECT query where we select the Euclidean distance between t.v and the query vector as a distance d, from table t, ORDER BY d LIMIT k. Here the special operator is just shorthand for computing the Euclidean distance. This vector search can be executed by what we call ORDER BY ... LIMIT pushdown. The aggregator already pushes ORDER BY ... LIMIT down to the leaves. It's commonly done by selecting the top k from each leaf and then doing a merge sort, but it is also possible to select fewer than k rows from each leaf initially and then iteratively select more rows. So let's focus on the leaves, because for the aggregator part we already know how to do the pushdown; the harder part is on the leaf. On the leaves, vector search can be done by pushing ORDER BY ... LIMIT down into the table scan as a top filter. The top filter will scan the vector index and select the top k rows.

So here is what the execution plan looks like on the leaves; again, we are ignoring the aggregator part. The table scan on t has a top filter that selects the top k rows. The "index" notation next to the top filter means that the filter will be executed using the vector index. The vector index scan can only scan the vector index, not the table, so the distance it computes might be approximate, for example in the case when PQ is in use. So there's another top sort that reads the selected vectors from the table and sorts based on the exact distance; it's basically the refinement step for PQ. It is possible to skip reading the table if the distance in the vector index is exact or if the user is fine with approximate distances. So let's see how we can fit vector search into the adaptive table scan we talked about earlier. The filter tree in this case is very simple: it just contains one top filter for the vector index scan, a single filter node. The vector search is done during per-partition segment selection. During per-partition segment selection, we scan all vector indexes within the partition and select the top k for the entire partition. The top filter will then evaluate to true if and only if the row is in the top k within the partition. Yeah, before we go to the filters, are there any questions about the basic ANN without any filters? I think we're good. Okay.

Yeah, now let's try to use vector search with other filters. Say we want to find the top k vectors with a certain property. This can be represented in SQL using a WHERE clause. This becomes problematic because we need to output k rows passing the filter, but if we run the vector index scan to select k rows and then run them through the filter, there will be fewer rows left after the filters. We could let the vector index scan anticipate that and output more rows than needed at the beginning, but in practice it is very hard to predict how many rows the vector index scan needs to produce. Therefore, the filters need to be executed before the vector index scan.
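As a concrete illustration of the plain and filtered ANN queries just described, here is a hedged SQL sketch. The vector index DDL, the INDEX_OPTIONS JSON, and the `<->` Euclidean-distance shorthand only approximate SingleStore's 8.5-era syntax; table and index names are hypothetical, so treat this as a sketch and check the documentation for the exact form.

```sql
-- Illustrative table with a vector column and a plugged-in vector index.
CREATE TABLE t (
  id    BIGINT,
  price INT,
  v     VECTOR(4),
  SORT KEY (id)
);

ALTER TABLE t ADD VECTOR INDEX v_idx (v)
  INDEX_OPTIONS '{"index_type": "IVF_PQ"}';   -- algorithm choice is pluggable

-- Plain ANN: ORDER BY distance, LIMIT k, pushed down to the vector index.
SELECT id, v <-> '[0.1, 0.2, 0.3, 0.4]' AS d
FROM t
ORDER BY d
LIMIT 10;

-- Filtered ANN: the WHERE clause becomes a pre-filter of the top filter, so the
-- result is the top 10 rows among those passing the filter.
SELECT id, v <-> '[0.1, 0.2, 0.3, 0.4]' AS d
FROM t
WHERE price < 100
ORDER BY d
LIMIT 10;
```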
And we do that by making the vector index filter aware of its pre-filters. Here is what the execution plan looks like. The filters are now pushed down into the top filter so that the vector index scan can be aware of them, and the semantics of the top filter become that it outputs the top k rows among those passing its pre-filters. This way we are not changing the semantics of the query when we push the filters down into the top filter. Here is how we execute it. Remember that the top filter for the vector index scan is executed during per-partition segment selection, so I'm ignoring the other two steps of the adaptive table scan. We first run segment elimination with the pre-filters; this reduces the number of false positives from the vector index scan. We then scan all vector indexes within the remaining segments and select the top L for the entire partition, where L is some value we choose based on the estimated selectivity of the pre-filters. We then run the pre-filters on these top L rows, and some rows will be filtered out. If there are at least k output rows, we are good and just select the top k among them. If there are fewer than k output rows, we need to either retry step (b) with a larger L or, if we decide the vector index scan is not worth it, just fall back to not using the vector index scan at all. So basically, to answer Avery's question, what we do is run the vector index scan, then run the filters, and if that produces fewer rows than needed, we retry the vector index scan to produce more rows. The retry is fine because it is local to the vector index filter and we haven't written anything to the downstream output yet, so we can just retry.

Yeah, so any questions about how we combine the vector index filter with other filters? Yeah, we're good. Oh, question, sorry. Why would I... Go ahead. Yeah, if we look at this SQL query, the goal is that we want to run the filters, then order by the distance, and output k rows in total. The problem is that if we execute the vector index scan first and output k rows and then run the filter, the filter drops some rows and we output fewer rows. Because the selectivity of the filters is unpredictable, there's just no way to estimate how many rows the vector index scan needs to output.

Yeah, so let's talk a little more about top filters. We believe this is something that can be useful beyond vector search: ORDER BY ... LIMIT pushdown can be a general pushdown, and the top filter can be useful even outside of the vector index, even though right now we don't have another use for it. In the general case, the top filter takes in an expression, a pre-filter tree, and a limit k. The semantics of the top filter are that it evaluates to true if the expression for this row ranks within the top k among all the rows passing the pre-filters. It is just a regular leaf node in the filter tree, and it can be used even without a vector index scan. In the general case there can be many top filters in the filter tree, each with different pre-filters. Filters outside of a top filter are post-filters, and the top filter itself has its pre-filters. Filter reordering can happen within the pre-filter tree and the post-filter tree. One good thing about how we actually use the top filter is that the retry logic only happens within the top filter; it's safe to do so because we haven't output any rows to the next operator. So now let's see: we can even do joins.
Because the vector data is stored in a SQL database, it might be common to use vector search with a join, where the filters are applied on one table while the vector search is done on another table. Here is an example of doing that; just read the SQL query. This can be executed by first scanning table s with its filters, then building a hash table on the filtered rows, then doing a vector index scan with a pre-filter that just checks whether there is a matching row in the hash table. So basically we can do joins with vector search. This is something the specialized vector databases usually don't do. It is even possible to combine full-text search and vector search. Elasticsearch also gives you a way to combine full-text search and vector search; here are its semantics, and I want to show that we can also express them in SQL.

In Elasticsearch, the user can specify a query containing multiple subqueries, and each subquery has its own type: it can be either a full-text query or a vector search query. For a given row, each subquery produces a score. The final score is a weighted sum of the individual scores, and the query selects the rows with the highest final score. The way Elasticsearch executes this is by executing each individual subquery separately as a filter to select the rows that have a positive score for that subquery. For the ANN subquery, a row has a positive score if it's in the top k; otherwise it is not selected. It then unions all the rows selected by each subquery, computes the final score for all the rows output in step two, and outputs the highest ones. Yeah, again, these are just the semantics from Elasticsearch. There might be other ways to combine vector search with full-text search, but this is one way of doing it, and I'm just showing that we can implement those semantics in SQL.

So here is how we can represent such a query in SQL. Note that there is a WHERE clause with filters, we only want to output k rows passing those filters, and we want to find the highest weighted score over the full-text and vector searches. The execution plan looks something like this: we do a table scan on t, and the filter tree involves two branches. The first branch is for the full-text index scan, ANDed with the filters; what SingleStore supports is using full-text search together with other filters, and full-text search doesn't have these weird semantics. The second branch is for the vector index scan, with the filters pushed down into the vector index scan. So we are able to construct a filter tree for this case and just execute it, and the scores are projected out and then sorted by the weighted sum. Yeah, are there any questions about the join and the full-text cases?

Can you go back to the join? Is the vector a column in a table, and are you joining two tables based on how similar that column is? I wasn't sure how exactly that works. Yeah, so here the join condition is not on the vector; it's an attribute match. It's a good question, though: we have also considered a vector index join, where the tables are joined by vector similarity. I didn't put it in the slides, but we have discussed that internally. Okay.
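To make the join and hybrid examples above a bit more concrete, here is a hedged sketch of what such queries could look like. The slide SQL is not reproduced here: table and column names are hypothetical, the `<->` operator and DOT_PRODUCT/MATCH ... AGAINST forms only approximate SingleStore's syntax, and the weighted-score query is a simplified rendering of the Elasticsearch-style semantics rather than the exact plan described.

```sql
-- Vector search combined with a join: filter `users`, build a hash table on the
-- matches, and use membership as a pre-filter of the ANN scan on `docs`.
SELECT d.id, d.v <-> '[0.1, 0.2, 0.3, 0.4]' AS dist
FROM docs AS d
JOIN users AS u ON d.owner_id = u.id
WHERE u.country = 'US'
ORDER BY dist
LIMIT 10;

-- Hybrid scoring sketch: weighted sum of a full-text score and a vector
-- similarity score, keeping only rows that pass the WHERE filters.
SELECT d.id,
       0.3 * MATCH(d.body) AGAINST ('database systems')
     + 0.7 * DOT_PRODUCT(d.v, '[0.1, 0.2, 0.3, 0.4]') AS score
FROM docs AS d
WHERE d.lang = 'en'
ORDER BY score DESC
LIMIT 10;
```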
It doesn't really make sense to do joins on the vectors themselves, because they're not human-understandable; you call some transformer and it spits out this array-of-floats vector, and you don't know what any of the values mean. It's just: for any input, can I transform it into some vector in the same high-dimensional space and then do my similarity search. So joining on the vectors themselves doesn't make sense, but joining two tables based on something about vector similarity makes sense, potentially. I see. Okay. I feel like, I mean, this is the Elastic way, but, and I don't know the answer, it feels like the most inefficient way to do this. It's like, hey, let me take the one example: okay, I'm going to pull apart all the subqueries and run them individually, even though they might be correlated and I could roll them into a join, which would be the best way to execute it. But I don't think you can, because of these weird semantics of, yeah, find things that kind of match in this Lucene-y way. Yeah, I think so. We are not sure what other ways there are to put a vector search into a SQL query. As you mentioned, the semantics of that SQL query basically force us to execute it in that way. There might be other ways to express vector search in a SQL query that allow for more efficient execution, but we don't know. Yeah. Thanks.

Yeah. So this is the last slide. The vector index can also be used for things other than regular ANN vector search. One example is vector range search, where we want to find all vectors whose distance to a target vector falls within a certain range; this can also be done efficiently with the vector index. Another interesting example is Maximal Marginal Relevance, or MMR. In MMR, instead of searching for the k nearest neighbors, it searches for k representative nearest neighbors, where new neighbors can't be too close to previously selected neighbors. This can also be built on top of regular ANN vector search. Yeah. Thank you. I do want to mention that some of the things I talked about in these slides are not implemented in 8.5; things like the vector index merger and filtered ANN are not in 8.5. But most of the other things are already covered in 8.5. We are a small team working on the vector index at SingleStore, and we do plan to support the rest in upcoming releases. Yeah. Any questions? Okay. I will applaud on behalf of the class; yeah, we'll explain what's going on. Thank you.

We have time for a few questions for Cheng. So what's a one-sentence summary you can give that differentiates SingleStore from Postgres or MySQL or SQLite or DuckDB, some of the systems the students may be aware of? Why would someone choose SingleStore over another system? I think the main differentiator is that we are a general-purpose database that can work for a variety of workloads. We can handle both operational and analytical workloads, we can handle structured and unstructured data, and we have a lot of other features. Got it, awesome. Good. And then plus the scale-out story, plus the compilation. Yeah, yeah, the scale-out. It's very fast and it scales very well. And you gave a talk mentioning lock-free; I appreciate that, that was good. All right, guys. Again, thank you Cheng for spending an hour with us. This was super insightful. All right, class.
On Wednesday, two days from now, we'll be in person back at the lecture hall, and that will start off with the review for the final exam. I'll post the practice final on the website, probably tomorrow. And if you haven't yet, please go vote for what system you want to learn about. Again, you can't vote for SingleStore because you just learned about SingleStore, so pick another database system, okay? And then I will try to cover as many data systems as I can in an hour, okay? All right, again, Cheng, thank you so much for spending time with us. I appreciate it. And everyone, have a good rest of the day. Good luck with the rest of your classes. See ya. Yeah, thank you, Andy. Bye-bye.