The Databases for Machine Learning and Machine Learning for Databases seminar series at Carnegie Mellon University is recorded in front of a live studio audience. Funding for this program is made possible by Google and from contributions from viewers like you. Thank you.

Hi guys, welcome. It's the last talk of the semester. We're excited today to finish things off with the guys from Chroma. Hammad and Liquan are from the founding team of Chroma, and it's another vector database. They're here to talk about what makes them special and what makes them different. As always, if you have any questions for the Chroma guys as they're giving the talk, please unmute yourself and say who you are, and feel free to do this at any time, so that they're not talking to themselves for an hour on Zoom. So with that, Hammad and Liquan, thank you so much for being here. The floor is yours. Go for it.

Awesome. Thanks, Andy, and thanks to the course staff for putting this on. This is definitely something I wish existed when I was in school. It's cool to hear from everyone, and it's been really fun; I've been following along with the course throughout the semester and watching everyone's talks. So today we're going to talk about Chroma. Hopefully it won't be too much that you've already seen in other talks. We'll keep it interesting and talk a little bit about what we think is our unique perspective on vector databases, and specifically our unique perspective on what's needed to build an end-to-end retrieval system for large language models.

Before I start, I'll talk a little bit about Chroma. The project itself was launched in February of this year. It's been a tremendous year for AI; I know it's like the peak of the hype cycle right now. But we've had roughly 850,000 downloads per month on PyPI alone, and we're seeing about a million machines running Chroma per month. We've been used in over 12,000 projects, and it's been really cool to engage with the open source community. There are something like 80 contributors helping us a lot with small fixes and giving us feedback, and a core team of five, soon to be six or seven people.

When we think about retrieval, and when we think about vector databases in the context of language models, it's important to think somewhat about language models themselves. So I'll start with the basics: what is a language model? Naively, we can look at it this way: we have some query or some question, and we get out some answer. It's a really naive perspective of a language model, but hopefully one that makes sense to everyone. By now you've probably used things like ChatGPT or plenty of the open source models: you give some question and you get some answer. And really what this looks like is, we might ask the language model, hey, what is relational algebra? Perhaps the language model has been trained on Wikipedia, and so it regurgitates to us the definition we might see if we search relational algebra on Wikipedia: a theory that uses algebraic structures for modeling data and defining queries on it with well-founded semantics. But what does the language model actually do when we ask it this question? What it does first off is it takes your text and it tokenizes it.
There are various tokenization schemes, but all of them basically amount to: you have some fixed set of tokens, some n thousand potential tokens that you can turn your text into. Quite literally, we just get an array of integers; that's what we turn the text into. We pass that into the language model, and it tells us, over all possible tokens we've trained the model on, the probability that each one is generated next. And we do this repeatedly, in an autoregressive fashion, generating the entire string.

This is important to understand, because if you do things this way, you're going to be quite prone to generating the wrong piece of information. A huge problem with language models is that they, quote, store knowledge in their parameters. It's a parametrically trained system: we have a potentially deep nested sequence of transformer encoders and decoders, and we are creating a language model that learns the probability of certain strings. But there are some problems with that. We can't update them in real time. We can't update them deterministically. We can't provide provenance for their knowledge: how do we get a language model to tell us where it got this information from? And they also have a tendency to hallucinate, because they're just modeling language, predicting the next most likely token. People argue about whether or not these things are learning world models; we can debate that separately, and please email me if you're curious about these things, I find them fascinating. But they have a tendency to hallucinate because they are just predicting the next token: in the most naive decoding scheme for a language model, we always blindly sample the next most likely token.

So a technique that many people have been using is what's called retrieval augmented generation, and the idea is pretty simple: you combine parametric and non-parametric memory. This comes from a paper by Patrick Lewis and team at FAIR a few years ago. As opposed to just taking some text and giving it to a language model, can we take some text and give it to the language model in addition to some documents that might help ground the language model's generation, to generate some output y? More concretely, what this looks like is: we have some corpus of documents, we figure out some scheme to help us select a subset of those documents, we literally attach that subset to the input text for the language model, rerun our generation scheme, and get some output text y.

There are a couple of different decoding schemes for this. In the most naive scheme, we generate the entire sequence in one shot with all of the documents attached to the prompt. So if you're asking the language model what relational algebra is, perhaps you give it some sequences that seem relevant to answering that question and then have it generate in one shot. In another approach, you might attach one document per run, do many runs of the language model in parallel, and then compute the marginal probabilities across all of those. In another scheme, you might even do this lookup per token. The point I'm trying to elucidate here is that there are many small variations on the same strategy, and developers often struggle with this. Which one do I use? Which one will work best? How do I reason about what's better for my specific use case? How do I try them in real time, online, and evaluate the differences? These are all problems that Chroma hopes to solve one day.
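To make the naive retrieval-augmented scheme concrete, here is a minimal sketch in Python. The `embed` and `generate` helpers are hypothetical stand-ins for an embedding model and a language model call, not Chroma or any provider's actual API:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical encoder stub; a real system would call an embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

def generate(prompt: str) -> str:
    """Hypothetical LLM call; stands in for the autoregressive decoder."""
    return f"<answer conditioned on {len(prompt)} chars of prompt>"

def rag_answer(question: str, docs: list[str]) -> str:
    doc_vecs = np.stack([embed(d) for d in docs])
    q = embed(question)
    # Select the top-k documents by inner product (the similarity used in the RAG paper).
    top = np.argsort(-(doc_vecs @ q))[:3]
    # Naive one-shot decoding: attach all retrieved documents to a single prompt.
    prompt = "\n".join(docs[i] for i in top) + "\n\nQuestion: " + question
    return generate(prompt)

print(rag_answer("What is relational algebra?", [
    "Relational algebra uses algebraic structures for modeling data.",
    "HNSW is a graph-based vector index.",
    "BM25 is a scalar ranking scheme.",
]))
```

The per-document and per-token decoding variants mentioned above differ only in where the retrieval call sits inside this loop.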
And specifically, for the way we actually do this retrieval step, this problem of "I have some set of documents and I want to select the right subset," the technique most people have been using is dense vector similarity search. The basic idea is, and I'm sure you're familiar: we use some sort of dense encoding model, usually a learned model, to map text passages to vectors, and then we perform similarity search, which is usually just raw k-nearest-neighbor search for a given query. Similarity can be any metric; for the paper I'm citing here it was the inner product, and they actually trained the embedding model against the inner product.

And what you quickly see is that, compared to a scalar ranking scheme like BM25, the accuracy on many retrieval datasets instantly jumps quite a bit. For example, on Natural Questions, a retrieval benchmark built from highlighted passages of Wikipedia with human annotations, the learned embedding model performs a lot better even after being exposed to only a thousand training samples, and after we expose it to all of the training samples it performs about 20% better. If you're curious, you can read more in the paper. But the point I'm trying to make is that by using dense similarity search, you can get much more accurate and contextual documents for the problem you're trying to solve at any given moment.

What this looks like concretely is: you input some documents D1 through Dn, and you chunk these into smaller pieces, because the embedding model has some fixed window within which it can turn a piece of text into an embedding. Then we take each chunk, tokenize it, and embed it. And, because, as you'll remember from the previous slide, it's sometimes useful to combine traditional BM25-style ranking with vector ranking, you might also want to index the chunks in a scalar fashion: perhaps you build a full-text search index over the documents, or some other ranking index over the scalar components of your documents. Then we take each chunk and index its vector component. Afterwards, you might query these using the vector index, and maybe apply re-ranking using a learned re-ranking model. This is generally the pipeline people follow from raw documents all the way through to retrieved results. It's a combination of many different pieces, each with its own characteristic knobs and relevant strategies, and tuning these is where developers spend a huge proportion of their time and run into a huge number of potential pitfalls.

Our perspective is that there should be tools in the market that let people avoid thinking about these individual pieces in isolation and building each system in isolation: end-to-end systems that help you do all of these together in an easy way, and iterate on the system as a whole over time, as opposed to having to take each problem, decompose it into its own respective system, and put them together manually.
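As a rough illustration of that end-to-end pipeline, here is what it looks like through Chroma's Python client as it existed around the time of this talk (the API may have evolved since). The naive fixed-width chunker and the metadata fields are illustrative; Chroma handles the tokenize-and-embed step with its configured embedding function:

```python
import chromadb

def chunk(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Deliberately naive fixed-width chunking; real systems tune this per task."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

corpus = ["Relational algebra is a theory that uses algebraic structures "
          "for modeling data and defining queries on it."]

client = chromadb.Client()
collection = client.create_collection(name="docs")

for doc_id, doc in enumerate(corpus):
    for j, piece in enumerate(chunk(doc)):
        # Each chunk is embedded and indexed; scalar metadata is stored too,
        # so it can back full-text or other scalar filters alongside the vectors.
        collection.add(ids=[f"{doc_id}-{j}"],
                       documents=[piece],
                       metadatas=[{"source": str(doc_id)}])

# Query the vector index; a learned re-ranker could reorder these results.
print(collection.query(query_texts=["what is relational algebra?"], n_results=1))
```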
So there are a couple of use cases for this retrieval workflow that have been emerging over the last year or so. The first is dialogue. We've all used ChatGPT by now, and you might have some additional data that you put into its context: you want to have a dialogue with the model, but also give the model access to personalized or private data. This data corpus is personal to the specific user in mind, or to the specific team using the product.

Another use case is agents. This is newer and emerging, but one we're pretty excited about. The idea is you have either autonomous or somewhat managed agents that interact with external services and systems. As a small example, perhaps you have some system that manages your DevOps for you: an agent that understands your AWS setup, so you can just say, hey, I want to set up CI/CD for this workflow, and it goes and does that because it knows how to look at your system. And then it stores the skills it learns in a vector database. This pattern of storing learned knowledge, learned experience, and learned skills in a vector database has more recently started to be studied. A paper we've been excited about is called Voyager, which happened to use Chroma: they basically train a Minecraft agent to play the game of Minecraft, and it stores its learned skills inside a vector database.

And another one is autocompletion, or predictive AI. The most relatable example for this audience would be something like GitHub Copilot: you're writing code, and we're able to predictively fetch the right pieces of documents, the right pieces of code, in order to ground the language model's generation based on what you're writing. If you're writing some code, it might be useful to fetch similar code to give the language model a few-shot prediction of what code it should suggest.

These use cases entail a very specific workload shape, and the argument I'll make is that this workload shape is rather different from what vector indices were originally created for, and from what vector search research traditionally focused on, which was very large scale, very high throughput datasets: usually one index, built offline and updated in batch. This workload is rather different. We have datasets where the QPS is relatively marginal: we're not shooting for one index with very high queries per second; most of our customers are okay with a hundred to a thousand QPS on a given index. But the data is very real-time. It's not updated in batch, and the data lag they're willing to tolerate is somewhere on the order of 100 milliseconds. And the data scale is fundamentally relatively moderate, on the order of one to a hundred million vectors. We're not dealing with billion-scale datasets: a personalized dataset for a chat dialogue doesn't have many billions of embeddings, it has somewhere on the order of one to a hundred million. But those embeddings might be frequently updated and deleted.
Take, for example, document editing software: it needs to frequently update and delete pieces of documents as users edit them, so that an agent working with you to suggest what you may or may not want to write can respond to the updates and deletes in the document. But to me, the largest difference is the number of indexes in any given deployment. You might end up having an index per user, so you might have on the order of a hundred million indices for a reasonable workload at many companies. And you want a very high recall target out of these, because you're conditioning a language model's generation, and it's been shown that language models are quite poor at ignoring retrieved data that isn't related to the task they're trying to perform. So we can't really tolerate much deviation in recall.

One other thing that's pretty different: the vectors we're dealing with are quite large. If you go look at a lot of the literature on vector indices, the dimensionality of the vectors tends to be quite small, around 96 to 128 dimensions; that's what most of the benchmarks in most of the papers in the vector database world use. But the vectors we deal with are much larger than that, and that makes a really big difference in the number of distance comparisons you end up doing when searching vector indices. The dimensionality we deal with is somewhere between 768 and 4,096 dimensions, which is much, much larger than the dense vectors you saw in older vector search workloads.

What people usually do is use some sort of approximate nearest neighbors index, to deal with the fact that you can't exhaustively search the whole solution space. Broadly, there are two classes of algorithms in use. The first is inverted indices: you create some centroids over your data and you assign data points to a given centroid. Usually this is used in conjunction with some sort of compression scheme, usually product quantization, so you're compressing the individual points and assigning them to centroids. All of these algorithms are basically trying to reduce the number of distance comparisons you do by pruning the search space.

The other commonly used class is proximity graphs, which I'm sure you've seen from other speakers from industry: we create a graph expressing neighbor relationships. These often approximate a Delaunay or relative neighborhood graph. The mathematical intuition you can have there is that the Delaunay graph is the dual of the Voronoi diagram, and the Voronoi diagram is obviously a good tool for dividing up space; the relative neighborhood graph is closely related to the Delaunay triangulation of any set of points. That gives us some intuition for why a proximity graph, where any point is connected to nodes in its local region, might be useful as a way to divide up space and search through it.
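Both families exist to avoid the exhaustive baseline. For a sense of why pruning distance comparisons matters at these dimensionalities, exact search is just this, O(N·d) work per query (the sizes here are illustrative):

```python
import numpy as np

def brute_force_knn(query: np.ndarray, data: np.ndarray, k: int) -> np.ndarray:
    """Exact k-NN: one full-precision distance computation per stored vector."""
    dists = np.sum((data - query) ** 2, axis=1)  # N comparisons of d dims each
    return np.argsort(dists)[:k]

rng = np.random.default_rng(0)
data = rng.standard_normal((100_000, 1536))      # e.g. 100k embeddings at 1536 dims
query = rng.standard_normal(1536)
print(brute_force_knn(query, data, k=10))
```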
So I'm going to talk a little bit more about inverted indices, which are a commonly used index type in industry. What I'd like to do is walk through inverted indices, then walk through commonly used proximity graphs, and talk about why they're not the best fit for the workload we're targeting. Then I'll talk about what exactly we're doing and how it's somewhat different.

Traditionally, when people use inverted indices, they do so with product quantization. The classic workflow: you take the chunks of documents we discussed earlier, which are d-dimensional, you run a clustering algorithm like k-means, and you end up with a list of centroids. Then you quantize your chunks: they're d-dimensional, and you apply some sort of quantization; perhaps you divide them into six subspaces and get a compression factor of six. Then you take all of your chunks and assign them to the closest centroids. At query time, you find the n centroids closest to your query by computing the distances to the centroids, and that's how you know which posting lists to actually search. Then you re-rank: because the quantization step reduces the precision of the comparisons within each posting list by quite a bit, you recompute the full-precision distance for each candidate in the final list.

This is a very commonly used approach. One of the problems with it is that recall ends up suffering quite a bit; you don't get the answers as accurately as you'd want. So what people end up doing is they way over-query their index, by a factor of 100 to sometimes 1,000x, and then re-rank that list. If you actually measure the number of distance comparisons you end up doing, it raises the question: are there things we could be doing that would be better than these techniques?

When you say way over-query, you just mean like LIMIT a thousand? Yes. Okay, like they want 10, but they limit a thousand. Yeah, sometimes a hundred to a thousand times the number they actually want to retrieve; they just set k quite high. Yeah. Thanks.
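Here's a toy sketch of that inverted-index query path, leaving out the product quantization step (so the posting lists hold raw vector ids rather than compressed codes), with a random subsample standing in for the k-means step:

```python
import numpy as np

def ivf_build(data: np.ndarray, centroids: np.ndarray) -> dict[int, np.ndarray]:
    """Assign every vector to the posting list of its single closest centroid."""
    assign = np.array([np.argmin(np.sum((centroids - x) ** 2, axis=1)) for x in data])
    return {c: np.where(assign == c)[0] for c in range(len(centroids))}

def ivf_query(q, centroids, lists, data, n_probe=4, k=10):
    probe = np.argsort(np.sum((centroids - q) ** 2, axis=1))[:n_probe]  # closest lists
    candidates = np.concatenate([lists[c] for c in probe])              # pruned space
    dists = np.sum((data[candidates] - q) ** 2, axis=1)                 # full-precision re-rank
    return candidates[np.argsort(dists)[:k]]

rng = np.random.default_rng(1)
data = rng.standard_normal((10_000, 128))
centroids = data[rng.choice(len(data), 64, replace=False)]  # stand-in for k-means
print(ivf_query(data[0], centroids, ivf_build(data, centroids), data))
```

With PQ in the picture, the re-rank step is what restores precision, and the over-querying described above is the workaround for the recall lost inside the lists.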
And the other vector index that's commonly used is a proximity graph algorithm called HNSW, or Hierarchical Navigable Small Worlds. This is work from Yury Malkov and Dmitry Yashunin, and the idea is really simple and really elegant: you build a traditional proximity graph, and then, inspired by skip lists, you build layers in this graph. The top layer of the graph has the longest-range connections, and the layers below have smaller and smaller connections. The way you can think of this is that the diameter of your graph is shrinking: the number of hops from any one node to the furthest node is decreasing. This hierarchy of graphs takes the search from polynomial time to logarithmic time.

The other trick they employ, in addition to the hierarchy, is a heuristic that I think is really important to pay attention to. Oftentimes, as you're building proximity graphs, you end up with points that are on the edge between two clusters, and so they have a heuristic that encourages cross-cluster connections. This sort of problem comes up a lot when you're partitioning space, and we'll talk a little later about how you can apply the same heuristic to inverted indices and end up with something better than the traditional inverted index approach.

What I think is really important for our conversation, even if you don't understand how HNSW works at all, is that it's a graph, and graphs are not that great to store on disk: the access pattern is really random. People have done work, for example DiskANN, where you build the graph in a way that's more amenable to disk, but at the end of the day these things are really hard to reason about. The fundamental access pattern of a graph is very random. If you divide your graph into some number of pages and look at the page accesses, you end up touching a large percentage of the graph's pages on every query. So what people do is store the entire graph in memory, and that quickly becomes a bottleneck, both from a cost perspective and from a usability perspective: how do you think about deploying this thing? Another problem with HNSW is that, because of its design, deletes tend to degrade the graph, and the commonly used solution is to rebuild the HNSW index entirely: say 20% of the graph gets deleted, you just rebuild the whole thing.

Inverted indices also have their own set of problems. The k-means clustering step often partitions the space poorly, so you suffer from low recall. It's also very hard to reason about how many lists you need to probe. It's often used in conjunction with PQ, which brings the over-querying problem I mentioned, and this is very hard for people to tune, since different queries might need a different number of probes, a different number of centroid lists that you actually want to search. Also, if you have real-time data, which is quite important to us, your data can drift quite a bit relative to the centroid step: if you're indexing data over time, the centroids were only built from the data present at one point in time, and that leads to drift in your centroid clustering, which gives the inverted index poorer recall over time.

So there's some interesting work done at Microsoft about two years ago, an algorithm called SPANN, where they take the idea: how can we combine these two approaches? We store the centroids in a graph in memory, and we keep the posting lists on disk. So we do our search over the centroids through an in-memory approximate nearest neighbors index, and then we store the actual posting lists on disk. What's interesting here is they keep a very high centroid count: 10 to 20 percent of the data points actually become centroids.
So if you had a million-scale dataset, you would have 100,000 centroids. This actually saves you a lot of memory and lets you push a lot of the workload to disk. But also, what happens now is you have a very fast layer in memory, and you can keep your posting lists very small. And when the lists are very small, because you have a very high centroid count, it's easy to reason about the disk access pattern: you know exactly how many lists are going to be accessed, you can plan your IOs really well, and you can fetch these posting lists off disk quite effectively.

Then there are the other tricks they employ. You'll recall I mentioned this notion of points on a boundary. What you can do is assign points to multiple centroids; they call this closure clustering. The idea is that when you assign a point, you assign it to an additional centroid only if its distance to that centroid is within the distance to its closest centroid plus some epsilon. In other words, we attach a point to multiple centroids only if it looks like it might be on the boundary between multiple centroid clusters.

And then we also follow a relative neighborhood graph rule for centroid assignment, so we don't overpopulate our centroid lists when a point sits on a boundary. What we do is skip assigning a point to a cluster if the distance between that point and the cluster is greater than the distance between the two clusters themselves. In the diagram here, we have this orange point being assigned to potentially the bottom-left cluster and the bottom-right cluster, and we don't actually assign it to the bottom-left cluster, because the distance between the two clusters, the one it's already been assigned to and the bottom-left one, is less than the distance between the point and the bottom-left cluster. That means the clusters probably overlap, so at query time, when we go to look up a point in this region, we're probably going to end up looking in both lists anyway; it doesn't make any sense to store the point twice. A really simple rule, but if you do a quick analysis of how many lookups it saves you, it's quite a bit.

The last thing they do is observe that not all queries need to visit the same number of centroids. So they apply the same pruning rule used for assigning points to centroids at query time: we won't search a posting list unless its centroid is within some epsilon of the centroid closest to the query. These three tricks make a huge difference in the recall performance you get out of inverted indices. On the bottom here we have a graph where the blue line is traditional inverted indices; I forget what dataset this is, to be frank, but the results hold. Then we allow assignment to up to four centroids, then eight centroids, then ten centroids, and really quickly you see a very large jump in recall performance from this really simple trick of dealing with points on the boundary.
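Here is a small sketch of those two assignment rules working together. The exact form of the closure threshold varies (a relative `(1 + epsilon)` factor is used here), and `epsilon` and `max_assign` are illustrative knobs, not Chroma's or SPANN's actual parameters:

```python
import numpy as np

def closure_assign(x: np.ndarray, centroids: np.ndarray,
                   epsilon: float = 0.1, max_assign: int = 8) -> list[int]:
    """Return the ids of every centroid whose posting list should hold `x`."""
    d = np.sqrt(np.sum((centroids - x) ** 2, axis=1))
    order = np.argsort(d)
    nearest = d[order[0]]
    chosen = [order[0]]
    for c in order[1:max_assign]:
        # Closure rule: only consider centroids within (1 + epsilon) of the nearest.
        if d[c] > (1 + epsilon) * nearest:
            break
        # RNG rule: skip c if an already-chosen centroid is closer to c than x is,
        # since those two clusters overlap and posting x twice would be wasted.
        if np.any(np.sqrt(np.sum((centroids[chosen] - centroids[c]) ** 2, axis=1)) < d[c]):
            continue
        chosen.append(c)
    return [int(c) for c in chosen]

rng = np.random.default_rng(2)
cents = rng.standard_normal((100, 32))
print(closure_assign(rng.standard_normal(32), cents))
```

The query-time pruning rule is the same comparison run in the other direction: a posting list is only visited if its centroid is within the epsilon band of the centroid closest to the query.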
Another small trick we've been playing with is called ADSampling. The observation here is that the bulk of search time is spent rejecting candidates based on distance comparisons. We have graphs here of distance comparison operations (DCOs) in HNSW and IVF, and you can see that most of these operations are spent rejecting candidates. When you're searching, all you care about is: is the point I'm looking at further away than the worst of the closest points I've seen so far? Is it worth adding to my top-k list? So instead of having to do the full distance comparison, can we compare just a subset of the dimensions? Intuitively, if you have two points in Euclidean space that are really far apart, you don't need to do the math for every dimension: at some point the math over some subset of the dimensions crosses a threshold where you can prune that point from your search entirely.

What they do in this paper is use a result from high-dimensional analysis, the Johnson-Lindenstrauss lemma, which lets you say that if you take a point and apply a random orthogonal transformation to it, you can bound its norm, and hence its distance, within some epsilon using only a subset of the dimensions. They build a hypothesis testing procedure on top of this bound that lets them, during search, flexibly compare only a subset of the dimensions for any given point, and this ends up speeding the search up quite a bit.
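A simplified sketch of that incremental test follows. It assumes the vectors have already been multiplied by a shared random orthogonal matrix, and it replaces the paper's exact significance threshold with a fixed `confidence` factor, so it illustrates the shape of the technique rather than the precise statistics:

```python
import numpy as np

def is_rejected(x: np.ndarray, q: np.ndarray, threshold_sq: float,
                step: int = 32, confidence: float = 1.5) -> bool:
    """Try to prune candidate `x` without a full distance computation.

    Assumes len(q) is a multiple of `step` and that both vectors were already
    multiplied by a shared random orthogonal matrix."""
    d = len(q)
    partial = 0.0
    for m in range(step, d + 1, step):
        partial += float(np.sum((x[m - step:m] - q[m - step:m]) ** 2))
        est = partial * d / m  # unbiased estimate of the full squared distance
        # Reject once even a conservative version of the estimate exceeds the
        # squared distance of the current k-th nearest candidate.
        if est / confidence > threshold_sq:
            return True
    return False  # survived all tests: `partial` is now the exact squared distance

rng = np.random.default_rng(3)
q, x = rng.standard_normal(1024), rng.standard_normal(1024)
print(is_rejected(x, q, threshold_sq=50.0))
```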
And so our index combines these two approaches. We divide the data into real-time and historical segments. We index real-time data into an HNSW graph, and we compact historical data into inverted indices using the tricks from the SPANN paper I just mentioned: closure clustering, the relative neighborhood graph rule, and flexible query-time pruning. When we query, we use ADSampling to reduce the overall distance computations required for any one query. We store the centroids in HNSW and store the posting lists separately. Using an inverted index approach increases the latency of our queries a little, but it allows us to separate the storage and compute layers of our architecture, and that leads to a lot of benefits, both for developer experience and for general operational flexibility. And so now I'll pass it off to Liquan to talk more about the core architecture and how the index actually lives inside the full distributed system.

Yes, cool. I'll share my screen.

Before we go to that part, quick question. That was excellent so far, this is really great. Could you shed a little light on quantization? There was a slide a few slides back that said in general people do quantization. How aggressively do you use that in your overall scheme? Or, if you're going to answer that question later, you can defer it.

Yeah, that's a good question. We actually don't use quantization at all, the reason being that it is really hard to reason about the performance of your system when you use quantization. What people end up doing is tuning the quantization scheme, or the over-querying scheme, to deal with the recall loss they face from quantization. If we were just running our own system, I think that would be easy for us to reason about, but because we have a lot of developers who are being exposed to these concepts for the first time, we've been in search of a solution that doesn't leverage quantization. That's why we have this other set of tricks that lets us avoid quantizing the data, at the expense of some latency, obviously. Thank you.

Hi, a quick question, Martin Promer from University of Wisconsin-Madison. You talked a lot about using distance metrics. Have you looked into any of the alternative approaches where you train a separate model to do your distance calculations, and then rely on this distance-measuring model instead of the more traditional approaches? Yeah, that's something we've been interested in, but to be honest we haven't spent that much time investigating it. I'm personally really curious about how those approaches perform, but we haven't used them ourselves. Okay, thank you.

I think we can get started. Thank you for the introduction to the problem, the workload, and the index we're trying to build. I think this is one of the cornerstones of a vector database, and one of the cornerstones of the retrieval architecture for large language models. Another cornerstone is the distributed protocols we build to ensure the safety and liveness properties of the system. We'll talk about a few problems and the solutions we propose to achieve safety and liveness, but before that I'll quickly describe the Chroma architecture, starting with a single node.

Single-node Chroma is pretty simple. It's a server containing a front-end, a metadata index, and a data index. When people send nearest-neighbor queries, the queries first go through the metadata index, then query the vectors, and then return the results. A couple of highlights of the single-node architecture: currently we use HNSW as the vector index, and we use SQLite as our metadata index, which also stores the system-level catalog. This is the single-node version.
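To make that single-node read path concrete, here is a toy version of its shape: the metadata index narrows the candidate set, then the vector side is searched. Chroma's real segment interfaces differ, and a brute-force scan stands in for the HNSW search here:

```python
import sqlite3
import numpy as np

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE embeddings (id INTEGER PRIMARY KEY, source TEXT)")
con.executemany("INSERT INTO embeddings VALUES (?, ?)",
                [(i, f"doc{i % 3}") for i in range(1000)])
vectors = np.random.default_rng(0).standard_normal((1000, 384))

def query(query_vec: np.ndarray, source: str, k: int = 5) -> list[int]:
    # 1. Metadata index (SQLite): resolve the scalar filter to allowed ids.
    allowed = [row[0] for row in con.execute(
        "SELECT id FROM embeddings WHERE source = ?", (source,))]
    # 2. Data index: the real system runs an HNSW search restricted to the
    #    allowed ids; a brute-force scan stands in for it here.
    dists = np.sum((vectors[allowed] - query_vec) ** 2, axis=1)
    return [allowed[i] for i in np.argsort(dists)[:k]]

print(query(np.zeros(384), "doc1"))
```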
In the distributed version, we basically decompose these components into several services. We have a coordinator, which manages the system catalog, and we introduce front-end servers, which act like proxies, sending queries to the log or to the servers depending on whether it's a read or a write. We introduce a distributed log: raw writes go to the distributed log first, and then the index nodes and query nodes tail the distributed log and build their views and indexes incrementally.

Next, the read and write paths for Chroma. On the write path, writes first go to the log; the index node tails those logs, and once a segment reaches a certain size threshold, it flushes to object storage like S3. On the read path, the front-end servers figure out the assignment of the active segments and the historical segments, and route the query to the right nodes. We use a hashing scheme to reduce global coordination: distributed logs like Kafka or Pulsar have the concept of topics, and based on the topic ID we assign topics to different nodes; the historical segments also have IDs, and based on those IDs we assign them to different nodes.

Here I want to call out a few highlights of the system. It's very different today compared to ten years ago: there's a lot of ecosystem we can leverage to build distributed systems. We can use object storage, we can use distributed logs, and we can leverage Kubernetes to manage membership and do notifications. So one principle is that we want to leverage the ecosystem as much as possible. However, we also want to build the critical capabilities, such as routing information propagation protocols and topic reconfiguration protocols, that ensure the system has certain safety and liveness properties. And we lean heavily on hashing in our system to reduce global coordination.

Next I'll talk about one problem we're solving, which we call routing information propagation. On our write path, we have one active segment that tails a log, ingests the data incrementally, and builds indexes; once it reaches a certain threshold, it becomes a historical segment. You can think of all the historical segments as generated from the active segment. It's important to make sure that during segment generation and splits, the front-end servers always get the most up-to-date routing information, so that a query covers all the data for a collection: we don't want a data miss or a partial query during an active segment split or a node addition or removal. The protocol we built is a monotonic protocol that ensures no query reads from stale routing information, because this is essentially an asynchronous system. What the indexing node does is, as it tails the data and flushes segments to disk, it registers them with the coordinator, and the coordinator broadcasts the new term and epoch to the front-end servers. The term is cluster-wide, indicating a new assignment of the member list, and it is only incremented in response to node additions and removals. The epoch is a per-collection concept: each time we split or generate a historical segment, the epoch is incremented. Because the system is asynchronous, the front-end servers may receive updated terms and epochs in a delayed fashion, and similarly for the workers.

So how do we make sure the system always carries the right information? We have pull and push, plus a reconciliation rule. Pull means that when a new front-end server starts, it pulls the latest term and epoch for its collections from the coordinator. Push means that when there's a new membership, or a new segment is generated, the coordinator pushes the updates to the front-end servers and workers. Reconciliation means that every query carries the term and epoch observed by the front-end server, and the worker checks whether the term and epoch in the query are smaller than, greater than, or equal to its own. If they are smaller, the front-end server is lagging behind, and it needs to query the coordinator to fetch the most up-to-date information.
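A sketch of that reconciliation check is below. The names and the handling of the symmetric case (a query carrying a newer version than the worker has seen) are assumptions of this sketch, not Chroma's actual wire format:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RoutingVersion:
    term: int   # cluster-wide; bumped only on node addition or removal
    epoch: int  # per-collection; bumped on each segment split/generation

def reconcile(query_v: RoutingVersion, worker_v: RoutingVersion) -> str:
    if (query_v.term, query_v.epoch) < (worker_v.term, worker_v.epoch):
        # Front-end is lagging: it must refetch routing info from the coordinator.
        return "reject: front-end must refresh from coordinator"
    if (query_v.term, query_v.epoch) > (worker_v.term, worker_v.epoch):
        # Worker is lagging (the symmetric case; an assumption in this sketch).
        return "wait: worker must catch up before serving"
    return "serve"

print(reconcile(RoutingVersion(term=3, epoch=7), RoutingVersion(term=3, epoch=8)))
```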
With pull, push, and reconciliation, we can guarantee that a query always covers the complete data for the collection, even though the system is asynchronous.

That's the first problem. The second problem we're facing, as Hammad mentioned, is that we're dealing with a large number of collections, and even if we use Kafka or Pulsar, they have their own limits on the number of topics. That means we cannot simply assign one topic per collection; we have to do some sort of data multiplexing, assigning multiple collections to a set of topics. The challenge is: as the system grows and the number of collections grows, how do we reshard, reconfiguring the mapping of collections over topics? This is in general a pretty challenging problem, because the order of writes matters during reconfiguration: you don't want deleted data to show up again because of reordering, so you have to guarantee the order of writes even while changing the number of topics. We designed a protocol, inspired by prior work, to ensure that during a topic reshard or reconfiguration, the data observed by the indexing workers is ordered by configuration and by offset within the topics. This is critical for the correctness of the data and for the user experience.

Here's the high-level view of the protocol. It's solving two types of problems. One is the reconfiguration or reshard decision: the coordinator makes the decision, and once the coordinator decides a topic reconfiguration must happen, it has to be executed; this is similar to the prepare phase in the two-phase commit protocol. The coordinator then broadcasts the new topic list to the producers. When a producer receives the new topic list and configuration, it communicates with the consumers by writing a message called FIN to the old topics, indicating that it will no longer produce messages under the previous configuration. The consumers buffer messages from the new configuration, and once a consumer has received the FIN messages, it knows for which collections it will no longer receive messages under the previous configuration; it can then deliver the buffered messages to the indexing worker, after the messages of the previous configuration for that collection have all been delivered.

Here's an example. Two producers are responsible for two collections, and in the old configuration there is one topic: both producers produce to that topic, and the consumer consumes messages from it. Now we want to change the configuration to use two topics. The coordinator makes the decision and broadcasts the configuration change to all the producers. When a producer receives it, it sends a FIN to the old topic. Say producer one is responsible for collection one: once the consumer receives the FIN from producer one, it knows the old topic will no longer carry messages for collection one under the old configuration, which is very important, and it can then expose collection one's buffered data to the indexing worker. And once it receives the same message from producer two, it can expose the information for collection two.
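Here is a sketch of the consumer-side buffering just described. The `deliver` handoff and the message shape are hypothetical; the point is that new-configuration messages are held back until a FIN has been seen from every old-configuration producer for that collection, which preserves write order across the reshard:

```python
from collections import defaultdict

def deliver(collection: str, payload) -> None:
    """Hypothetical handoff to the indexing worker."""
    print(f"deliver {payload!r} for {collection}")

class ReconfigConsumer:
    def __init__(self, producers_per_collection: dict[str, set[str]]):
        # Producers we still owe a FIN, per collection, under the old config.
        self.pending_fins = {c: set(p) for c, p in producers_per_collection.items()}
        self.buffered = defaultdict(list)  # collection -> new-config messages

    def on_message(self, collection: str, payload, new_config: bool) -> None:
        if not new_config:
            deliver(collection, payload)   # old config: deliver in arrival order
        else:
            self.buffered[collection].append(payload)
            self._flush_if_safe(collection)

    def on_fin(self, collection: str, producer: str) -> None:
        self.pending_fins[collection].discard(producer)
        self._flush_if_safe(collection)

    def _flush_if_safe(self, collection: str) -> None:
        if not self.pending_fins[collection]:  # every old-config producer is done
            for payload in self.buffered.pop(collection, []):
                deliver(collection, payload)

c = ReconfigConsumer({"collection1": {"producer1", "producer2"}})
c.on_message("collection1", "write-A", new_config=True)  # buffered, not delivered
c.on_fin("collection1", "producer1")
c.on_fin("collection1", "producer2")                     # now "write-A" is delivered
```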
This is not going to work all the time; we need some assumptions. The assumption we need to make is that the consumer knows how many producers are producing to a collection, and that this cannot change during the configuration change. The way we enforce that is to disallow node addition and node removal during the reconfiguration stage, and we can also use some kind of static membership mechanism: in case a node crashes, we recover it into the same state as before. Under those conditions, we can perform the configuration change without blocking the producers and without blocking the consumers.

So this is a protocol we developed, and we're going to verify it with TLA+. Our philosophy is that distributed systems are all about reaching consensus and liveness with, I would say, minimum synchronization between components, so we need to make sure those protocols are right from day one. We don't want to run into situations where we patch, and patch again, a protocol that was never fully designed. So we make sure these information propagation protocols are designed and verified with TLA+ on day one, so the protocols are solid, and then we implement them.

The other thing we're doing around correctness is that even for single-node Chroma, we put a lot of effort into property-based testing and model-based testing. The idea is very simple; it comes from the lightweight formal methods work at AWS. We have a model for the metadata and a model for the data, with reference implementations: for example, an in-memory map for the metadata and an in-memory brute-force index for the data. We then create a state machine that randomly performs operations on both the reference implementation and the real implementation, and compares the results within certain bounds. This has been quite helpful in improving our product quality; we've fixed quite a few bugs with this kind of property-based testing and model-based checking. I think it also serves as a foundation as we build the distributed system, because we don't want the system to degrade quality or have correctness issues, and this gives us a solid foundation to move on.
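A minimal sketch of that model-based testing harness is below. The `TrivialIndex` placeholder stands in for the real index under test (which in Chroma would be an actual segment implementation), and the recall bound is an illustrative parameter:

```python
import numpy as np

class TrivialIndex:
    """Stand-in for the real index under test (e.g. an HNSW segment)."""
    def __init__(self):
        self.store: dict[int, np.ndarray] = {}
    def add(self, i, v):
        self.store[i] = v
    def delete(self, i):
        self.store.pop(i, None)
    def query(self, q, k):
        ids = list(self.store)
        dists = [float(np.sum((self.store[i] - q) ** 2)) for i in ids]
        return [i for _, i in sorted(zip(dists, ids))[:k]]

def model_based_check(real_index, n_ops=500, dim=32, k=5, min_recall=0.95):
    rng = np.random.default_rng(42)
    reference: dict[int, np.ndarray] = {}  # trivially correct reference model
    for op_id in range(n_ops):
        op = rng.choice(["add", "delete", "query"])
        if op == "add":
            vec = rng.standard_normal(dim)
            reference[op_id] = vec
            real_index.add(op_id, vec)
        elif op == "delete" and reference:
            victim = int(rng.choice(list(reference)))
            del reference[victim]
            real_index.delete(victim)
        elif op == "query" and len(reference) >= k:
            q = rng.standard_normal(dim)
            ids = np.array(list(reference))
            data = np.stack([reference[i] for i in ids])
            expected = set(ids[np.argsort(np.sum((data - q) ** 2, axis=1))[:k]].tolist())
            got = set(real_index.query(q, k))
            recall = len(expected & got) / k
            assert recall >= min_recall, f"recall {recall} below bound at op {op_id}"

model_based_check(TrivialIndex())
print("model-based check passed")
```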
Okay, next I'll talk about our roadmap. As Hammad mentioned, we want to be an end-to-end retrieval system for language models; vector search is only part of it, and retrieval by itself is not sufficient. If you think about building applications with language models, there are three components we think about. The first is the logic: the chain of prompts. The second is data, including retrieval as well as your structured data. The third, which is very important, is feedback and evaluation, so you can find the right strategies for chunking, embedding, and re-ranking, and end up with a production-grade language model application. So the roadmap is: we want to expand the API to allow user-defined functions for chunking, embedding, indexing, and re-ranking, and we want to allow more sophisticated retrieval strategies to be pushed down closer to the data.

Right now Chroma is focused on the vector search part, but later we also want to include APIs for chunking, embedding, storage, and re-ranking. In the following slides I'll talk through a few examples. The first example is retrieval evaluation. You have some query, some retrieval results, and the generation result, and there can be feedback around those results. We can save that feedback directly into Chroma, and it's useful in many ways: it can drive fine-tuning at different layers and different steps. For example, the feedback can be used to fine-tune the embedding model, it can be fed to an adapter matrix, and we can build re-ranking and filtering mechanisms on top of it. Based on this feedback, people can build better end-to-end systems.

Another thing Chroma aims to support with the pipeline API is letting users, or the system automatically, generate the best chunking strategy. Chunking is very important but often overlooked: it tells the system what information it needs to embed. There are many heuristic chunking strategies based on document structure, but the optimal chunking strategy depends a lot on the task and the dataset, and Chroma is experimenting with using language models and learned approaches to perform optimal chunking at data ingestion, based on that feedback. Chroma can also include the re-ranking model in the end-to-end retrieval stack: relevance really depends on the specific query or task as well as the underlying dataset, so after we retrieve the data, we can let users experiment with different ranking models to produce the most relevant results. And that's all, thank you.

I will clap on behalf of the audience. We have time for a couple of questions if the audience wants to go for it.

Connecting back to some of the discussion earlier: you mentioned that the vector sizes you see can go up to 4K, and that it's driven by the workload. Could you and Liquan tell us a little more about the diversity of the workloads you see? Obviously people get those vectors from some LLM or some other method. Is that what's driving the diversity in vector sizes, or is something else going on?

I think it's mostly coming from the large variety of models we're seeing in the open source landscape. Nowadays we're seeing video embedding models, for example; the 4K embedding models I mentioned are video embedding models, and those by their nature have a much higher dimensionality, but they also have a temporal nature to them, which makes them difficult to reason about sometimes. So the first factor is multi-modal embeddings, and the second is that there's a huge resurgence of open source embedding models, so people are pulling them off the shelf. These are often trained by taking a language model, removing some part of it, and fine-tuning at the end, or they're embedding models trained specifically for this purpose, and just the nature of the architecture leads to a much higher dimensionality. That said, we have seen that you can reduce the dimensionality quite a bit and get the same performance, so I do think there's a lot of headroom in dimensionality reduction when you have a specific task and domain in mind.
And I think that's very much where things are going, because it doesn't make sense to have an embedding model that's generically good at all of human language if all you're trying to do is code retrieval; it makes sense to fine-tune an embedding model down to a much smaller dimension and focus it on that narrow task. So I think we'll have these foundation embedding models trained on a wide corpus of data, and people will fine-tune them to their specific task, and that process can also do dimensionality reduction. And it's kind of silly how simple it is to do that dimensionality reduction: you really can just add a linear layer to the end of your embedding model and fine-tune it down to a smaller dimension, and get surprisingly good results with just simple binary classification as your training dataset. That's what we end up seeing a lot of people do.
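A sketch of that "just add a linear layer" recipe follows, with illustrative dimensions, a fixed temperature, and a simple binary relevance loss (PyTorch is used here only for the gradient step; this is not a prescribed training setup):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

proj = nn.Linear(1536, 256)  # project the frozen base model's 1536 dims down to 256
opt = torch.optim.Adam(proj.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

def train_step(emb_a: torch.Tensor, emb_b: torch.Tensor,
               is_relevant: torch.Tensor) -> float:
    """emb_a, emb_b: frozen base-model embeddings [batch, 1536];
    is_relevant: 1.0 for related pairs, 0.0 for unrelated ones."""
    a, b = proj(emb_a), proj(emb_b)
    logits = F.cosine_similarity(a, b) * 10.0  # fixed temperature, illustrative
    loss = loss_fn(logits, is_relevant)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

pairs_a, pairs_b = torch.randn(32, 1536), torch.randn(32, 1536)
labels = torch.randint(0, 2, (32,)).float()
print(train_step(pairs_a, pairs_b, labels))
```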
Any other questions from the audience? Oh, I think you're muted.

Sorry, two more questions. One is: how big do some of these databases get, in terms of number of vectors, in the larger applications you see? I'll wait for the answer before asking the next one.

Sure. The biggest datasets we've seen are right under a billion data points; that's the biggest dataset I've seen pushed at Chroma, but that's very much a tail case. The large majority of our data is on the order of several hundreds of thousands to several millions of vectors. The key part, again, is that they're growing quite frequently, both in terms of additions and in terms of mutations, updates and deletes, and that creates challenges for the underlying vector index and the algorithm you use. But several millions of vectors seems to be where a lot of these things converge, and I think that falls out of the fact that when you chunk a reasonable corpus of documents, say a hundred to a thousand documents, and process them, you end up somewhere in that regime. For most of the use cases we're seeing, be it chat-your-documents, where you have some internal knowledge base you want to chat over, or code retrieval, or augmenting a language model with knowledge about some new tool you're developing so it can assist users with that task, those corpora tend to be not so trivially small that you can brute-force them, but not so insanely large that we have to start thinking about billion-scale vector search on a single node.

Great, and the follow-up question: do you run into issues where someone has a long-standing application, where they're not just creating these vector databases for a short time but really building on one like a database, and they switch the model they're using for embedding? Maybe they switched from GPT-3.5 to 4. Do the users take care of migrating the database to the new embedding model, or do you provide tools for that?

Yeah, this is something we've been thinking a lot about. We don't have explicit tools to help with that today, but I think there are three flavors of problems here. The first is that the actual language model drifts over time without you knowing: if you're using an API service provider, the behavior of the language model might change on its own without you even knowing, and that can be a bit strange to monitor for. So I think monitoring tools are something the ecosystem is working on that might help here; there's a really good paper from Databricks where they measure, over three months, how the performance of a language model on a fixed benchmark of tasks changes over time. The second class of problems is, as you mentioned, you change the embedding model yourself, and then you have to manage these different versions of the same task workflow. The way we think about that is that it shows up both in experimentation and in valid production use cases: maybe it makes sense to A/B test two different embedding models online, perform a comparison, and see which performs better; maybe you fine-tuned something and want to run an experiment. In that case, we do think there needs to be a way for the system itself to handle that for you, and that's the sort of experimentation platform that could live inside of Chroma, though of course you could also manage it yourself. The way that ends up manifesting is that the fundamental data model of Chroma is a collection, and you just create multiple collections; the idea is, how can we allow those to share indices to start, and then, if you mutate or change them over time, do some sort of copy-on-write scheme? That's one potential approach: you don't need to duplicate all the data, just the data that matters for your specific variation.

The third is that people change the other strategies: not the embedding model, but the chunking strategy, or the re-ranking strategy, or they're actually just updating a document. A huge problem we see is: I took this document, broke it into 10 chunks, and then updated the source document. What's the best way to update the individual chunks without redoing a bunch of compute, in the case of very large documents? How to diff and re-embed only what changed is an active problem. So those are the three areas where we see drift, whether from embedding model or language model providers, or from the person using the database themselves, causing issues, and I do think whatever system emerges to solve these problems, hopefully Chroma, should make those things easy for you.

I just loved how many connections you made to papers; it was fantastic, loved it. Awesome talk. Any other questions from the audience?

I wanted to go back to the distributed protocol you're running to do the reconfiguration. Is this something specific to Chroma, or is it a limitation in Kafka, that when you do this reconfiguration you need to update a bunch of consumers that are feeding from it in a coordinated manner?

Yeah, this is a limitation of Kafka. I think there was a proposal about topic expansion in Kafka, and probably in Pulsar as well, but this is in general a pretty hard problem, in the sense that you need to coordinate the producer, the broker, and the consumer. So this is how we have to
build on top of the limitations of Kafka. At a high level, the intuition is this: assume you're doing event-time processing with out-of-order messages, where messages can arrive in a delayed fashion, and you want to compute, say, spend on an hourly basis, but events for an hour can arrive late, maybe 24 hours late. You basically need to keep all the windows open before you can send the right answer downstream. This is the paradigm of Flink or Spark Streaming: the idea is that you always output the correct answer downstream. The way to do that is to compute what they call frontiers, the lower bound on the event timestamps that can still arrive in the future. Once you know the frontier, then for everything before that timestamp you already know the answer and it will never change, so everything you send downstream is final. That's the high-level idea, and the way we apply it is to make sure we know all the producers producing to a collection, and to keep track of how many producers have already made the configuration change. Once the consumer is sure it has received acknowledgements from all producers that they are no longer producing data under the old configuration, it can safely send data downstream without violating ordering guarantees.

Yeah, I mean, how expensive would a stop-the-world be in this configuration?

In this case there won't be any big stop-the-world. We'll probably use two code paths, one for the normal case and one for the reconfiguration case. The only place you need something like a two-phase consensus is to make sure everyone switches to the configuration-change code path; after that, everything is asynchronous, as long as there's no node addition, removal, or crash during the change. Got it, okay.

And then, at the beginning of the talk you went through a bunch of different optimizations, and as Jignesh said, we appreciate you putting citations to the papers on the slides. Of all those optimizations, which one gave the most bang for the buck, in terms of engineering or even compute time? If someone were going to implement them, which should they start with first?

Yeah, I think the biggest one, if you're doing inverted indices, is what the SPANN paper calls closure clustering, where you assign a point to multiple centroids based on an adaptive condition. That changes inverted indices from something whose performance is very hard to reason about into something that works almost as well as HNSW; it was quite surprising to us when we implemented it. It's a very simple idea: you have points on the boundary, so you just assign them to multiple centroids. A very simple heuristic, but it changes recall performance by 20 to 30 percent in some cases, which is quite a large and drastic jump.