The Carnegie Mellon Vaccination Database Talks are made possible by OtterTune. Learn how to automatically optimize your MySQL and Postgres configurations at ottertune.com. And by the Steven Moy Foundation for Keeping It Real — find out how best to keep it real at stevenmoyfoundation.org.

Hi guys, welcome to another talk in the Vaccination Database Talks series. We're happy today to have Dr. Edo Liberty, the founder and CEO of Pinecone, a vector database company that he's going to tell us about today. Prior to that, he was a director of research at Amazon and the head of Amazon AI Labs. Before that, he was a senior research director at Yahoo, and prior to that he got his PhD from Yale University. As always, if you have any questions during the talk, please unmute yourself, say who you are, and ask — and feel free to do this at any time, so that Edo doesn't feel like he's talking to himself. And with that, Edo, the floor is yours. Thank you so much for being here.

Thank you. I appreciate the invitation, and I'm very happy to virtually be back at CMU — I've been there many times, mostly at the CS department, but nevertheless. I want to talk to you about vector databases, and this talk really isn't about their applications, their commercial viability, or their use cases; it's really about diving deep into how they're built, what's challenging about building them, and why it's interesting for an engineer or a scientist to care. I hope that makes sense. I will nonetheless start by saying what a vector database is, because otherwise it becomes very hard to dive deeper. We as a community — search engines and databases, by and large — have done something relatively simple, in the sense that the objects we care about are text, tables, integers, strings, maybe objects like JSON, and so on. By and large, relatively simple objects. And the questions we were able to answer efficiently were also relatively simple questions.
Which documents contain which words, which records satisfy some predicate or logical filter, and so on. These things are incredibly hard to build efficiently at scale — this entire seminar, and many hundreds of companies, are dedicated to doing exactly that — and I'm not trying to belittle it; it's incredibly important. But roughly 80% of the world's data doesn't fit that mold. If you have long pieces of natural spoken language, images, and so on, they don't fit the pattern. And if you want to answer complex questions — give me everything whose meaning is similar to something else — that's a complicated question, and it doesn't fit into a SQL query. Interestingly, an answer has emerged. We've wanted to do this for many years — it's not like people didn't want to answer complex questions 30 years ago; we just didn't really know how to do it. Today, we pair machine learning models, which create special representations of objects as vectors, with vector databases as the engine that drives search and retrieval over that type of data. That is basically the de facto winning paradigm for addressing those problems. So if you look at the adoption of different technologies — I'll take Apache Druid as a benchmark — on the right-hand side you see the number of GitHub stars as a function of time. GitHub stars are a pretty poor surrogate for adoption or usage, but a surrogate nonetheless. I take Druid because it's a cool technology that I like: it's a very good piece of software, it has very healthy adoption and a good community, and it's growing very healthily and supporting a lot of businesses and use cases.
If you look at something like Faiss and BERT — I'll explain in a second what they are — they started later and they're growing significantly faster. So what are people doing with them? They're taking sentences in English or any other language, using a machine learning model — in this case BERT, which is the yellow line — to convert each sentence to a high-dimensional vector, in this case 768 dimensions, and feeding that into a vector index — in this case Faiss, the Facebook AI Similarity Search library, which is the blue line on the right. And that pattern has repeated itself I don't know how many times across the industry, with different models and different vector indexes in different applications. So the open-source sphere has, in some sense, solidified around this being the right pattern. More than that, the bigger companies have already voted that this is by far the winning strategy. If you search on Google today with a long question — something a plain text search engine would give you very poor results for — what they're doing is searching by meaning, using vector search behind the scenes to actually retrieve answers and complex results. When you shop on Amazon, you might look for a full outfit and get good suggestions — and that's not because the words for an outfit were in the descriptions of the products; they just seem like a good match. The way that happens is that neither the products nor your query are represented as keywords or JSON; they're represented as high-dimensional vectors coming out of machine learning models. And if you browse Facebook, your feed is organized, again, based on your preferences and your behavior.
And that is again the same kind of pattern: you yourself and your behavior are encoded as a machine-learned representation of your preferences, and items are retrieved through a vector index. So this is going to be my one and only slide as a CEO. Pinecone is pioneering this kind of database in the SaaS category. We've built three conceptually separate parts of the database, even though they're closely related and of course operate together. First, there's the vector index itself, which needs to be fast and accurate, needs to support what's called filtering — we'll talk about that in a second — and needs to be very efficient, so you get a lot of hardware cost reduction, which of course we pass on to our customers; otherwise there's no point in having a database to begin with. Second, you have to make it a cloud database: you have to scale the whole thing, and care about ingestion, high availability, load balancing, separating storage from compute, horizontal scaling, replication, and so on. And finally, there's managing security, resource provisioning, access control, product management — the whole nine yards. We were recently called out by Gartner as a Cool Vendor, and a lot of venues have talked about our product and what it does for the community and businesses at large. For the rest of this talk I'm going to focus only on the top point — the vector index and the core of the database. I won't talk about the other parts; I know security is very important, but it's not a part of this seminar and I believe it's not what you guys would be most interested in.

That's great. The cloud part is just standard stuff — it sits on something like S3, EBS, right?

Yeah, it's more standard.
It's still a lot of work, but that's not where the secret sauce is. So, as a very rough mental model of what we're trying to achieve and what is possible with Pinecone today, before I dive in: this is Zeus, our director of product Dave's pet, and a really cute dog — I can attest to that, and not only Dave thinks so. Dave wanted, half-jokingly, to find Zeus's "siblings" — other dogs that look like him. So he downloaded a dataset of, I think, 200,000 images of dogs, inserted their embeddings — computed with a computer vision model — into Pinecone as the vector representations, and then searched for the top five results by similarity to Zeus's embedding. And the results were these other dogs from the collection. You can see that the similarity is in terms of the content and meaning of the image, not pixel similarity or some other metadata — the search is on the image itself. Makes sense? By the way, it's much more common to do this with text and language, but text and language don't lend themselves to nice visuals, so it's easier to talk about images. Okay. With that out of the way, I want to jump straight into what a vector index is and how it works. Let me start with a recap of Machine Learning 101. I think all of you have seen this mental model of a linear classifier that divides the space between positive and negative examples. You can think of the vector embeddings of the dog images as the features of the image. And you can think about training a linear classifier, which is basically these weights q_i and a parameter b, such that — say each image is represented by a 1000-dimensional vector, so you have 1000 coordinates x_i —
then if you sum up the dot product between this classifier q and your point x, and that sum is greater than the threshold b, you're on the upper side of the space, and maybe you classify the image as a white Labrador; and if you land on the bottom part of the plot, you categorize it as not a white Labrador. So that's a machine learning concept you can train. But as in databases generally, your job is to retrieve the things that classify correctly — that meet the predicate — efficiently, not to compute everything by brute force. I can easily figure out all the white Labradors by just computing the dot product of q with x everywhere, but I would have to scan through the entire dataset: d coordinates times n items, so n times d work. I can scan my entire dataset and figure out which point lies on which side of this hyperplane, but that would be incredibly heavy. So in vector databases, you have to get used to the idea that a query is a classifier. What do I mean by that? In a regular database, you basically have a predicate: it takes a record and returns a Boolean, and your job is to return everything that evaluates to true, very efficiently. In a vector database, you take a high-dimensional vector and return a Boolean. That pattern of a function is called a classifier — that's what classifiers do: they map vectors to Booleans. The Boolean is the label; it's a binary classifier. So your query really is a classifier, and the result of the query is everything that classifies as true under it.
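To make the brute-force cost concrete, here is a minimal numpy sketch of "query as a linear classifier" — everything here is illustrative (random data, made-up threshold), not Pinecone's API. Evaluating the predicate everywhere takes n·d multiply-adds, which is exactly the scan a vector index tries to avoid:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 1000            # n items, each a d-dimensional embedding
X = rng.standard_normal((n, d))

# A query is a linear classifier: weights q and threshold b.
q = rng.standard_normal(d)
b = 5.0

# Brute-force scan: n*d multiply-adds to evaluate the predicate everywhere.
scores = X @ q                 # dot product q·x for every point
matches = np.flatnonzero(scores > b)
```

Every id in `matches` satisfies the predicate; the problem is that getting there required touching every coordinate of every vector.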
You can't support fully general classifiers — that's a very hard problem — but you can deal with linear classifiers; with what are called cone regions, usually known as cosine similarity, where a cone means you pick a direction in space and look for everything within some angular distance of that ray; or with ball regions, which correspond to Euclidean distance — you put a point in space and take everything within some radius around it — which is also very common for similarity search. These are the kinds of queries you want to be able to answer. Does that make sense?

It makes sense to me.

Right. So note that right off the bat, we have completely abandoned SQL, and even NoSQL, and moved to a completely different domain where we deal with a very new kind of data. This is the core of what we're trying to do. Now, intuitively it's kind of hard to understand why this is hard at first, because our minds are used to two- or three-dimensional spaces, in which things like divide and conquer work really, really well. In one dimension, you split the space in half and you get two sections. In 2D, if you split every coordinate in half, you get four sections. In 3D, you get eight. And that's still very efficient — geospatial databases do exactly this; they partition the space geometrically. But if you look at the NLP example I showed you before, the output of BERT is 768-dimensional. If you count how many sections you create by splitting every coordinate in half, that's 2^768 — more than the number of atoms in the universe.
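The three tractable query shapes — linear, cone (cosine), and ball (Euclidean) — can be sketched as brute-force predicates in a few lines of numpy. The thresholds (0.2 angular similarity, radius 10) are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((5000, 64))
query = rng.standard_normal(64)

# Cone region (cosine similarity): everything within some angular distance
# of the query's direction.
cos = (X @ query) / (np.linalg.norm(X, axis=1) * np.linalg.norm(query))
cone_hits = np.flatnonzero(cos > 0.2)

# Ball region (Euclidean distance): everything within some radius of the
# query point.
dists = np.linalg.norm(X - query, axis=1)
ball_hits = np.flatnonzero(dists < 10.0)
```

Both are still O(n·d) scans; the rest of the talk is about answering these same predicates without scanning everything.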
So clearly the branching factor is completely ludicrous, and you can't really operate that way — any naive partitioning of the space is just going to blow up in your face immediately. As a kind of social proof that this is hard: there are tens or hundreds of algorithms out there, each with many internal parameters to tune, and they keep competing with each other on which one is faster or more accurate on which dataset. And the results really depend on the kind of data you have. That's another level of complexity: the specific dataset you have really dictates which kind of index you should be using. So for the next 20 or maybe 25 minutes, I want to go through the core ideas behind these algorithms and some of the thinking behind them — the internal workings of what a vector index even looks like. I'm happy to take questions. Going back to the early works of Piotr Indyk, Alexandr Andoni, and others — and in parallel Yuval Rabani, Rafail Ostrovsky, and others — they looked at the question of how to partition the space in a way that actually makes sense for the data and that we can reason about. And I think it was Moses Charikar who came up with the proof I'm going to show you, which I find fascinating. Think about two points in space — two vectors — with some angle between them; it doesn't matter what that angle is. Now think about what happens if you cut the entire space in half with a random hyperplane through the origin. If you look at these two vectors, you can always take them and align them on a two-dimensional plane — in some sense, that's the definition of a two-dimensional plane.
So we can always rotate those points to fit this drawing — this isn't a caricature; this is actually what's happening. Now ask yourself: where does a random hyperplane cut this two-dimensional plane? At a uniformly random angle. So the only question you need to ask is: how likely is it that this random cut splits those two points, so that they lie on different sides of the hyperplane? And the answer is exactly proportional to the angle between them. Of course, if they're exactly the same point, they always lie on the same side; if they're 180 degrees apart, they always lie on opposite sides; and if they're 90 degrees apart, it's 50-50. It's just proportional to the angle. Now you can boost this idea: what happens if I cut not once but, say, 10 times? Then each point is converted to a 10-bit hash value — one of 2^10 buckets — which bakes in this really nice property that I can mathematically calculate the probability of two points having the exact same hash based on the angle between them. It translates geometry into a hash-collision calculus. And you can boost that further: I won't hash every vector into just one bucket; I'll hash it into a bunch of buckets, and I'll boost the probability such that if two things are close enough — close enough by my own definition, angle-wise — they collide with high probability, which means I'll look for them in the same bins. That starts converting a geometric search into something like an inverted index — you should think of the "terms" as bucket IDs. So this is LSH, locality-sensitive hashing.
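The random-hyperplane hash above can be sketched in a few lines — a toy version of what's often called SimHash, with made-up dimensions and a tiny perturbation to stand in for "close in angle." Each bit records which side of one random hyperplane the vector falls on, and two vectors disagree on a bit with probability θ/π, where θ is the angle between them:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_planes = 128, 10

# n_planes random hyperplanes through the origin.
planes = rng.standard_normal((n_planes, d))

def simhash(v):
    # Which side of each hyperplane v falls on -> a 10-bit hash.
    return (planes @ v) > 0

a = rng.standard_normal(d)
b = a + 0.01 * rng.standard_normal(d)   # tiny angle away from a
c = -a                                   # 180 degrees from a

# Near-identical directions agree on almost every bit; opposite
# directions disagree on every bit.
agreement = np.mean(simhash(a) == simhash(b))
```

Because collision probability is a clean function of angle, you can stack several such hashes and tune how aggressively "close enough" vectors land in the same bucket.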
It's an incredibly powerful and mathematically beautiful idea. Alas, it's considered in some sense not great, simply because it's not very efficient — and it's not very efficient because it doesn't use the statistics and geometry of the given data: you cut stuff with random hyperplanes. We now know we can do better by clustering. If you take the data you have and cluster it into groups of points that are geometrically close to each other, you can say — this isn't perfect, but — take the blue point at the top left as a surrogate for all the points around it. If that center meets my condition, I go and search all the points inside the same cell, and maybe they conform as well. Note that this is already not perfect. At the bottom right, you see a center that meets the predicate, but some of the points in its cluster don't meet the requirement — in which case you've just wasted some compute. And in the cell next to it, the red one, some points should be returned but aren't — that's a recall loss. So things are becoming a little complex, because you now have the idea of approximation, the idea of loss. You have to get comfortable with the idea that this database is approximate, in the sense that it will not return some points that it should have. If you don't allow for that, you really can't get any acceleration or improvement at all. Okay — did I skip a slide? I think there's a slide missing here. I don't know what happened, but I'll talk over the missing slide. Now, note what we have to figure out here.
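The cluster-and-probe idea — often called an IVF (inverted file) index — fits in a short numpy sketch. Everything here is a toy stand-in (centers are just sampled data points rather than trained with k-means, and `n_probe` is made up), but it shows both the speedup and the recall loss the slide describes:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, k = 5000, 32, 16
X = rng.standard_normal((n, d))

# Toy IVF index: pick k centers and assign every point to its nearest
# center, building one "posting list" of point ids per cell.
centers = X[rng.choice(n, k, replace=False)]
assign = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(axis=2), axis=1)
lists = {c: np.flatnonzero(assign == c) for c in range(k)}

def search(query, n_probe=4, top=5):
    # Probe only the n_probe cells whose centers are closest to the query...
    order = np.argsort(np.linalg.norm(centers - query, axis=1))[:n_probe]
    cand = np.concatenate([lists[c] for c in order])
    # ...then scan just those candidates exactly.
    d2 = np.linalg.norm(X[cand] - query, axis=1)
    return cand[np.argsort(d2)[:top]]

q = rng.standard_normal(d)
exact = set(np.argsort(np.linalg.norm(X - q, axis=1))[:5])
recall = len(set(search(q)) & exact) / 5   # may be < 1: that's the recall loss
```

Probing all k cells recovers the exact answer; probing fewer trades recall for speed, which is exactly the approximation you have to get comfortable with.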
I have all the points as circles, and the X's are the cluster centers. I basically need to evaluate the geometric predicate — the classifier, in this case the red diagonal line in the center — for every center. And the question is, can you get away from that? Because you might have many, many centers — tens of thousands. The answer is yes: you can create a navigation graph over the points, such that you start from a random point in your data and advance to whichever neighbor gets you closer to meeting the requirement of the predicate — moves in the right direction, gets closer to the point you're searching for. And you can organize it in hierarchies, so you take bigger steps first and smaller steps later, with the smallest steps at the bottom — kind of like IP routing, or the postal service: you first get to the right state, then the right city, then the right neighborhood, and so on. Incredibly powerful, and a nice idea — it kind of generalizes a skip list. I can tell you that it has been performing really, really well in practice, but mathematically it falls short in a serious way: it does really well on some datasets and really horribly on others. Oh, that's the slide that was missing — okay, I don't know what happened, the slides just moved around. We can skip that now.

Just to be clear: you're showing us different ways you could do this, and explaining why each is efficient. Is that correct?

I'm explaining different ideas. They're not inefficient — they're all good ideas, all used in practice. Maybe I should have said this at the beginning of the talk.
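A flat, single-level version of that navigation idea can be sketched as greedy search on a k-nearest-neighbor graph (the graph construction and fan-out `m` here are illustrative; real systems like HNSW add the hierarchy of bigger-to-smaller steps precisely because plain greedy search can stall in a local minimum, which is also the mathematical weak spot mentioned above):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, m = 2000, 16, 8
X = rng.standard_normal((n, d))

# Toy navigation graph: connect every point to its m nearest neighbors.
norms = (X ** 2).sum(axis=1)
d2 = norms[:, None] + norms[None, :] - 2 * (X @ X.T)
neighbors = np.argsort(d2, axis=1)[:, 1:m + 1]   # skip self at position 0

def greedy_search(query, start=0):
    cur = start
    while True:
        # Move to whichever neighbor is closer to the query;
        # stop when no neighbor improves (a local minimum).
        cand = neighbors[cur]
        best = cand[np.argmin(np.linalg.norm(X[cand] - query, axis=1))]
        if np.linalg.norm(X[best] - query) >= np.linalg.norm(X[cur] - query):
            return cur
        cur = best

q = rng.standard_normal(d)
res = greedy_search(q)
```

Each hop strictly decreases the distance to the query, so the walk always terminates — but only at some local minimum, not necessarily the true nearest neighbor.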
I'm more interested in showing what exists and what's hard and interesting than in trying to tell you that what Pinecone is doing is right. I'm happy to discuss that, but for me it's a lot more illuminating to figure out what's hard, what's interesting, and what we as a community can try to achieve together. If you ask what Pinecone does — Pinecone does all of this. But nobody really knows how to do this correctly yet; we're all figuring it out together. We don't really have a good answer to pretty much any of these questions — and by "we" I don't mean only Pinecone, I mean the world.

All right. So, like you're saying, you're doing LSH and some other things. Okay. All right. Thank you.

Do some of these techniques work better at certain scales of data? Like, does LSH fail at one-billion-vector dataset sizes with certain dimensionalities, or does PQ work better? Do we have a sense of that?

The answer is yes and no. There are specific datasets where we know one does better than another by a big margin. But it isn't so cut and dried along the lines of high dimension versus low dimension, or many points versus fewer points — it's a lot more nuanced than that. So I can't give you a simple, one-size-fits-all answer to what makes one better than the other.

Thank you.

The techniques I showed above are mostly focused on reducing the number of points you search over — this would be the inverted-index portion of the database, the posting lists, if you will. You try to reduce the number of objects you look at.
Now we're changing gears: given that I've found a region — a cluster, a subset of points — that I want to consider, can I scan through it more efficiently? Instead of computing the full dot product with d floating-point multiply-adds, can I do less and somehow get away with discarding points as non-matches with a high degree of certainty? One of the standard methods is called product quantization. Say I take a 1000-dimensional vector and chunk it up into 100 sections of length 10 — 10 floats each. I do clustering separately on each of those 10-dimensional sections, and basically round every 10-dimensional section off to its nearest cluster center. Then, instead of storing the full vector, for each of the 100 sections I just keep the ID of the center it maps to. So now I have huge compression: instead of a thousand floats, I have a hundred ints — and those are short ints, maybe 8-bit ints. And the dot product can be computed incredibly efficiently: when the query arrives, for each section you compute the dot product of the query segment with each of the centers once, populating a small table of values; then the approximate dot product against any stored vector becomes a lookup and an add. And if the lookup table is tiny — if it fits in one of your smallest caches, or sometimes in actual registers — this becomes incredibly efficient. So you've converted the problem of float multiplication and addition into very efficient lookups and adds of floats.
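Here's a toy product-quantization sketch in numpy. To keep it small I use 64 dimensions split into 8 sections of 8 floats, and "train" each section's 256 centers by simply sampling data points (a real implementation would run k-means per section); the names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, n_sub, sub = 2000, 64, 8, 8      # 8 sections of 8 floats each
X = rng.standard_normal((n, d))

# One codebook of 256 centers per section (sampled, not k-means, for brevity).
codebooks = np.stack([X[rng.choice(n, 256), s * sub:(s + 1) * sub]
                      for s in range(n_sub)])          # (n_sub, 256, sub)

# Encode: each vector becomes n_sub one-byte center ids
# (8 bytes instead of 64 floats = 256 bytes).
codes = np.empty((n, n_sub), dtype=np.uint8)
for s in range(n_sub):
    seg = X[:, s * sub:(s + 1) * sub]
    d2 = ((seg[:, None, :] - codebooks[s][None]) ** 2).sum(axis=2)
    codes[:, s] = np.argmin(d2, axis=1)

def approx_dot(query):
    # Build per-section lookup tables <query segment, center> once...
    tables = np.stack([codebooks[s] @ query[s * sub:(s + 1) * sub]
                       for s in range(n_sub)])         # (n_sub, 256)
    # ...then every stored vector's dot product is n_sub lookups and adds.
    return tables[np.arange(n_sub), codes].sum(axis=1)

q = rng.standard_normal(d)
exact = X @ q
approx = approx_dot(q)
```

The approximate scores track the exact ones closely enough to discard most non-matches cheaply, while the per-vector work drops from d multiply-adds to n_sub table lookups.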
You can actually convert those to integers as well, which also helps, but that's getting too far afield. Interestingly, there is even further work on each of those sections. Each of the ideas I'm telling you about, by the way, carries tens of academic papers — sometimes hundreds — of experiments, refinements, and optimizations; these are broad ideas that people are actively working on. One other thing that is incredibly potent and important: we've been approximating these dot products and distances so we can answer queries faster, but note that not all results are created equal. If some point is very, very far from the center of the ball I care about, I might not care about gauging that distance accurately — I just want to know that it's too far for me to care about. You only need a good approximation for the points that might actually be part of the result set; otherwise the approximation is meaningless, because why would you care? That creates a very tight coupling between the query distribution and the point distribution. And if you work through the math, you find that the right thing to do is not to round points to their nearest cluster center — in fact, the representative shouldn't be at the cluster center at all; it should be somewhere else. Which is very surprising, but nonetheless true; and if you do it correctly, you improve the rounding and the statistical behavior of your index. And just to underline that this is work in progress — very active work, and when I say "we don't know everything," I really mean the world — this is a paper I'm publishing with collaborators
from Johns Hopkins, on how to do this at scale and use projective clustering to get these anisotropic allocations right and boost the results — you get something like a 5-10% boost over the previous best known result. Let me just check how I'm doing on time — okay, I need to accelerate a little. So, if you thought this was already getting a bit complex: obviously you now have to combine the two techniques. You search through your chunks of data, and within every chunk you compute more efficiently — you really have to combine the clustering with the acceleration from PQ and other types of compression. And that of course makes everything a lot more complex; there's a lot more to tune and get right. Then things get really messy with the fact that we don't only want to answer pure vector search queries — we have to figure out how to combine them with traditional SQL or NoSQL predicates. For example, if you're a retailer looking for items to recommend, that's great, but you don't want to recommend something that's not in stock. So you have to filter on hard rules: yes, I want the most relevant content, but it has to be, say, rated appropriately for the viewer, and so on. There are two basic approaches. One is to first filter with your traditional posting-list solution — but actually, what people have more commonly tried is post-filtering: do the nearest-neighbor search, return maybe a thousand results — even though you want only 10 in the end — and hope that after applying the filters to the thousand, you're left with at least 10 results. And that's what they run in production.
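The two filtering strategies can be contrasted in a toy numpy sketch — an illustrative `in_stock` flag stands in for arbitrary metadata, and both sides use a brute-force scan so the logical difference is easy to see:

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 5000, 32
X = rng.standard_normal((n, d))
in_stock = rng.random(n) < 0.05        # metadata: only ~5% pass the filter
q = rng.standard_normal(d)
dists = np.linalg.norm(X - q, axis=1)

# Post-filtering: take the top-1000 by distance, then filter, hoping at
# least 10 survive. With a selective filter you can come up short or empty.
top1000 = np.argsort(dists)[:1000]
post = top1000[in_stock[top1000]][:10]

# Pre-filtering: filter first, then scan every survivor exactly.
# Logically correct, but the vector index no longer helps at all.
surv = np.flatnonzero(in_stock)
pre = surv[np.argsort(dists[surv])[:10]]
```

Post-filtering can only ever return a subset of what pre-filtering finds, which is why the real answer is weaving metadata and vector constraints into one index.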
Needless to say, post-filtering is logically broken — oftentimes you're left with nothing. And more often than not, you're just wasting a lot of resources, because you've created a very heavy first step that isn't necessarily the right thing to do. Pre-filtering is the other approach: you start by filtering down to the subset of records that meet the logical predicate, and only look for nearest neighbors among them. But that nullifies the structure of the index — your index doesn't do anything anymore; you just have a raw set of vectors to scan, and that is horribly inefficient. So what we are doing, and what needs to be done, is to mesh the two together — metadata search and vector search woven into the same index structure, so they operate efficiently together. Let me switch gears a little to how the index is laid out on disk and in memory, and how vector databases break even more patterns of the databases that exist out there. First, we have geometric, content-based sharding and layout on disk. That's maybe a little like a geospatial database, but unlike the common time-based or random sharding, it's very data-dependent: which record or data point goes into which cluster, shard, or partition — and not "partition" in the usual database-partitioning sense. Second, the data itself is big. A record is not some small JSON document; it could be a thousand floats, with maybe a kilobyte of metadata attached.
That's the object you have to deal with. It's not blob storage — it's not megabytes — but it's not tiny either; these are mid-sized objects that you have to schlep around, and that breaks a lot of assumptions in other database design patterns. Then there's the indexing itself. Once you have those partitions and you want to build the index, it's not appending to an inverted index or building a B-tree or something like that — it's an incredibly computationally heavy process. You compute clusterings, and those are heavy enough that they're sometimes actually delegated to GPUs. Building the index looks a lot more like running a heavy machine learning training job than like a quick tallying of properties, or just sorting and organizing. And to top everything off, you unfortunately have to deal with OLTP-like point updates. It's not a pure-append workload — yes, there are a lot of append-heavy use cases, but in many use cases what you get is point updates, where you update the vector representation of an existing object. Think about a recommendation engine: when somebody views something, or rates something very highly, you want to update the representation of their preferences right away, so it's immediately reflected in the next things that get recommended. That update is an update to the vector representation — so it's a very update-heavy process.

What do these blocks represent — what do the different colors represent?

The different colors — think of them as parts roughly corresponding to those clusters. Okay?
These are, you know, think about them as chunks of geometrically co-located vectors that get co-accessed together. That's how you want to lay things out on disk. Make sense? Yeah, thank you. So. All right. So if this wasn't already breaking enough patterns: while the updates are OLTP-like, the queries are actually OLAP-like. The indexes in these vector databases are approximate. They don't actually pinpoint exactly what needs to be retrieved, but rather give you a set of candidates. And if you remember, I told you about these scan operations where you pinpoint a subset of the clusters you want to look at. That fraction is huge. You're not looking at one or two clusters; you're looking at maybe 5% or 10% of the data. And within that 5% or 10% you have to scan. So the query itself is very scan-heavy, very CPU- and compute-heavy. Okay. That's the first stage, which generates candidates; it doesn't pinpoint exactly the items you want to retrieve, but rather generates maybe thousands of candidate points. For these candidate matches you then actually have to go fetch the metadata and raw vectors: first to verify that they actually match the condition, discarding the ones that don't, and then to return the metadata and the vectors themselves to the user as part of the result. So there's a second stage, a multi-get, sometimes for thousands of records, and remember those are not small records. All of this in real time, and all of this needs to conclude in 50 to 100 milliseconds, on a lot of data.
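The two-stage query flow just described can be sketched end to end. This is a hypothetical toy implementation (the function and parameter names, including `nprobe`, are illustrative, borrowed from common IVF-style terminology, not from Pinecone's API): stage one scans a subset of partitions to generate candidates, stage two fetches metadata, verifies the filter, and re-ranks by true distance.

```python
# Sketch of an approximate two-stage query: probe the nprobe closest
# partitions (the 5-10% scan described above), then fetch + verify
# metadata for the candidates and rank survivors by exact distance.
import math

def query(q, centroids, partitions, vectors, metadata, predicate, k, nprobe):
    # Stage 1: pick the nprobe partitions whose centroids are closest
    # to the query, and take all their members as candidates.
    probe = sorted(range(len(centroids)),
                   key=lambda c: math.dist(q, centroids[c]))[:nprobe]
    candidates = [i for c in probe for i in partitions[c]]
    # Stage 2: fetch metadata, drop candidates failing the filter,
    # then rank the survivors by true distance and return the top k.
    verified = [i for i in candidates if predicate(metadata[i])]
    return sorted(verified, key=lambda i: math.dist(q, vectors[i]))[:k]
```

Note how the filter verification happens on the candidate set rather than up front or after the final top-k cut, which is the "woven together" behavior the talk argues for.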
I'll wrap up by just touching on the very high-level architecture, which I'm not going to dive into because this is becoming more standard: how you manage a fleet of such indexes and database instances on public cloud, separate storage and compute, and facilitate snapshotting, recovery, and dynamic rescaling, relying on a lot of really good cloud infrastructure that offloads many of the difficulties to other tools, so we can focus on the stuff that we actually do well while giving a highly performant and stable service. I'll wrap up by saying that we are not even remotely done. I started by talking about these linear similarity measures, dot products or cosine similarity; you can generalize this to any machine learning model, and we, the community, are not even close to being able to do that. Nobody really knows how to learn query distributions either. We know from traditional databases that this already helps: you can learn the skew of your queries and optimize the index to serve them better, especially in OLAP-like workloads, OLAP in the sense of aggregation rather than just scans. We can also retrain and refine, actually change the vector representation under the hood to produce better results, and we can measure that. Again, a wide-open area for research. Then there's improving the vector indexes themselves: there are tens and hundreds of published papers, algorithms, and open-source packages, and I can tell you that they all leave a lot to be desired. And we're not even close to being able to deal with things like concurrency and read-write
safety and lock-free structures. We're still very much in the early days, in the sense that people mostly care about the size of the index and maybe the throughput. In this community I don't need to sell the idea that that is not even remotely enough when you need to run something at scale, with concurrent reads and writes, in the wild. Of course there's optimizing storage hierarchies, especially now with the cloud: it's not only caches, memory, and disk; you have many types of disks at different distances from the machine, all the way out to S3 and blob storage, so you really have to be able to migrate between those gracefully. Specialized hardware: we have companies developing anything from GPUs to literally specialized chips that just do this, and figuring out if and when those should be used. Auto-embedding of complex data like text and images and so on. And the list continues; I could talk about each one of those for too long. That's it. I'll put in the plug that we are all very excited about this space. Thank you so much. Joining this call is Ram, our VP of engineering; I hope he'll be able to answer some of the hard questions when you all look at me. I think Greg, our VP of marketing, is also here. And that's it. Okay, awesome. Thanks so much. The logos there, are those your investors, or where you run, or how should I understand that? Yeah, I should have spent more time putting that slide together; it's more about where people on the team came from. Okay. Okay. Right, so I will open up to the audience here. Any questions? Unmute yourself and go for it. Otherwise I will be selfish and use all the time. Okay. So, you showed all these different ways to do vector indexes. You started with LSH and then a bunch of different things.
And then you said that you do all of the above. Does that mean you have some über data structure that incorporates techniques from all of these? Or, if someone says "build an index," you create all of them, and at runtime you figure out which one to use based on the query? How should I understand this? So, no. Right now we do two different things. First, we have our own index and data structure that provides a really good balance between all the properties that we care about. It's very highly optimized and has all the filtering and everything inside of it. We incorporate ideas and solutions from those literatures and algorithms, and have our own vector index with all the properties that we need. We also have integrations with open source, so that if somebody is really gung-ho about a specific index from Faiss, say, and they have some huge dataset where that happens to perform 10% better for them, and they don't need those other functionalities, then we can integrate that as the index under the hood. Does that make sense? But you have to choose up front; you don't change an index mid-flight or run many of them. So just to repeat what you're saying: you have your own proprietary index data structure you guys built, and you can also plug in the Facebook one or whatever else is out there. And then when someone says "index this dataset of dog faces," they have to declare ahead of time exactly which index they want to use? Right. And that is not yet available on the free tier, or actually not even on the non-free tier, the public offering.
This is right now more of an advanced feature that only very, very large customers can really leverage, again because it's one of those things where people know so little about how to operate these indexes, and what their choices actually mean, that it's incredibly easy to shoot yourself in the foot. And so we're trying to protect against that. Okay, we have questions in the chat. Steve, you want to go for it? Unmute yourself. Hello, can you hear me? Yes, go for it. Thank you for the talk. I wanted to understand how this is different from SciDB. SciDB also says they're a vector database; I wanted to hear your perspective. So, I have to say that I don't know SciDB's core, but I can take a quick stab. We aren't necessarily optimizing for point retrievals of vector coordinates and things like that. While we are a vector database, we are more about semantic search and similarity between vectors; that's what this whole thing is optimized for. So we are not optimized for retrieving, say, a few coordinates from vectors, or doing those kinds of scans. I think SciDB is also trying to do manipulation, like dot products, manipulating the matrix within the data system itself and pulling out the result. But I don't know whether they serve similarity-search use cases. Not yet, I think. Yeah. Thank you. You know, with this crowd this is going to come up; you've got to know the other databases out there. Yeah, that's why I have Ram here; I am more of a scientist, more on the machine learning and algorithm design side. So, Andy, I don't remember: was SciDB the one from Stonebraker? I think it was, yeah, okay, got it.
TileDB and SciDB are both from Mike, yes. And as far as I know, there's a company backing SciDB called Paradigm4. The project might be dead; I don't know, I never worked on it. But TileDB is active, right? They just gave a talk here a few weeks ago. Done. All right, Bipul, you want to ask your question? Is he still here? Sorry, Vero, sorry. Hey, hi. So, something I asked before: as a customer using your product, how do I make sense of what evaluation metrics to use, and how coupled is that to the ranking layer? Because ultimately the ranking layer defines what kind of impact a product can have. So I'm curious how you work with customers at that interface level. Yeah, I can take it, but Ram, do you want to take a stab? Yeah. By the way, Vero, just to quickly answer your other question as well: there's broadly LSH, the graph-based algorithms that Edo pointed out, and also product quantization and so on. All of these algorithms have different trade-offs in terms of accuracy, speed, and things like that. So when we create benchmarks, we try to create benchmarks that look not just at accuracy or recall at a particular K, but also at things like ingest throughput and query latency. Customers care about all three. The general problem here is that these algorithms' accuracy and recall depend on the data, on the characteristics of the dataset. So where we are right now: there's a large class of public datasets on which we can run benchmarks, and we do run them. One of the things we are starting to do is also run benchmarks on customer datasets. This is in progress; it's not there yet. But since customers do ingest data with us, we could run benchmarks for them. Right now we don't do that.
We kind of leave it up to them. We provide the APIs and so on, and they can run their own benchmarks. One of the things we are trying to do is automate that process, but we aren't there yet. So today we have a whole bunch of public datasets on which we run benchmarks. Does that answer your question? Sorry, so I think what you're referring to: there are some system metrics, but the metric you're referring to is recall at top K or something like that? Recall at top K, and a few other metrics that people care about. But maybe to add to what Ram is saying: what people know how to measure, and what they do measure, ends up in our opinion being severely lacking. There are read, write, and ingest speeds; there are a lot of different performance profiles that we really care about. Now, we know what customers really care about, but it's incredibly hard to replicate and to actually regress against. So that's the performance and behavior side. But there are also the metrics for how you evaluate the set of results. We are in some sense a search engine, in the sense that people get some set of results out and want to assess how good those results are. For that, people use recall as a surrogate for quality. But I can tell you that recall is actually a very poor surrogate for quality. So we also have another set of data-science, real-world metrics for how good the set of results is compared to the best that you could do. Again, the community has settled on recall, but it's a pretty poor surrogate for quality in many, many scenarios. Thank you.
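For reference, the recall-at-K metric being discussed is simple to state: the fraction of the true K nearest neighbors that the approximate index actually returned. A minimal sketch:

```python
# Recall@K: overlap between the approximate result list and the
# exact (brute-force) top-K, divided by K. Order within the top-K
# does not matter, which is one reason it can be a crude surrogate
# for result quality.
def recall_at_k(approx_ids, exact_ids, k):
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k
```

For example, if the exact top-4 is [1, 2, 3, 4] and the index returns [1, 2, 5, 9], recall@4 is 0.5 even if ids 5 and 9 are nearly as close as 3 and 4, which illustrates the talk's point that recall alone misses a lot.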
So this gets back to maybe the thing you mentioned before. I understand that you're doing this on private data, but someone can replace your custom data structure with the Facebook one. I guess this gets to the point of which one you should use. But even within the Pinecone custom data structure, I'm assuming there are a bunch of different hyperparameters you could tweak about how it slices and dices the data, right? Correct, but by design very, very few. That's a big part of what we try to do. You don't want to expose those hyperparameters to the users, because they have no idea. Right, I get that, but... We also ourselves don't want to rebuild many indexes just to figure out, oh, we used the wrong parameters, because we've trained the whole thing and tested it and now we need to turn a knob. That would also be very expensive, so the design of the algorithm itself is such that we are incredibly resistant to creating these arbitrary knobs. But your point is taken: with those open-source parameters, and even with our index, there must be at least something you need to tune. But I wasn't thinking you should run your own optimizer to figure these things out. For the indexes themselves, is the compute side the expensive part, or is reading the data the most expensive part of building the index? Go ahead. It's very compute-heavy, yes; it's compute and reading the data. And again, it really depends on the index. If you're trying to use HNSW, this is the multi-layer graph that I showed you.
That, on a million points, can easily run for an hour on a pretty beefy machine, no problem, just creating the index, whereas a throughput-optimized index might go through the whole thing in, whatever, a second and a half. Right, so this is the range of algorithms and options that you really have to consider. That's why I said it's really easy to shoot yourself in the foot if you don't know what you're doing. You can just swap the name of an algorithm and suddenly the whole thing grinds to a halt, or you get complete garbage out. Yeah. And what I'm getting at is: whatever makes sense, maybe Pinecone at some point exposes a single logical index interface to the customer, but underneath the covers maintains multiple physical ones at the same time, and there's some kind of internal optimizer that can decide, oh, for this kind of query the person is asking, I want this index versus that index. Exactly. Exactly. You guys want to do this, or you are doing this? It's hard. I mean, it's not built yet, but it's definitely something we do want to do, and keeping the API incredibly simple is definitely a high priority for us. Okay. So, after the index lookup, in some of your slides you show the index lookup and then you blast out a bunch of point-query lookups to the actual data itself. Is there anything special you're doing there? Or, because it's S3, maybe you do some batching, but there's not really much else, since you don't know what the layout of the data is on disk, so you can't really optimize for that? So we kind of treat those as two separate things.
For example, we treat the vectors and metadata for the last stage of the ranking as a highly optimized key-value store; that's how we lay it out. Whereas the index is optimized for scans. These are two separate things that work together, so you can think of the last stage as just a fancy key-value store. But that gives up the basic convention of locality between the two, right? Yes, and that's actually a great point. One of the things we are working on is sorted data structures within that store, so that you don't have to do a key-value multi-get but can actually do fast scans, because you'd be scanning a small amount of data at that point. Today we don't do that. We still manage to get reasonable performance just treating it as a key-value store, so to speak, because we are retrieving only about a thousand keys, but you're 100% right: we could get far more efficient with scanning if it were sorted. Okay. I get two more questions. One more pie-in-the-sky question: what would the interface look like, other than "hey, query this index"? You could have a declarative language that looks sort of like SQL, maybe using something like UDFs; or there's a thing from TimescaleDB, they have extensions to do that pipeline thing. You could imagine something like that for you guys as well, right? Yeah. I don't think SQL is the issue, meaning you can express this as SQL; in fact, we can express the nearest-neighbor search as SQL. The difference is only in how the index gets built and how the optimizer comes up with the physical execution plans. But yeah, I think everything we talked about here could be expressed as SQL.
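The sorted-fetch idea Ram mentions above can be sketched in a few lines. This is an illustrative toy (not Pinecone's implementation): if candidate ids are sorted to match the store's key order, runs of adjacent ids can be coalesced into a few sequential range reads instead of ~1000 random point gets.

```python
# Sketch of coalescing a candidate-id multi-get into range scans:
# sort the ids, then merge ids whose gap is at most max_gap into one
# contiguous (start, end) read. Fewer, larger sequential reads tend
# to beat many random point lookups on most storage media.
def coalesce_ranges(candidate_ids, max_gap=1):
    ids = sorted(candidate_ids)
    ranges = [[ids[0], ids[0]]]
    for i in ids[1:]:
        if i - ranges[-1][1] <= max_gap:
            ranges[-1][1] = i          # extend the current run
        else:
            ranges.append([i, i])      # start a new range
    return [tuple(r) for r in ranges]
```

So a candidate set like {5, 1, 2, 3, 9} becomes three reads, (1, 3), (5, 5), and (9, 9), rather than five independent gets; with real cluster-aligned ids the compression would be far larger.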
Actually, maybe I take back what I said: if you give people SQL, they ask, "hey, why can't I run this," whatever random SQL function they want, on your data. They treat you like something you shouldn't be, and you get dragged down a road you shouldn't go down; you should just focus on the exact thing you guys are good at. So yeah, maybe SQL isn't the right thing. Yeah, exactly, that's exactly why we didn't go the SQL route: it opens the database up to so many other things that we don't necessarily want to be good at. But then again, the big data warehouses, the Snowflakes of the world, could say, "okay, now we support these vector embedding indexes on your existing Snowflake data." But maybe, because it's so computationally expensive to build, and you need somebody like you who has been working on this for years to know how to make sense of it, maybe they're not a threat? Yeah, 100%. I don't think it's a threat, in the sense that, as you guys believe, there's a reason why we have a lot of different specialized databases. If you have, whatever, 10,000 points, a small chunk of data, it doesn't really matter what you do. You can put it in any database, you can use a flat file, you can do whatever the hell you want; you don't need a database. When you get to a billion, you're going to see the difference: it's going to be 200 milliseconds versus 30 seconds, or an hour, I don't know, it's going to be heavy.
I would say the fact that you'd never heard of SciDB, even though you're the founder of a vector database startup, means it's still very early in this market space. So for now you're not on Amazon's radar, which is a good thing. Okay, so my last question before we wrap up: is there any sort of deep background or story behind the name Pinecone? It's not very deep. If you look at the logo, maybe it'll explain it. Yeah, I get that, but if it's just a logo, that's fine. No, it's not just that. I find pinecones to be geometrically appealing: both complex and simple at the same time, and abundant and obvious at the same time. There's something about the complexity and the simplicity of it that felt like what we're striving for, but it's not deeper than that. No, that's a great answer. You need to hammer it up, build it up a bit more. I'm not saying you go around like a stoner, "hey man, you really look like a pinecone and your mind's gonna be blown," but I would pitch it that way: they're deeply complex but really simple to look at. Everyone understands a pinecone, but they're super hard when you actually think about it. That's how I would go.