Hi, everyone. My name is Chang, and I'm the co-founder of LanceDB and co-creator of the Lance data format. It's been a lifetime since I worked on pandas. I'm really glad to see pandas 2.0, Arrow, and also Polars becoming really popular. It's wonderful to see the Python data community still growing and taking on new things. Since that time, I've been focused a lot on MLOps and experimentation, and that's what led me to this: vector search and columnar formats for unstructured data. If you like to talk shop, you can find me on Twitter or GitHub under the same handle, which is also my AIM screen name from the last century. I think the core pain point I've seen in this space for generative AI is that we're missing a storage layer. What I mean by that is, if you break the ecosystem down into analogies to computer components, the models themselves are like the CPU, where the heavy processing gets done. The different frameworks, from LangChain to Marvin, are like the motherboard: everything plugs into it, and it provides an interface that holds everything together. But when it comes to the storage layer for vectors and for the raw data, especially multimodal data, I would argue there's no great solution. So how is this possible, you ask? If you're paying attention to the generative AI space, you see lots of options. It seems like every week a new vector database is released, and every other week a traditional database adds some sort of vector index to its offering. So it's not that we're missing options for doing vector search and storing data for generative AI; we're missing the right tools, because a new era of technology has different use cases and different access patterns, and it demands different data infrastructure. If you look at the space, it divides in two. On one hand you've got pure-play vector databases: your Pinecones, your Weaviates, your Qdrants, your Chromas. On the other hand you've got traditional databases like Postgres, MongoDB, and Elasticsearch that are adding on a vector index. In the former category, what's lacking is that they tend to deal effectively only with vectors and only with vector search. Traditional databases that just slap on a vector index tend not to scale well, because the vector index isn't architected the way the rest of the database is, so they can't deliver the same latency-versus-recall characteristics. And in general, vector search is much more CPU-intensive than traditional OLTP workloads. An analogous example: when I was at Tubi TV, the real-time ads engine relied on a MySQL instance that stored all of the user metadata, and the data science team wanted to analyze that data for ads optimization, retention, and things like that. So when I wasn't looking, they asked that team to open up a port so they could send queries to it, and then they promptly sent a massive analytical query to that database and brought it down. The downtime wasn't long, but it was a meaningful amount of dollars lost for the company. I think you'll see similar issues as folks try to take something like pgvector into production: capacity planning and the like go out the window when you add these much more CPU-intensive workloads.
And for both of these options, nothing really deals with multimodal data very well. If you've got images alongside text, audio, and 3D point clouds, there's really no good solution for storing and retrieving that. On top of that, for generative AI, data versioning, schema evolution, and reproducibility become very important, and in both of those categories it's really difficult to do rollbacks without blowing up your storage budget, or to trace where your results came from. All of these characteristics cause a lot of pain for the end user, which is the new category of AI engineers. Number one: experimentation is too expensive. From dataset to dataset and use case to use case, serious practitioners trying to build production-quality apps find that the optimal retrieval method differs. Some might need vector search, others might need keyword search, and yet others might need a combination of the two. With vector databases, users often have to switch data infrastructure just to test a new idea, which is the opposite of what you want. If you're a data scientist or an AI engineer with a brilliant idea, you don't even know yet whether it's going to work; if you first have to do all the data engineering to move data from one database to another, iteration becomes really slow. Number two: there's no single source of truth. As I mentioned, vector databases don't deal effectively with the raw data, and really not with metadata either. So in production, what I've seen, especially for large-scale use cases, is that users often have to maintain multiple data stores and engines: one vector store, plus one full-text search index, plus a raw data store. And you have to hope to God that your data pipeline updates all of those at the same time, so that a single request can actually be threaded through all three successfully. Number three: almost all of the vector indexing techniques today hold everything in memory, which is incredibly expensive and really hard to scale. If we do some back-of-the-envelope calculations: a billion embeddings from ada-002 is roughly six terabytes of space stored as float32s, since these are 1536-dimension vectors. RAM at this scale is roughly 100 times more expensive than SSD. You can get six, eight, or ten terabyte SSDs, or multiple volumes, very easily; it's hard to imagine a single node holding that much RAM. On a related note, most of the vector databases today have forgotten what we've done in data warehousing over the past two decades, which is the separation of compute and storage. If you go to these vector database websites and price out the offering, a billion vectors can cost something like 50K USD a month, and a lot of the time that's even if you're not sending a single query; it's just to store the vectors. If we separate the two: storing this on S3 costs about $138 per month, or about $270 per month if you need faster SSD storage, and for compute you can throw in a pretty beefy node for about $300 a month.
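As a rough sketch of that back-of-the-envelope math in code (the per-GB prices are assumptions picked to roughly match the monthly figures quoted above, not live cloud pricing):

```python
# Back-of-the-envelope math for storing one billion ada-002-sized embeddings.
num_vectors = 1_000_000_000
dims = 1536              # e.g. OpenAI text-embedding-ada-002
bytes_per_float = 4      # float32

raw_bytes = num_vectors * dims * bytes_per_float
print(f"raw vector data: ~{raw_bytes / 1e12:.1f} TB")           # ~6.1 TB

raw_gb = raw_bytes / 1e9
s3_per_gb_month = 0.0225   # assumed object-storage price
ssd_per_gb_month = 0.044   # assumed block-storage (SSD) price
print(f"S3 storage:  ~${raw_gb * s3_per_gb_month:,.0f}/month")  # ~$138
print(f"SSD storage: ~${raw_gb * ssd_per_gb_month:,.0f}/month") # ~$270
```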
So at the same time, it seems like we've failed to serve the new use cases in generative AI well, and we've also forgotten a lot of the lessons learned from data warehousing over the past two decades. These are the pain points we're trying to solve with LanceDB, which is a new open-source, serverless, and, we'd say, developer-friendly vector database. Our users choose LanceDB for four reasons. One is zero ops: there are no servers to deploy, and it runs in-process like SQLite or DuckDB. You just pip install lancedb and you're good to go. Two, you can store vectors alongside your documents, metadata, and more, which gives you better end-to-end performance and much simpler code. Three, the types of queries you can run against your data are much more flexible in LanceDB: in addition to vector search, we support keyword search, and we also support running plain SQL, which is great for high-quality in-context learning. And the last part is not just scalability, but cost-effective scalability. As far as I know, we're the only vector database that can support real-time vector search at billion scale with just a single node, which some of our users already have in production. All of this is enabled by the Lance columnar format. But before I dive into the details, I just want to make this a little more real and show you exactly what I mean. For this demo, I'm going to build a multimodal search app over the DiffusionDB dataset from Hugging Face. This is a generative AI multimodal dataset: you've got images in one column, you've got a prompt, and then you've got a bunch of metadata columns throughout the dataset. Right away, this is not something that other vector databases can deal with effectively, but you'll see that in Lance it's pretty simple. What I've made, and we can walk backwards through it, is a small Gradio app that uses LanceDB and CLIP embeddings to search over the images and also the prompts. For example, if I type something like "portrait of a person", what happens is I call CLIP to embed the query, and then I use LanceDB to search for that embedding over the image column, take the top nine results, and turn them into a pandas DataFrame. You can see these results come from searching over the images themselves, as encoded by the CLIP model. Now, if you remember, the prompt column here is strings, so I can search over the text as well, with something like, I don't know, "Ninja Turtle". As you can see, this is searching directly on the text, so all the prompts include some form of Ninja Turtle. This one is "portrait of Jennifer Lawrence as a Teenage Mutant Ninja Turtle", which I swear was not made by me. And then of course we can also search with plain SQL. Let me see how I'm doing on time here. Okay. So I can just throw a SQL statement in here and, uh-oh, let's see what I did there. No error message, interesting. Let me try to debug this. Okay, let me try without any of the conditions. So there's something wrong with the WHERE clause, but as you can see, what we're adding is basically the ability for you to run plain SQL on this data.
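As a rough sketch of what those search calls look like in code (the connection path and table name are illustrative stand-ins, and the demo may wrap CLIP differently than this):

```python
import lancedb
from sentence_transformers import SentenceTransformer

# Illustrative sketch of the demo's search calls, not the exact notebook code.
db = lancedb.connect("~/datasets/demo")          # in-process, no server to run
tbl = db.open_table("diffusiondb")               # image, prompt, and metadata columns

clip = SentenceTransformer("clip-ViT-B-32")      # CLIP text/image encoder
query_vec = clip.encode("portrait of a person")  # embed the text query

# Vector search over the image embeddings, top nine, back as a pandas DataFrame.
df = tbl.search(query_vec).limit(9).to_pandas()

# The prompt column is plain strings, so keyword search works too
# (assuming a full-text index was created on that column beforehand).
df_text = tbl.search("ninja turtle", query_type="fts").limit(9).to_pandas()
```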
And in production, when you look at all of the demos out there, it's very easy to just use vector search, not think too hard about it, and get to a compelling demo with maybe 75 to 80 percent retrieval quality. But to get into production, that's not enough. Often what people end up doing is running multiple retrievers that each return different results, with some sort of re-ranker on top of that, and it's much easier to do all of that with one data store rather than two or three. All right, let's get back to what's under the hood. That was a little bit of a demo; oh, wait, sorry, just to show you the code. The Gradio app was generated by this notebook, which is part of the repos; I'll have the link later. Essentially, you download the data from Hugging Face, and we've got a pre-prepared Lance dataset you can download and create a table and index on. When you open it, you also create an embedding function using the CLIP model from Hugging Face. Then each of the search tabs basically just calls the LanceDB API, which is .search on a text query or .search on an embedding, or, in this case, running SQL using DuckDB, but you can plug Lance data into pandas or Polars and more. And then there's a pretty simple Gradio interface. All right, so how are we doing all of this? What's different about LanceDB is that it's backed by the Lance columnar format. The Lance columnar format is an alternative to Parquet for AI data: that means large blobs, that means nested data structures, and often this data lives in cloud storage rather than on-prem in data centers. How Lance differs from Parquet and other columnar formats: one, it's good for fast scans and also fast point queries. This is done through a better layout and an optimized IO plan, and we've also handcrafted a lot of SIMD code to make indexing, and especially the encoding, faster. In terms of features, Lance is not just a file format like Parquet, it's also a table format. So we can support schema evolution and versioning essentially for free; call it zero-copy, because when you add columns or rows, you don't have to create a snapshot and copy the data that was already there. For tabular data that copying is annoying but not a deal breaker. But if you have a petabyte of images and you add a column, you don't want to end up with two datasets, one that's one petabyte and another that's one petabyte plus a gigabyte. It turns out that fast point queries give us the ability to add rich indexing, and that's how we're able to build a vector database effectively on top of this storage layer. And of course, Lance supports all of your cloud storage options. In terms of ecosystem integration, Lance's primary interface is Apache Arrow: Lance is on disk, Apache Arrow is in memory, and everything else plugs in on top. The Lance core is Rust-based, with a Python wrapper on top of that and an Apache Arrow interface, so anything that is Arrow-compatible can automatically work with Lance data. That includes pandas, DuckDB, Polars, PyTorch, TensorFlow, and also Spark and Ray. So today, the Lance format is great for vector search, but it's also great for much more than that.
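To make that Arrow integration concrete, here's a minimal sketch assuming a Lance dataset on disk; the path and column names are illustrative, and DuckDB is picking up the Arrow table by its local variable name:

```python
import duckdb
import lance

# Open a Lance dataset (illustrative path) and materialize it as an Arrow table.
ds = lance.dataset("diffusiondb.lance")
arrow_tbl = ds.to_table()

# DuckDB can scan Arrow tables that are in scope by name; column names here
# are placeholders for whatever the dataset actually contains.
df = duckdb.query(
    "SELECT prompt, seed FROM arrow_tbl WHERE seed IS NOT NULL LIMIT 10"
).to_df()

# The same Arrow data drops straight into pandas as well.
pdf = arrow_tbl.to_pandas()
```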
It's used at autonomous vehicle companies for large-scale data mining. It's used at generative AI companies for their training data lake. And it's also used in e-commerce and consumer applications for search: vector search, and also keyword search and things like that. Now, I want to leave some room for questions, so I'll go through these a little quickly, but let's take a quick look under the hood of the Lance format. What are the big differences between how the Lance format is designed versus, say, Parquet? One thing is variable-width binary data, which is things like your strings, where each string has a different byte length. The conceptual representation is that you've got the data laid out, and you've got offsets that indicate where one string ends and the next string starts. In Parquet, the offsets and data are interleaved, and that's why you can't do fast random access: you have to read out the entire row group to know exactly where one string is. In Lance, we've separated those two, so you've got a data array and an offsets array. You can amortize the cost of reading the offsets, and that gives you the ability to read just one particular value. Now, if all you have are short strings and short variable-width binary data, it turns out this isn't such a terrible problem, but if you have AI data, especially images or large blobs like point clouds, this is a deal breaker if you're using Parquet. Vector search, as I mentioned, is implemented using the file format plus an indexing structure that is part of the dataset but not part of the data file itself. We use that for ANN indexing, and there's an Arrow-based API. What makes LanceDB's vector index different is that it's NVMe- or SSD-based, so it's much cheaper and much more scalable. You can find a lot more detail on our blog, blog.lancedb.com. I already talked about versioning and schema evolution: essentially, they let you append data, and if you find that your retrieval quality is not what you want, it's very easy to roll back to your previous state without experiencing any downtime. All right, I think we're almost at time, so here's what you want to remember. One, you don't want to just take any data infrastructure and slap together your tech stack. The right tools should make things easier for developers, not harder; they should be scalable, performant, and cost-effective, all at the same time. The two things I've introduced are LanceDB, the vector database, which runs in-process, is developer-friendly, is very performant, and scales up very well in a cost-effective manner, and the Lance columnar format, which makes all of that possible and is an alternative to Parquet for AI. If you want more details, come to GitHub: github.com/lancedb is our GitHub org, lance is the repo for the columnar format, lancedb is the open-source vector DB, and all of the examples, including the one I showed today, are in vectordb-recipes. We're also organizing a sprint this weekend, so I'll prepare a bigger-scale demo, and there are lots of fun things to hack on in the Lance columnar format. So thank you.
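As a minimal sketch of the append-and-rollback workflow mentioned above, assuming the standalone lance Python package; the path and toy data are made up:

```python
import lance
import pyarrow as pa

# Write an initial dataset, then append to it; each write creates a new version.
uri = "example.lance"
lance.write_dataset(pa.table({"id": [1, 2], "text": ["a", "b"]}), uri)
lance.write_dataset(pa.table({"id": [3], "text": ["c"]}), uri, mode="append")

ds = lance.dataset(uri)
print(ds.version)                      # latest version after the append
print(len(ds.versions()))              # version history kept by the table format

# If the appended data hurt retrieval quality, read an earlier snapshot:
# no copy of the data, no downtime.
old = lance.dataset(uri, version=1)
print(old.to_table().num_rows)         # rows as they were before the append
```

Each write creates a new version of the table, and earlier versions stay readable, which is what makes the rollback cheap.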
Okay, now we have time for questions. Any questions? Go ahead. Yeah, so there are some differences. One is that other vector databases store their indices entirely in memory, whereas we are disk-based and essentially only hold the partitions that are being searched over in memory. The algorithms might also be a little different: the index our users use the most is the IVF index, and we also have DiskANN implemented, whereas among the other vector databases I think the most popular choice is HNSW. Any more questions? I've got a question, if I can ask. You mentioned that the Lance columnar format is mostly disk-based, is that correct? Yeah, it's a disk-based format. So do you use some sort of technique for loading part of it into RAM and caching it that way for faster access? Yeah, so the IO plan is actually different from traditional OLAP engines in that it tries to take advantage of more parallelization. The idea is that if your data is on cloud storage, the throughput is actually really good, but the latency really sucks. So you want your file-access dependency graph to be very wide but shallow, so that you don't have too many serial dependencies between different API requests. Those are the kinds of things we do to make Lance more performant. As for caching, a lot of the metadata is cached in memory, and the offsets arrays for various things are cached in memory as well, so that cost can be amortized. Thank you. Any more questions? We still have a little bit of time. Cool. Awesome. All right, I'm going to finish the video. Thanks very much. Thanks very much.
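A purely illustrative footnote on the "wide but shallow" IO plan from that answer: this is not Lance's actual Rust scheduler, just a generic Python sketch of fanning out many independent range reads in parallel so that object-storage latency hides behind throughput. The read_range helper and its parameters are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def read_range(path: str, offset: int, length: int) -> bytes:
    # Stand-in for a ranged GET against cloud storage (hypothetical helper).
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)

def fetch_ranges(path: str, ranges: list[tuple[int, int]]) -> list[bytes]:
    # One wide fan-out of independent reads, with no serial chain of
    # dependent requests between them.
    with ThreadPoolExecutor(max_workers=32) as pool:
        return list(pool.map(lambda r: read_range(path, *r), ranges))
```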