Let's get started then; I think people have stopped milling about. Thanks for coming, I'm excited to be here. My name is Chang, and we're going to talk about columnar formats for AI, two things that don't usually go together. Since there are only a few of you: have you all heard of, or worked with, Parquet and columnar formats in general? OK, everyone. And has anyone not used ChatGPT at this point? OK, cool.

Really quickly about myself: I'm the CEO and co-founder of LanceDB, the company behind the open source Lance columnar format. I've been making data science and machine learning tooling for almost two decades at this point. Once upon a time I was one of the original co-authors of the pandas library, so if you're familiar with data tools, you've probably typed import pandas as pd. Most recently I was VP of Engineering at Tubi, where I focused on recommender systems and MLOps, and that's what put me on the path to creating this new columnar format. What I noticed at Tubi, and what my co-founder noticed at Cruise at the same time, is that Parquet, ORC, and existing data formats in general are not a good fit for AI. Parquet is about ten years old at this point, and it has done great for analytics and for managing tabular data. But for machine learning and AI, lots of things are different now: the data is very different, the workloads are very different, and in the ten years since Parquet's inception, storage technology has progressed a lot as well. So the question we have to ask is: what does this mean? Can we build on top of Parquet, or do we really need a new format?

First off, the data has become really different. Individual values have become much bigger: beyond floating-point numbers and integers, we now have embeddings, long-form text, images, videos, point clouds, and so on. Schemas have also become much more nested: a single metadata column for an image might combine the patches within that image, bounding boxes for each patch, labels for each bounding box, object identifiers, and so on (there's a small schema sketch after this paragraph). And finally, tables for machine learning have become a lot wider: it's not uncommon to have hundreds or even thousands of columns that are all features for your model.
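To make the nested-schema point concrete, here's a minimal PyArrow sketch of the kind of annotation column described above. The field names and sizes are purely illustrative, not from any specific dataset:

```python
import pyarrow as pa

# One row = one image; the "annotations" column nests patches -> boxes -> labels.
annotations = pa.list_(
    pa.struct([
        ("bbox", pa.list_(pa.float32(), 4)),  # bounding box per patch
        ("label", pa.string()),               # label for that bounding box
        ("object_id", pa.int64()),            # object identifier
    ])
)

schema = pa.schema([
    ("image", pa.binary()),                      # raw image bytes
    ("embedding", pa.list_(pa.float32(), 768)),  # fixed-size embedding vector
    ("annotations", annotations),
])
```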
Workloads are also very different. Analytical use cases really focus on scans: you scan one column to do the filtering, scan the other columns for the projection, and return a subset of your data. AI needs random access as well: for training, for vector search, for exploratory data analysis, for active learning; the list goes on. So we're moving from the world of BI into AI, and Parquet and ORC, by design, don't support good random access performance. As a result, the engines built on top of Parquet, which include Iceberg, Delta Lake, Hudi, and so on, often just don't have the common AI transformations that you need.

And lastly, we've seen an explosion in storage technologies. Have you seen the recent announcement by AWS of S3 Express? I'm super excited to try it out. There are some Arrow Rust issues integrating with the new API, but once that's done, I think for a lot of these vector search use cases we'll be able to store the data directly on S3 Express and get much better performance. Even without these managed solutions, if you just look at storage like NVMe, IO bandwidth roughly doubles every three years; it's like the new Moore's law for storage, and drives have also become a lot faster. When Parquet was designed, cloud object storage wasn't nearly as ubiquitous, so a lot of its assumptions were made for a much older generation of storage technology. There's a recent paper by folks at Tsinghua and CMU, together with Andy Pavlo and Wes McKinney, that looks at the performance of Parquet and ORC on SSDs and S3 for representative machine learning workloads. If you look at the timings on vector index search for Parquet and ORC, you're looking at high seconds to minutes of response time, which is kind of ridiculous; you're not going to get interactive responses from that.

So really, we need something new. If you look at the code bases of the different implementations, there's a lot of baggage and design debt accumulated over the years, and very fundamental changes are required; modifying the existing implementations carries too much overhead, so development speed would be very slow. On the other hand, migration is now super easy. It used to be very difficult to create a new data format, because you had to integrate with every tool out there. But Apache Arrow has become a standard, and every query engine and data framework now integrates with Apache Arrow. So a new format only has to integrate with Arrow, and migration is literally two lines of code: one to read into Arrow, one to write back out in the new format (there's a two-line sketch below). It also means you're not locked in: if you want to try another format, the reverse migration is also two lines of code. And the tensor formats out there, like TFRecords (you've used, or at least heard of, TFRecords, right?), are also not very good. They're optimized for tensor storage and for training, but it turns out that with AI you still need OLAP-style queries: if you have a large number of tensors or images, then to do training you also need filtering and sampling, and the tensor formats are not very good at that.

OK, so we know that Parquet and ORC are not very good at these new workloads, and we've seen that it wouldn't be very productive to try to modify their existing implementations. That's why we built Lance. Lance is essentially two things. One, it's a new columnar file format: columnar storage with fast point queries, analogous to Parquet. Two, it's a table format, kind of like a lightweight version of Iceberg, Hudi, or Delta Lake, that gives you lightweight transactions, lets you integrate secondary indices, and manages versioning and time travel on top of the Lance files. Overall, Lance is intended to be an open data platform for you to build on; it's a lot more than just vector search, and there's zero vendor lock-in.
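To make the two-line migration concrete, here's a minimal sketch using PyArrow and the Lance Python package; the file paths are hypothetical:

```python
import lance
import pyarrow.parquet as pq

# One line to read the existing Parquet data into an Arrow table...
table = pq.read_table("pets.parquet")

# ...and one line to write it back out as a Lance dataset.
lance.write_dataset(table, "pets.lance")

# The reverse migration is just as short, so there's no lock-in:
pq.write_table(lance.dataset("pets.lance").to_table(), "pets_again.parquet")
```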
So on top of Lance we've built LanceDB, the vector database, which we'll talk about later. You can also use Lance in model training to speed up your training data loaders and keep your GPUs well fed, and to make it much cheaper to manage data at high scale, like multi-terabyte to petabyte image datasets. And Lance integrates automatically with analytical tools like DuckDB, pandas, Polars, and also Spark.

One interesting thing about performance, going back to that chart from earlier: at the core of it, the reason random access matters is filtering and indexing, whether that's a vector index or a scalar index like a B-tree. At the end of the day you're selecting some subset of your data, and the parts of the index point to the matching rows in your dataset, which are spread out all over it. That's why random access performance is actually really important.

So let's take a short break and see the Lance format in action. I've got two identical datasets set up. One is a Lance dataset called pets; this is essentially the Oxford Pets dataset, with a vector, the species metadata, the URL, and the actual image binary itself. And I've created the same dataset in Parquet, via PyArrow. What I'm going to do is select a random row from both datasets. I'll use NumPy to generate a random index, then take that single index from the Parquet dataset. Let's run it a bunch of times: this comes out to roughly 350 milliseconds, and, whoops, that was count_rows, this dataset is only about 7,000 rows. When you scale up, that time goes up too, so it becomes a lot slower at large scale. Now the same thing, but against the Lance dataset instead: run it a bunch of times, and you can see it's about 171 microseconds, roughly 2,000 times faster to select out the data that you want (there's a sketch of this comparison below). And of course you can use DuckDB to query these datasets. I won't show the Parquet side because it's similar, but for example, against my pets Lance dataset I can directly run SELECT url FROM pets WHERE species = 'dog' LIMIT 10, and you can use this from SQL queries and other SQL engines as well. Cool.

And if we go back to that paper with the benchmarks and put Lance next to Parquet and ORC, the summary is that across these situations, Lance is at least one to two orders of magnitude faster than the faster of Parquet or ORC.
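Here's roughly what that point-lookup comparison looks like in code. This is a sketch with hypothetical local paths standing in for the demo's datasets; the timings quoted above came from the live demo, so yours will vary:

```python
import numpy as np
import lance
import pyarrow.dataset as ds

pq_pets = ds.dataset("pets.parquet", format="parquet")  # hypothetical path
lance_pets = lance.dataset("pets.lance")                # hypothetical path

# Pick one random row index out of the ~7,000 rows.
idx = np.random.randint(0, pq_pets.count_rows(), size=1)

# Parquet: a point lookup has to decode a whole row group (~hundreds of ms here).
row_pq = pq_pets.take(idx)

# Lance: the same lookup is a true random access (~hundreds of µs here).
row_lance = lance_pets.take(idx)
```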
Today the Lance columnar format is used in a variety of situations. One is large-scale data mining for autonomous vehicles: you've got petabytes of image data with deeply nested AV sensor data, and you want to do things like describe a situation where the model, and the car, applied the brakes when it shouldn't have, or find edge cases that might point to unsafe driving behavior. Number two is generative AI: a lot of AI-native, multimodal companies collect vast amounts of generated image data and want to use it for training, analytics, or debugging, and Lance makes that a lot cheaper and a lot faster. And the last bucket is semantic search for LLMs and recommender systems, which includes RAG, agents, and things like that.

OK, I'm going to speed through this next section a little, just to whet the appetite on what makes the Lance format actually different and how we're able to achieve these performance improvements, in a couple of slides.

One is the data layout and the encoding. In Parquet, when you have an array of strings, the layout interleaves offset, data, offset, data, so to retrieve one row out of your array you have to read the whole row group to get at that one particular row. If you're storing integers, reading a thousand of them isn't that bad, because you're limited by the minimum block sizes of the storage technology anyway. But if you're thinking about images, embeddings, or point clouds that can reach hundreds of kilobytes to hundreds of megabytes per record, then reading out the whole row group becomes prohibitively expensive. With Lance, we separate the offsets and the data in the encoding, so that using the offsets you get constant-time access into the array, and the offset read time is amortized.

Number two, the scanning and IO execution are also very different in Lance. We use a technique called late materialization. Say, as in the top right here, we're running a SQL query that selects the ID, the timestamp, and the LiDAR point cloud from a dataset, given some filters. The regular plan scans all the columns appearing in the SELECT and WHERE clauses, runs the filter, runs the limit, and then returns just the selected columns. The problem is that the LiDAR cloud column is presumably very large, so you read out that whole huge column, filter it, and then throw away most of what you read, which can be really slow. Instead, the Lance scanner scans only the predicate columns from the WHERE clause first, and only at the very end does a take on the projected, very large columns (there's a conceptual sketch of this below). For columns that are large binary blobs this is much faster, and it's only possible if your format supports fast random access.

And the third thing is storage. With modern NVMe drives and cloud object storage, you have much better support for high parallelism and very deep IO queues. So for a lot of these applications we want to make the IO scan process very shallow and very wide: store a lot of the metadata in an external data structure so it can be cached, and then, to fetch the data a particular query needs, just issue lots of parallel requests. That gives a lot better performance as well.
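Lance's scanner does this late-materialization rewrite for you when you pass a filter; spelled out by hand, the idea looks roughly like this. The dataset and column names are hypothetical, and a real index would hand you the matching row positions directly instead of the predicate scan in step 1:

```python
import numpy as np
import lance

scenes = lance.dataset("scenes.lance")  # hypothetical dataset

# Step 1: scan ONLY the small predicate column (what the WHERE clause touches).
speed = scenes.to_table(columns=["speed"])["speed"].to_numpy()

# Step 2: turn the predicate into row indices, applying the limit early.
indices = np.nonzero(speed > 30.0)[0][:10]

# Step 3: only now `take` those few rows of the expensive columns.
result = scenes.take(indices, columns=["id", "timestamp", "lidar_cloud"])
```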
The last thing I'll mention is versioning and time travel. We all know that with Parquet you can't append to existing Parquet files; they're designed to be immutable. What we do with Lance is manage multiple Lance files in the Lance table format, so we automatically version your data for you. You can append, delete, or add and drop columns; we automatically create new versions; and you can time travel, so you can say things like: run this query on this dataset as of yesterday, or the day before. In machine learning, if you want to reproduce a result from a week ago, when maybe your model gave some weird output or you noticed a problem, that's essentially free. And the same goes for time travel and restoration: in production, if you find that your data pipeline messed up, going back in time and restoring a previous state of the database is often an expensive operation, but with Lance it's essentially free and instantaneous.

OK, we talked about the vector index. The table format makes room for you to integrate secondary indices: vector indices, and scalar indices for string and numerical columns. These indices can be fairly large, and that's why vector databases today are very expensive: everything has to be stored in memory, so if you have a lot of data it's not very scalable, and you have to separate it out and worry about sharding, which makes your architecture very complicated. With Lance, everything is disk based, so we can separate compute from storage, and that lets us scale up really well.

And that's the basis for the first product we've made on top of this format: LanceDB, a vector database for AI retrieval. Not only is it really scalable, it's very cost effective, and it runs in process, kind of like SQLite or DuckDB: you don't have to worry about connecting to a server or about operations, and it's super flexible.

So let me take a really quick look at a demo I've set up with a LanceDB table. I'm going to connect to a local directory that has some LanceDB tables in it. This is the Oxford Pets dataset: basically a bunch of images of cute cats and dogs, and we're going to search through them. With LanceDB you can use an embedding registry to abstract away all the embedding mumbo jumbo; you can just say: I'm going to use OpenCLIP for this table. And you can use Pydantic to make it easy to declare the data model for your retrieval. Here the pets table has three columns: vector, species, and URL. I'm not going to bother re-embedding everything; I'll just open the table. Then you can call table.search, put in a text query, convert the results to Pets, the Pydantic model, and show the image. So that's multimodal semantic search and retrieval in just a couple lines of code, without having to worry about the embedding generation process at all. And what's interesting about LanceDB is that because it's multimodal, you can search directly using an image. I've got a local query image of some baby Samoyeds, and I can feed that query image directly into the search process, and I get back another Samoyed from the dataset.
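Here's a sketch of roughly what that demo does with the LanceDB Python API; treat the table name, column names, and local paths as assumptions from the talk rather than a copy of the actual demo notebook:

```python
import lancedb
from lancedb.embeddings import get_registry
from lancedb.pydantic import LanceModel, Vector
from PIL import Image

db = lancedb.connect("./lancedb_demo")  # hypothetical local directory

# Register OpenCLIP once; LanceDB then embeds queries and rows for us.
clip = get_registry().get("open-clip").create()

class Pets(LanceModel):
    vector: Vector(clip.ndims()) = clip.VectorField()
    species: str
    url: str = clip.SourceField()  # the column embeddings are computed from

table = db.open_table("pets")

# Text query: embedded with OpenCLIP under the hood.
hits = table.search("fluffy white puppy").limit(3).to_pydantic(Pets)

# An image query works the same way, because the model is multimodal.
query_image = Image.open("baby_samoyeds.jpg")  # hypothetical local image
hits = table.search(query_image).limit(3).to_pydantic(Pets)
```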
Now, what's interesting too is that I can also do filters, something like WHERE species = 'cat'. So I'm searching for images that look like this one, but that are actually cats. And the result makes sense: I'm searching for fluffy, white features, and it gives me a cat that matches. OK, and continuing on: let's say I make a mistake and delete all the cat pictures from this dataset, so I end up with only dogs, and then I realize: oh, I messed up, I want to undo that. I can just say restore my table to the previous version, and instantaneously, without having to do anything else, I get my cats and dogs back.

The last thing I mentioned was scale, which comes from the disk-based index and the columnar format. It enables us to serve really large datasets with not that much compute. I've got this table called test1b; it's hard to count the digits on screen, but there are nine zeros, so that's one billion vectors in this table. This is running on my laptop; there's no cloud magic happening behind the scenes. And if I run a timeit, feeding in random search vectors, it's letting me search through a billion embeddings on my laptop in just a couple of milliseconds. That's the power of better storage and better low-level data infrastructure.

Cool. I've had a lot of fun working on this, and there are lots of very exciting things to come: null support; sparse vectors, which will be useful for biomedical and healthcare; full-text search; advanced file encodings and data compression; and more native ecosystem integrations with all the tools you're familiar with, from Spark to Ray to Polars and beyond. If you're building generative AI applications or you're managing multimodal data at scale, please check us out. The Lance format is at lancedb/lance, the open source vector database is at lancedb/lancedb, and if you're looking for tutorials, vectordb-recipes has more than a dozen worked examples, in both Python and TypeScript.

So I've got a couple of minutes left for questions. Usually as a speaker I try to time it so I run out of time exactly and don't have time for questions, but I think I messed up today, so we've got roughly two minutes.

First question from the audience: is the Lance format a prerequisite to use LanceDB; does every record have to be in that format? Internally, yes, but when you use LanceDB the data storage is taken care of for you: you can insert data into LanceDB in whatever format you want, and it does the conversion internally. If you have a pandas DataFrame, a Polars DataFrame, an Arrow table, or just a bunch of Python dicts, you can just shove them into the database (there's a small sketch of this below).
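A minimal sketch of that flexibility, with toy data and a hypothetical local path; LanceDB accepts several in-memory formats and converts them to Lance internally:

```python
import lancedb
import pandas as pd
import pyarrow as pa

db = lancedb.connect("./lancedb_demo")  # hypothetical local directory

# Create a table straight from Python dicts...
table = db.create_table("toys", data=[
    {"vector": [0.1, 0.2], "species": "dog"},
    {"vector": [0.3, 0.4], "species": "cat"},
])

# ...then append from a pandas DataFrame or an Arrow table; same API either way.
table.add(pd.DataFrame({"vector": [[0.5, 0.6]], "species": ["dog"]}))
table.add(pa.table({"vector": [[0.7, 0.8]], "species": ["cat"]}))
```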
Second question: there are multiple formats and different types of vector databases, and there's never a one-size-fits-all; what are the major cases where LanceDB is not the right choice? So, for example: say you're at really small scale, tens of thousands of vectors, your data already lives in Postgres, and you don't need advanced retrieval on top of simple vector search. If you've already built expertise around Postgres and you don't need high scale or advanced retrieval features, then just stay in Postgres. Or, as of today, if you want a cloud-hosted solution that's SOC 2 and HIPAA compliant and you can't wait three months for us to get there, then another solution is probably the right choice right now. Those are generally the situations where LanceDB is probably not the right choice. OK, thank you.