The Databases for Machine Learning and Machine Learning for Databases Seminar Series at Carnegie Mellon University is recorded in front of a live studio audience. Funding for this program is made possible by Google and by contributions from viewers like you. Thank you.

Alright guys, welcome back. It's another seminar here at Carnegie Mellon University. We're excited today to have Chang She. He is the co-founder and CEO. Is that correct? Yeah. Okay, co-founder and CEO of LanceDB. They are building a new file format, and a query execution engine on top of it, to replace Parquet and ORC, which is super fascinating. That's of course why we invited him to give a talk, because it's super relevant these days. As always, if you have a question for Chang as he's giving the talk, please unmute yourself, say who you are, and ask your question, and feel free to do this at any time. If you can't unmute yourself, post in the chat and we'll interrupt. Again, we want to do this as Chang gives the talk, so he's not talking to himself for an hour on Zoom. So Chang, thank you so much for being here. You're at MIT right now, so we appreciate you calling in remotely. The floor is yours. Go for it.

Thank you, Andy. Thanks for having me, and thanks everyone for coming. If you have questions, Andy, please interrupt me, because I've gone full screen on the screen share, so sometimes I won't see the chat. Feel free to just interrupt me at any time. I'll take care of it. Thank you.

Okay, so today I want to talk about something we've been working on for roughly the past year, a new columnar data format called Lance. Quickly, about myself: my name is Chang, CEO and co-founder of LanceDB. As Andy said, I've been making data science and machine learning tooling for roughly two decades at this point. I was one of the original contributors to the pandas library, way back, about 13 years ago now. Then I was CTO of DataPad with Wes McKinney, the creator of pandas, and ended up at Cloudera through that. After that I was VP of engineering at Tubi TV, a streaming company, where I really got into recommendation systems, MLOps for recommendation systems, and experimentation.

So why are we doing this? When Andy invited me to the seminar, it was something I was really excited about, because ever since pandas I've been interested in this marriage of data systems and machine learning, or data science. Especially at Tubi, where recommendation systems mix structured and unstructured data, it was very exciting to see that when data systems and machine learning fit really well together, the tools we get are much better and companies can actually be run much better. The first-hand experience I had at Tubi was just how much difference a good experimentation platform for machine learning can make for a company. When I started there, we were maybe running one experiment every two months or so. I spent two years making the data systems for analytics and the ML fit well together, and in the end, after two years, we had something like 30 to 50 experiments running at all times across the company. It was amazing to see how much faster progress and iteration got.
However, the problem is that this requires a lot of what I call couples counseling, both technical and non-technical. Especially now with unstructured data: the scale is off the charts, the data types are definitely weird, and the workloads are much more complicated than before. And with vector databases in particular, we sort of got sucked back into the early 2010s; you'll see what I mean in a second.

With tabular data, a floating point number is just four bytes, or maybe eight if you've got a double. With unstructured data, the logo on the MLDB seminar website alone is 145 kilobytes, many orders of magnitude larger. So even if the row count isn't that huge, you get to petabyte scale pretty easily nowadays. Most startups working with unstructured, multimodal data very easily reach multi-terabyte scale, and for a small team that's already difficult to handle. For certain teams running popular generative applications, you get 20 people sitting on several petabytes of data. What that means is that whatever your data system is going to be, object storage must be front and center, along with all of the consistency and performance issues that come with it.

In addition to all the types we're used to in tabular land, we now have vector embeddings, images, audio, video, and point clouds. None of these fit into traditional databases well. There are various plugins for Postgres and others, but none of them work all that well at a native level, mostly because databases and traditional formats, as you probably all know, are not optimized for large blob storage. We're also missing lots of semantic types, which turn out to be very important for building tools on top of this data. If your BI or analytics tool can't tell that one binary column is an image and another binary column is audio, you're going to have to do lots of type checking, or just throw up your hands. And on top of semantic types, we're also missing the common transforms: for images you might want to do shuffles, rotations, and adding noise, and it would be really nice to push those down to the database level with native support.

As a result, the workloads are a lot more complicated, and Parquet is really not good enough for them. EDA for ML, and also debugging, requires fast random access: you might run a filter that picks out ten random images spread throughout your dataset, and you want to load those up for exploration, or just to look at a bad case, so you need to reach them across the dataset very quickly. Training requires fast shuffling; especially as models get larger, people are finding that within-batch shuffling sometimes isn't enough and you might need global shuffling, which also requires fast random access. And TFRecords and friends are not good enough either; they're pretty good for training.
But it turns out we still need fast OLAP for the filtering and analytics that select the right data and feed it into your GPU, and TFRecords and similar formats that are overly focused on tensors tend to be missing this layer.

Third, reproducibility, versioning, and tracking are even more important in ML, where you might need to checkpoint models, or you might introduce some new data and find that your model scores mysteriously go down. We need a flexible table format to help track all of that, but you don't necessarily want to connect PyTorch to a Hive Metastore just to get access to your dataset.

And the last thing: there are something like 20 vector databases nowadays, but if you look across them, I find vector databases a little bit retro. What I mean is that they look less like full-fledged databases and more like wrappers around an index. Most of them don't offer the data management capabilities of a database, or storage around that index; it's kind of like paying for a SaaS B-tree index. Once you get the vector results back, you're not going to serve vectors to users; you're not going to show someone a 1536-dimensional vector and say this is the document or image you wanted. You have to go somewhere else to fetch the actual thing to serve back to users. That's just one example of why they don't look like full-fledged databases to me. Deployment also feels like the early 2010s, like capacity planning for a Hadoop cluster: you're choosing instance types and instance counts, and if you look at your index wrong you have to re-index everything. And lastly, there's no separation of compute and storage, which becomes very expensive at scale. Very early this year in generative AI we thought we didn't really need scale; everyone was running around with 10,000 or 50,000 vectors. But as people move applications into production and get serious, embedding volumes are becoming much larger, and even if a single dataset doesn't have a huge number of vectors, teams may have lots and lots of small datasets that add up to a large number. Sorry, I don't know why my slides aren't advancing. Okay.

You certainly don't have to take my word for it; you can take Andy's. This is a chart I stole from the recent paper Andy wrote with Wes, "An Empirical Evaluation of Columnar Storage Formats." If you look at the timings for Parquet and ORC on vector index search, which is representative of a lot of machine learning workloads because it requires fast random takes, the numbers are very lackluster. There are some interesting patterns too, like the comparison between Parquet and ORC flipping between SSD and S3. So to solve these problems, we started a project called Lance around the middle of last year in C++, and then rewrote it in Rust at the very beginning of this year.
So the version you're looking at is a little less than a year old. Lance is a file format: .lance files are columnar storage that supports both fast scans and fast random access. That's our analog to Parquet. But Lance is also a table format: you can group Lance files into what we call Lance datasets, which carry metadata that provides transactions and secondary indices on top of the Lance files. That's an analog to Iceberg, or a more lightweight version of Iceberg for machine learning.

Where does Lance fit in the stack? It's a disk format, obviously, since it's an analog to Parquet, and our main interface to the world is Apache Arrow. Given the background of folks here, you all know Arrow is the open standard for in-memory representation. Partly this is because it's an easy way for us to integrate into the ecosystem: we started out with just two people, we're now just over ten, with maybe half the team working on the format, so resources are constrained, and with Arrow it was a lot easier to plug into the rest of the tooling ecosystem. It also made migration super easy: converting to a Lance dataset from other columnar formats is now literally about two lines of code (see the sketch below).

Today Lance is used in a variety of settings. We started out in autonomous vehicles, where companies are storing petabyte-scale data in Lance for large-scale data mining. For example, on the Bay Bridge in San Francisco there was an accident where a Tesla's algorithm applied the brakes when it wasn't supposed to, causing, I think, a fatality or something like that. As soon as that happened, every autonomous vehicle user we were working with panicked and said: go over the petabytes of data we have and find all the instances where the model did something like that, applying the brakes when it shouldn't. With Lance, it was much easier for them to run an OLAP-style data mining query over this very deeply nested vehicle sensor data. Next are users who use us as their data lake for generative training; these are multimodal generative AI teams, image generation and the like, looking at a petabyte of images for training, analytics, debugging, and also serving via vector search. The last bucket of use cases for Lance today is semantic search, for both LLMs and traditional recommender systems, where we've been serving something like a billion embeddings generated per version of the model. We can do this on a single node, as I'll show you later, and it gives very low latency if you have a fast NVMe drive.
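For readers following along, the "two lines of code" migration mentioned above goes through Arrow, roughly as in the sketch below. This is a minimal sketch, assuming the pylance Python package and that lance.write_dataset accepts an Arrow table, as described in the talk; the file paths are made up.

```python
# Hypothetical sketch: convert an existing Parquet file to a Lance dataset via Arrow.
# Assumes the pylance package ("pip install pylance"); paths are illustrative only.
import pyarrow.parquet as pq
import lance

table = pq.read_table("images.parquet")       # any Arrow-compatible source works
lance.write_dataset(table, "images.lance")    # write it back out as a Lance dataset

# Reading also goes through Arrow, so downstream tools keep working unchanged.
ds = lance.dataset("images.lance")
print(ds.count_rows())
```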
All right, a couple of benchmarks. The first is a more high-level benchmark where we use Oxford Pets, a dataset of maybe 8,000 really cute cat and dog pictures, and we ran some computer-vision-focused workloads: computing the label distribution (a value_counts by class), a histogram of the areas of the bounding boxes, and retrieving the inline images based on a filter and a row selection. The first two are essentially just scans over tabular data, and we're pretty much on par with Parquet there. The raw column here is just the raw dataset, which is images plus XML and text annotations. The last workload is where retrieving images from across the whole dataset comes into play after the filter, and there we're about two orders of magnitude faster than Parquet.

We also replicated the benchmarks in Andy and Wes's paper and added Lance to them. In both scenarios, on SSD and on S3, Lance is at least one order of magnitude faster than the faster of the two formats. You can't really see it here, but it's about two milliseconds to fetch the data on SSD, and something like a few hundred to a thousand milliseconds to fetch from S3. This is on about 100 million rows from the LAION-5B dataset. So we're pretty happy with how performant Lance has been, but there are lots of new things still to do.

Hi, are you going to talk about the root causes for why that was so much faster? Absolutely, we're going to dive in right now. Will you connect the root cause of the performance difference in the previous slide to these techniques? I don't know if it will be as crisp an answer as that, but we'll see when we dive in; it's a great question. There are a couple of things. One is the encoding and how the data is laid out; for Parquet in particular, the layout prohibits us from delivering fast random access. The IO plan is also very different for Lance versus a typical scanner for Parquet or ORC, mostly because we want to support large blobs. And the dataset layout and indices don't play into this particular benchmark, but they have particularly nice properties for machine learning, as we'll see later. Great, thank you. Cool. Any other questions before I dive into the details? Cool.

All right, let's dig in. As I mentioned, we tried to design Lance to be good for both large scans and random point queries, and to support storing large blobs, so there are some counterintuitive things in Lance compared to traditional OLAP columnar designs. The simple design principles are: don't scan more data than Parquet or ORC would, deliver constant-time lookup for a single row, and amortize any metadata overhead over many lookups. And obviously, Parquet was designed around, what, 2010 or 2012? 2011. 2011, that's right. Storage technologies have changed a lot since then, so we wanted to revisit a lot of the assumptions around concurrency, threading, and block sizes, and pick more reasonable defaults.

The simplest place to start is our plain encoder.
The plain encoder is for fixed-size data types: numeric types, and also embedding vectors and tensors. It's pretty simple: direct, constant-time access, with the offset computed on the fly, so it's easy to support both fast scans and fast random access. One important thing is to support different tensor layouts, so that when you pull data out of storage you don't have to do an in-memory transpose before you can feed it into the GPU. One thing still missing here is null support for the plain encoder; the plan is to add a validity bitmap within each block so that it doesn't negatively impact performance. So that's the plain encoder, pretty simple.

Next is the binary encoder, for variable-length strings and bytes; this is where we would store things like images and point clouds, and it's one of the biggest differences between Lance and Parquet. If you're familiar with how Parquet is laid out, you get offset, data, offset, data, with everything interleaved, so you have to read out the whole row group in order to figure out where a single row is. In the benchmarks where you see Parquet performing badly on random takes, or on fetching ten images from across the dataset, this is one of the primary reasons, and it's greatly exaggerated when your record size is big. Vector embeddings, say OpenAI embeddings, are medium-sized, but images are much bigger, and point clouds can be dozens of megabytes, which makes it kind of insane to read out 10,000 of them.

Did you consider putting a prefix or a sketch in the offset array? So that instead of having to go deep, you follow the offset jump to the location and then check whether you have a match; for strings, if you just stored the first two characters in a small number of bits as part of the offsets, you'd at least get some filtering without having to decompress the rest. Right. We went through a couple of choices on the encoders, and I forget now why we didn't end up doing that. We tried for months to make all of this work on top of Parquet; if I remember, I'll come back to it. I mean, it only helps for strings; with images you have to see the whole image to figure out what's in it, but for strings you could do it. Yeah, I think the images were likely the reason, because we were very focused on computer vision in the very beginning, so we wanted this encoder to be general across these data types. And you're not really doing any filtering on the images anyway, right? You do some kind of matching on metadata in some other column, and from that you take the offsets and figure out, okay, I want the third image in this column, and you just jump to where it is. Correct, yeah.
For Lance, we store the offsets together and the data together. For a given block you read out the offsets all at once, and once you have the offsets you get constant-time access; you hold the offsets in memory and amortize that cost over many lookups. We haven't implemented these yet, but more advanced encodings are on the roadmap: run-length encoding, where lots of repeated data compresses naturally while still supporting fast lookups, and variable encodings, where each chunk can have its own encoding. Again, the goal for those encodings is more compressed data that still supports fast lookups; we'll probably get to them early next year or so.

Okay, let me pause here and turn a question back to the audience: we laid out the data differently, and this difference in layout lets us deliver both fast scans and fast point queries. What's the trade-off we made? Where's compression? Yeah, exactly: it's much harder to implement compression, because we can't do Snappy or any other file-level compression here. For us, because we're focused on these large blob data types, it actually makes sense. If you have a dataset with images and such, the dataset size is dominated by that column, and if we're storing, say, the JPEG bytes, those are already compressed at the record level, so the final dataset size isn't all that different between Lance and compressed Parquet. If you only have tabular data, then compressed Parquet will be noticeably smaller than a Lance dataset. That's basically the trade-off, and because of object storage the additional cost isn't huge, as long as you're not storing all of it on NVMe or something like that. I think this trade-off will shrink over time as we add more advanced encodings; I don't think we'll ever completely eliminate it, but it can be substantially reduced.
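To make the lookup story concrete, here is a small illustration in plain Python of why both encoders give constant-time row access. This is a conceptual sketch, not Lance's actual Rust implementation: fixed-width values are found by pure arithmetic, and variable-length values need only the block's offsets array, which is read once and then amortized over many lookups.

```python
# Illustrative only; not Lance's actual implementation.

# Plain encoder: fixed-width values, e.g. a 768-dim float32 embedding per row.
def plain_value_range(row: int, value_width: int) -> tuple[int, int]:
    """Byte range of one fixed-width value: a single arithmetic step, no metadata."""
    start = row * value_width
    return start, start + value_width

# Binary encoder: offsets stored contiguously, separate from the values.
# Read the block's offsets once, then each row is one (start, end) slice.
def binary_value_range(row: int, offsets: list[int]) -> tuple[int, int]:
    """Byte range of one variable-length value, given the block's offsets."""
    return offsets[row], offsets[row + 1]

# Example: row 2 of an embedding column and of a blob column.
offsets = [0, 120, 7_500, 7_650, 1_000_000]      # made-up offsets for 4 blobs
print(plain_value_range(2, 768 * 4))             # (6144, 9216)
print(binary_value_range(2, offsets))            # (7500, 7650)
```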
All right, there are lots more details about the encodings, but in the interest of time let's talk about the next topic, which is IO execution. The thing that's different about large blobs is that we need late materialization, which we'll look at in a second, and with modern storage it's much easier to flatten out the IO tree for much faster performance. Consider a query like the one at the top right: select id, timestamp, and the lidar point cloud from the dataset where velocity is greater than 10 and tag equals error, limit 10, offset 100. That's a pretty common query in an autonomous vehicle shop.

The typical OLAP plan for this starts at the bottom: you scan all of the predicate and projection columns, then you run the filter on velocity and tag, take the limit and offset, and finally project out just the projection columns. The problem is that lidar clouds can be quite large, so if you scan the lidar column up front and then filter, then depending on the selectivity of your filter you're potentially reading maybe 10 to 100 times more data than you have to. So for large blobs, what Lance does instead is late materialization: we only scan the predicate columns, and we don't take the projection columns until the very end. This is only possible if your data format supports fast random access, because that take operation at the end hits an essentially random set of row IDs across your dataset. This is another big reason for the speedup of Lance versus Parquet or ORC. To be very clear, this late materialization is a limitation of Parquet and ORC specifically; this is the column store stuff from 20 years ago, everyone does this. Right, yep.

The second piece is that nowadays people store data either on modern NVMe drives or on object storage. NVMes have very deep IO queues, and object stores can support pretty insane amounts of parallelism. So we want to flatten out the IO tree, to have as few data dependencies between IO calls as possible and instead issue a large number of parallel requests. That helps us optimize end-to-end performance for a lot of these calls, like vector search. I think those first two are the big pieces in making Lance really fast.

Now I want to talk about two things that make Lance more interesting and more useful. One is the layout. A Lance directory is laid out as follows: the directory is the table format, with a data subdirectory containing Lance files, which are the partitions of your data. There's a latest.manifest that points to the latest version, and all previous versions, your version history, are stored in these manifests. And then there are deletion files to support soft deletes and things like that. The idea is that for each version there is a manifest file that can point to index files and to multiple logical fragments, each of which can be one or more actual data files with its own deletion file. What this allows us to do, kind of like Iceberg, is support appends to columnar data, schema evolution, and so on, stored right there with the table. It also gives us really fast time travel, which is very important for machine learning: you can append data and add or remove columns without having to copy the original dataset, and then go back in time and say, give me that previous version.
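A minimal sketch of what that time travel might look like from Python follows, assuming the pylance API exposes versions roughly as shown; the path and version numbers are made up.

```python
# Hypothetical sketch of Lance time travel; the path and version numbers are made up.
import lance

ds = lance.dataset("s3://bucket/training_images.lance")   # latest version
print(ds.version)                                          # e.g. 7

# Roll back to how the table looked before the last ingest or column change.
old = lance.dataset("s3://bucket/training_images.lance", version=6)

# Nothing is copied: both versions resolve to manifests that point at the same
# underlying data files, plus or minus the fragments that changed between them.
print(len(ds.versions()), "versions retained")
```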
And judging by the popularity of Iceberg, we know this is an important feature even for tabular data, but for machine learning it's even more valuable because of the size of your datasets. With a petabyte of images, you don't want to copy the whole dataset just to run an experiment or add a new column, or because somebody screwed up and you need to roll back the last 10,000 images of changes. In the Lance format we also use this mechanism to support row-level deletes and updates. One thing we haven't implemented yet, on the roadmap for probably the second half of next year, is a write-ahead log that will support real-time, fast updates for both the data and the index.

Here's a simpler visualization of schema evolution. At the bottom you see v1: I wrote columns C1 and C2 into the dataset in the first version, and that got split into three fragments, F1, F2, and F3. Then I add a column C3 in the second version; the yellow files are the new column, we only have to write that new column, and the new manifest version points to both the new files and the old ones. Same thing for version three, where we both add and drop a column. Notice that we never have to copy the old data, and if we roll back in time, the C3 column that we dropped is still there. Let me check whether there are any quick questions before moving on to the last part, indices. I think we're good. Okay, cool.

Parquet had some mechanisms to be extended to support indices, but that never seemed to take off much. We first tried to build secondary indices on top of Parquet, but because of the take limitations it was never worth the effort in terms of performance, and that was another reason we designed Lance this way: adding indices alongside this file format actually makes a much bigger difference. Since we're in machine learning, the first index we built for our users was an ANN search index, but the design is extensible to other index types, and we're now working on scalar indices to support the fast filtering that goes along with ANN search. You've probably all heard the basics of vector search, so I won't cover them. What's different about the Lance index is this: if you look at the bottom right, pretend this vector space represents our index. The index has pointers to the actual row IDs in the dataset, so when you do a search on the index you can then go back to the dataset and select a bunch of columns. That makes it much more flexible in how, and what kind of, data you can store in the vector database. And because of the format, we also made the Lance vector index disk-backed. Right now the scaling limitation for vector databases is having to keep the index in memory, and the inability to separate compute and storage also comes from that same problem.
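Conceptually, the flow described here looks something like the sketch below: the ANN index returns only row IDs, and the columnar dataset serves whatever payload columns you actually need via a random-access take. The index call is a stand-in, not LanceDB's real internals, and the dataset path and column names are made up.

```python
# Conceptual sketch of "the index points back into the dataset"; not the real internals.
import numpy as np
import lance

ds = lance.dataset("photos.lance")                  # hypothetical Lance dataset
query = np.random.rand(768).astype("float32")       # stand-in query embedding

def ann_index_search(q: np.ndarray, k: int) -> list[int]:
    # Placeholder for the disk-backed ANN index lookup described in the talk;
    # faked with random row IDs so the sketch stays self-contained.
    return np.random.randint(0, ds.count_rows(), size=k).tolist()

# Step 1: the index search returns row IDs (plus distances), nothing else.
row_ids = ann_index_search(query, k=10)

# Step 2: fast random access ("take") turns those row IDs into servable results,
# e.g. the image bytes and caption you actually return to a user.
results = ds.take(row_ids, columns=["image", "caption"])
```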
So Lance reimplements a lot of the same vector algorithms, but makes them disk-backed rather than memory-backed. This lets us support not just vector search but also integrate things like a full-text search index, plain SQL, scalar indices, and so on.

Did you have to do something different for that architecture when putting it on NVMe storage, or were the algorithms pretty portable; did you need special extra structures in the indexing? We mostly had to be careful about laying out the data contiguously, because if it's disk-backed we're reading data off disk a lot more. To achieve something similar to memory-backed indices, we have to be very careful about parallelization and how we read and store the index data, but overall the algorithms remain the same, whether it's IVF or PQ. We haven't implemented HNSW; it's a bit harder to make a disk-backed version of that. We do have DiskANN, which is a graph-based index, just different from HNSW. Thank you.

On top of this, and you're probably sick of hearing about different vector databases already, we're building what we call LanceDB, for AI retrieval. It's a little different in that it runs in process, so there are no servers to manage; you install it like SQLite or DuckDB. It's lightweight but also powerful: you can search a billion vectors in milliseconds, just on your laptop, as long as you have a big enough hard drive, I guess. Because of this we're at least one order of magnitude less expensive to scale, and because of the columnar format backing it, it's also way more flexible: you can do vector search, keyword search, run OLAP queries, and all that.

That's pretty impressive, a billion vectors in milliseconds. Are you assuming the vectors stay in memory in that case, or is this when you're hitting disk and running at disk speed? This is when we're hitting disk. Do you have something in between, or are you purely going to disk every time you search? We do have some caching; parts of the index, certain partitions, can be cached and then evicted based on caching rules, so the memory pressure is actually very low when we do these searches. Are the partitions horizontal partitions of the vectors, or something else? These partitions don't require you to rewrite the data; when I say partition I don't mean an actual Parquet-style partition of the dataset itself, I mean partitions in vector space. Oh, got it. Yeah, so the index is stored alongside the data; there's an index block for each cluster in the vector space, but it doesn't require you to rewrite the data in the Lance-format dataset.

I just want to show you this one really small thing. Okay, hopefully everyone can read the font size here.
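Since the on-screen font may be hard to read in the recording, here is a rough reconstruction of what the demo does. It assumes the lancedb Python package (pip install lancedb) and a pre-built table of about a billion vectors; the directory name, table name, and vector dimension are made up and would need to match an actual table.

```python
# Rough reconstruction of the demo, assuming the lancedb Python API;
# the local path, table name, and vector dimension are illustrative only.
import time
import numpy as np
import lancedb

db = lancedb.connect("./lancedb_data")         # a local directory acts as the "database"
tbl = db.open_table("vectors_1b")              # ~1 billion vectors, indexed on disk

timings = []
for _ in range(100):
    q = np.random.rand(768).astype("float32")  # fresh random query vector each time
    start = time.perf_counter()
    tbl.search(q).limit(10).to_list()          # top-10 nearest neighbors
    timings.append(time.perf_counter() - start)

print(f"median search latency: {sorted(timings)[len(timings) // 2] * 1000:.1f} ms")
```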
Let's say I import lancedb, which you can pip install, and I have a local file directory that I connect to as my, quote unquote, database. I've got a table here with about a billion vectors. If I run a timed search like this, I'm generating a random vector each time in a loop and asking for the ten most similar vectors. If I run that, you're looking at roughly three milliseconds per search over a billion vectors, with a pretty minimal standard deviation.

Okay, so in that case all of that was probably still sitting in the file system cache, right? It never really hit the disk, perhaps. What do you mean, in that test I just ran? You're thinking it's not doing IO at that point; it's probably just getting served from the OS cache layer. We can just do one query, I guess. I see, no, okay, that's fine. Yeah, there is some caching going on, but if we just use time rather than timeit, it runs only once: that first run is when it's reading the core index, the centroids, and everything else off disk, so that's what the cold case looks like, and the amount of data you have to cache in memory is actually quite small. Here's what it looks like if I run it once: that first time it needs to hit the disk to read the index and some partitions, so the cold run is slower, but after that it becomes much faster. So yes, there is also the cold-start problem of caching things in memory or not. Great, thank you. Cool.

All right, the roadmap for the data format. We're currently working on partition pruning and stats-based pruning. Right now you can use DuckDB with Lance, or the Lance data format with DuckDB, but some of the predicate evaluation is still slower than what you would see with Parquet or Delta, because of this and because of other pushdowns; with that work we should see substantially faster predicate evaluation across the board. We'll have more complete null support: we support it for the variable-length binary encoder, but not yet for the fixed-width plain encoder, so that's table stakes. Then there are the advanced encodings we talked about earlier, like RLE and the variable encoder, and we're also thinking about time-series use cases with delta-of-delta or Gorilla compression, things like that. We're working on scalar indices right now. Further out we'll look at data compression, for example something like FSST, which can both compress and support fast lookups. And, maybe I should learn how to count, but the last item is deeper ecosystem integration: a native Spark data source, and having Arrow and pandas or DuckDB work with Lance datasets natively, rather than only through our Arrow interface.
Lastly, I want to say Lance is a team effort. Big thanks to Lei, my co-founder and the primary designer of the Lance format, and to Will, Weston, Rook, and Rob, who have all made invaluable contributions to the format over the past quarter or so, and to the many community contributors; the Lance format would not be possible without them. All right, that's it for the talk. I hope it was interesting and not just things you've already covered. We'd love your feedback on the format itself; you can find us on GitHub at lancedb/lance. If you're interested in the vector database, it's open source at lancedb/lancedb. We've also put together a bunch of examples in a repo called vectordb-recipes, where you can build anything from chatbots to multimodal search applications to document search. If you want to talk to us, we're on Twitter and LinkedIn as LanceDB, and you can join our Discord for live support as you play around with Lance.

Okay, awesome. On behalf of everyone, thank you so much. Now, open to the floor if anybody has questions. I have a question on workloads, maybe, and then I'll let others go. You have targeted SQL and targeted vector databases; do you see applications that stress both, from the types of things people are using LanceDB for? I'd like to get your thoughts on what the stress points are in the applications you're seeing. When you say stress points, what do you mean specifically? Like the lidar example you gave: that came from the IoT-style domain, and a lot of the stress there is very large datasets, mixing lidar point cloud data with other metadata, and fast searches that are perhaps needed in real time.

Yeah, it's a great question, and here the stress points are really the complexity of the tool chain, in two places. One: if you look at recent examples from LlamaIndex and LangChain, they talk about getting really high-quality retrieval by using multiple retrievers. You might have a vector retriever, a full-text search retriever, and maybe a SQL retriever, and you combine those results to get much better retrieval. But those top-level libraries then have to carry the logic of connecting to three different data stores, one per retriever type, figuring out which types of queries go to which store, fetching the results, and somehow syncing them back up together. With Lance, this just happens in one data store: you don't need to parse the type of query and figure all that out, you can just send it to LanceDB. That simplifies things a lot. And then having the data and metadata next to the vectors, or next to your index, also makes it easier down the road when you have to fetch the actual assets to serve back to users. So I would say those are the big stress points.
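As one concrete illustration of the single-data-store point, a vector query, a SQL-style metadata filter, and the payload columns you want to serve can all be expressed against the same table. This is a hedged sketch assuming the lancedb query builder exposes search, where, select, and limit roughly as shown; the table name and columns are made up.

```python
# Hypothetical sketch: vector retrieval, a SQL-style filter, and payload columns
# served from one LanceDB table instead of three separate systems.
import lancedb

db = lancedb.connect("./lancedb_data")
docs = db.open_table("docs")                       # made-up table: text, metadata, vectors

query_embedding = [0.01] * 768                     # stand-in for a real query embedding

hits = (
    docs.search(query_embedding)                   # ANN retrieval over the vector column
        .where("lang = 'en' AND year >= 2022")     # pushed-down SQL-style metadata filter
        .select(["title", "url", "text"])          # the payload you actually serve
        .limit(5)
        .to_list()
)
```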
I have a quick question. Sometimes you need a lot of trial and error with data before you say, okay, that's the dataset I really want to go forward with. There's a lot of discussion around this, and some tools are starting to implement branching and merging of data themselves. Any plans on your side? With the versioning system we have, it's pretty easy to implement branching already. I've not seen many people who need to merge datasets; you definitely want to merge code, but merging datasets hasn't been something people seem to want a lot. They definitely want to branch and manage different versions, which you can do today with Lance versioning. Okay, thank you. Yep.

Somebody in the chat is asking: what was the biggest motivation to rewrite in Rust, and how did that play out for you? Oh yeah. The biggest motivation was just productivity. We started out writing in C++, and then in December we were doing a little hack project to put together a query system demo for a prospective customer, and we said, hey, it's Christmas, let's give ourselves a little present and write it in Rust. For that project we had to rewrite maybe 10% of the read path for Lance to hook into the query engine, and we found a huge difference in how fast we were able to write Rust versus C++. That was number one. Number two, what really hooked us was Rust's safety: we were a lot more confident moving quickly and cutting releases quickly. With C++ there was always this nagging anxiety about where the next segfault was going to come from; I never felt good about releasing new C++ binaries into the world, and we just didn't have the resources to test them as well as, say, the DuckDB folks do.

Anybody else? Otherwise I'll ask more questions. My first question is, what's the encoding scheme you're using for the metadata? I think ORC uses Protobuf and Parquet uses Thrift. Are you doing anything special there, or? Protobufs. And that was just for simplicity? Yep, for simplicity, and I also think Protobuf's schema evolution support made it easy for us to iterate quickly on the format in the beginning without breaking old datasets. Got it. Okay. Andrew, do you have a question? No? Okay.

Do you compute any sketches or anything for images? For example, if it's a high-res image, compute a 100-by-100-pixel version of it, so you could do initial filtering on that and then decide whether to go look at the higher-res image. We don't do that yet. I think a lot of that will come after we sort out semantic type support in the format. Got it. And do you compute zone maps, particularly for the fixed-length data, like min/max summarization of columns and so forth? Yeah, so the stats work is ongoing right now; that's coming.
And would that be stored within the header of a row group, or as supplemental metadata? There are page-level stats that will be stored with the fragment itself, and then the fragment metadata and the column metadata can be stored externally; for object storage in particular, it would be great to have that metadata stored elsewhere so we can scan it more efficiently. Okay.

This last one is more of a philosophical question: why do you think it's a good idea to store the vector index with the data? Maybe if the index is versioned, then that's okay, but I'm thinking of the use case of, oh, I downloaded a new model off Hugging Face and now I want to create new embeddings for all my data. Is it just that because everything's together, it's easier to reason about from a software engineering standpoint, here's a directory with all the data for it? What was the design philosophy, the motivation, behind that?

Right, so yes, the index is versioned, so if you overwrite the dataset, that essentially invalidates the index and you can regenerate it, but you can still roll back in time, so you have a consistent index-and-dataset state per version. Certainly, having the two go together was about ease of use, both for queries at the top level and for users. And for us, we encountered a lot of ML engineers having a tough time managing the index separately from the data and then having to keep two or three stores in sync because of it. That not only makes their queries slower end to end, it also makes their pipelines a lot more brittle. Okay, awesome.

And then, if nobody else has a question, I'll ask sort of the last one. I was going to ask what things you wish LanceDB or the Lance format had now that it doesn't, but you already listed a roadmap, so maybe talk about the big vision: where do you want Lance to be in five years? Oh, that's a great question. My dream would be to make Lance so useful that in five years, if you are managing lots of multimodal data, be it images or point clouds or videos, or some mix of tabular data, vectors, and AI datasets, the first thought on your mind is, okay, I should be using Lance, because it's really easy to use and it's integrated into all the systems you use on a daily basis. And Lance would of course be open source, and hopefully it would be a standard, whether we're collaborating with others or merging with other formats. That would be the dream: one replacement for Parquet and ORC in this new world where data is much more than just tabular. Got it. Okay. And actually, this one we can cut, but if you don't mind saying something publicly, what's the business model for this? Because it's plumbing.
And as you said, you want it to be hooked into everything: if someone has something on Delta Lake, they can suck it in there; if they have something on S3, it's the Lance format. So what are you actually selling? Right now we're working on a hosted service for the cloud vector database as one commercialization, but there are lots of other applications, around compute, training, and visualization, that can be built on top of this kind of infrastructure.