The Carnegie Mellon Vaccination Database Tech Talks are made possible by OtterTune. Learn how to automatically optimize your MySQL and Postgres configurations at ottertune.com. And by the Steven Moy Foundation for Keeping It Real; find out how best to keep it real at stevenmoyfoundation.org. Today we're excited to have Stavros Papadopoulos. He is the CEO and co-founder of TileDB, and he's here to talk about the system that he's been building for several years now. I met Stavros through the collaboration between academia and Intel in the Big Data ISTC, founded by Mike Stonebraker and Sam Madden at MIT. Stavros was hired by Intel Labs to be a resident researcher at MIT working on this project. He has his PhD from HKUST, and he was also a visiting assistant professor there before he joined Intel Labs. As always, if you have any questions for Stavros as he gives this talk, please unmute yourself, say who you are and where you're coming from, and feel free to do this at any time. That way he doesn't feel like he's talking to himself. And again, Stavros, thank you so much for being here. Go for it. Thank you, Andy. Thank you everyone for attending. I'm very happy to be here and excited to talk about TileDB. So let me get to the point directly. A little bit of a disclaimer that I always add to my presentations: all the credit for the amazing work that I'm going to describe today, and every time I do a webinar, goes to our powerful team. If you want to see who we are, just visit this link at tiledb.com/about. You're going to see everybody's background and what they're working on. But I am the exclusive recipient of complaints, so if anything sounds wrong, you come to me. A few things about who we are. Indeed, TileDB started when I was at Intel Labs and MIT. It started as a research project, and then we had some success with the open-source part and some very highly visible genomics use cases, and we decided to spin it out of MIT in 2017. The project itself has deep roots at the intersection of high performance computing because of Intel, databases because of MIT, and data science because we got exposed to a lot of use cases from the scientific and the analytical domains as well. We have raised some money over the years, and we are well capitalized at this point. We are 14 members. And in addition to being, of course, software engineers working on the core database technology, we have a lot of expertise across application domains. This allows us to capture a lot of the use cases, see the common things, and compile those common things into a universal database, the one that I'm going to discuss. And we do have quite a bit of traction now with telecommunication companies, pharmas, hospitals and other scientific organizations, so you're going to see more of that very soon over the coming months. So here's what I'm going to talk about today. First, what the heck is a universal database and why is there a need for it? Why is no one building one? We are the only ones — we're not the first ones, but we are the only ones doing something like that. What is the TileDB secret sauce? How did we pull it off? And I'm going to explain this very important data structure called arrays, multi-dimensional arrays. I'm going to show you a little bit of the architecture and the features of the TileDB products and all of that. I'm going to talk a little bit about the use cases. And then just a couple of notes on the future of data management, the way I see it. Okay. And this is a personal opinion.
This presentation is a mix of a couple of webinars that we've been doing. You can find them online; I can share some links later, but I tried to create a good summary so that it's self-contained for this particular talk. All right. So let me give you a little bit of background in order to set the context for the universal database. First of all, everything that has been going on in relational databases is beautiful. I mean, this is solid work. We're talking about decades in the making, and we've done some amazing things as a database community, some amazing things. There's a lot of sophistication here. You name it: relational algebra and SQL, views, constraints, row stores versus column stores. We've done a lot. OLTP versus OLAP — by the way, this presentation is going to be mostly about OLAP, but I mention OLTP here as well; there has been work on both — shared-nothing architectures, and a lot more. So a lot of stuff has been done in relational databases. We should not forget; we should always study this. Anything that is being built today, even TileDB, is built on the shoulders of giants. So let's not forget about this. So this is good. Now, where's the problem? The problem is that in many disciplines, mostly the sciences, but not only the sciences — even in the business domain, when it uses science — a lot of data is being generated, much more data than in the past. In the past we didn't have this problem, but in the past decade we have been having this problem. I can enumerate a couple of domains, but it's not limited to these: genomics, petabytes; imaging, petabytes; LiDAR, SONAR, AIS — these are point clouds — petabytes; weather, petabytes, maybe exabytes with the simulations and everything; and many more, many more, a lot of data. So here are a couple of notes. This data is too big for traditional databases, even the ones we had in the past. There are some new architectures that are trying to catch up and handle this, but fundamentally it's a different beast. This is not what traditional databases are handling today. Also, this data is not very well represented by tables. So if you have a relational database, that's not a good bet for this kind of data. Other database flavors that give you more flexibility are not ideal either. The performance must be amazing here. You cannot just get away with a little bit of flexibility. You need to be flexible and performant, so let's not forget about that. And some news for you guys, if you don't know: not all scientists — I mean, people working in the sciences — like databases. They like other stuff, like Python, R, their own tools, not databases. So it's not a given that you're going to hand a relational database or a SQL database to those users and they will use it. It's a different beast; they use different tools. Another thing is the cloud, right? What happened, and what is shaping the database community — again, to set the context for universal databases? The cloud happened, storage and compute became cheap and elastic, and the cloud gained popularity. And the biggest thing that happened was the separation of storage and compute. And Andy, I was smiling the other day — I remembered one article that you published at Two Sigma, I don't know if you remember, in 2018 or something like that, where you mentioned this. I was reading it at the time and I was agreeing. We were building the company already, and I was telling everyone that this is a big deal.
Like, the shared-nothing architectures cannot work for the use cases that we're targeting at all. It has to be shared everything, shared disk. So this of course happened as we expected, and now we're seeing it more and more: everybody's moving away from shared nothing to shared disk. Cloud object stores are cheaper. And guys, when we talk about petabytes, this is important. If those organizations are moving to cloud object stores, there is a reason: it's very expensive not to go there. You need to go there. And the old database architectures did not work. The shared-nothing architectures did not work, and even some shared-everything architectures didn't work off the shelf, because the object stores have different SDKs, object immutability — it's different. These are not POSIX file systems. It's different. You need to architect the database differently from a storage perspective. So a new paradigm came along because of that: data lakes, lakehouses, whatever you prefer. And the idea is to store all data pretty much as flat files on some cloud store, use a hammer, which is usually a computational framework, and treat data management as an afterthought. This has happened. I don't want to name every single case, but this happens today. We don't like the databases, we forget about all the sophistication of databases, we dump our data as files in a cheap cloud object store, and we use a computational framework as the holy grail to solve our problems. And we solve nothing. The problem is data management, always. And I'm going to get back to that. So data management is a hack here, and that's problematic. The other thing that has been happening is machine learning. Huge hype. Everybody wants to jump onto the next thing, and for a reason — I mean, several organizations are starting to adopt machine learning. Machine learning is not the de facto standard in many organizations; they're starting to adopt it, mostly on the research side. And there are many great new frameworks and tools around ML. Super solid. People started to like coding — I mean, machine learning has attracted a lot of people to computer science and coding, right? Software engineering. So great, this is great. And machine learning catalyzed data science. Right now, data science sometimes gets conflated with machine learning; that's not the same thing. But data science skyrocketed because of machine learning. Now, there was one important mistake, and I don't know if this comes as a surprise to you guys, but again, a thought-provoking statement: people thought that machine learning is a compute problem. So tons of investments, GPUs and all of that to reduce the time, increase the performance. And this was solved — we did some great stuff there as well, right? But the problem is that machine learning is a data management problem. It's data management, guys. Models — you version them. Who has access to them? What kind of data provenance? What kind of data are you slicing? Who has access to it? Logging. Regulations. A thousand different things, all related to data management. The compute is kind of solved. So machine learning is a data management problem. All right. And then there was a mess. Because of all of that, and because we wanted to depart from the relational database since it didn't fit our purposes for whatever reasons, we created way too many formats and way too many files. And I will explain these cases later. This is a fact: we store them in cloud buckets.
And metadata hell, because those files need to be cataloged somehow and you need to attach some kind of metadata, so you create metadata systems. And then data sharing became overly complex. What are you going to do, IAM roles on AWS? This is not going to cut it, guys. You cannot have just file-based access policies. Most organizations want other semantics: I want access to be constrained to a gene for a particular population; I don't care about files. So this became super complex. Machine learning gave rise to all those feature and model stores. It's crazy out there, how many companies are getting created — thousands of data and ML companies and open-source tools. Cloud vendors keep pitching you guys that you need to have a special purpose-built system for every single thing, and they're giving you hundreds of tools with funny names. And what is the problem with that? The problem is that data management, including machine learning, became the noisiest problem space in the world. It's very noisy. Before, we had some monoliths with databases and SQL, and now it's just a mess. There are way too many systems and we don't know what to choose from. And, you know, this is David — I don't want to derail you, you're on a roll here — but when you talk to customers, what file format do you see the most often? The HDF5 thing, or something else? This is a great question. So, tons of Parquet files for tables — Parquet. Delta Lake they use if they're married to Spark, maybe Presto. But otherwise, they create their own hacked solutions. So they partition — you know how you partition the Parquet files hierarchically and whatnot? That's a hack, right? You hack it: create hierarchies and whatnot manually, and then you create a different catalog or you use something like Hive, and then everything's a hack. It's very purpose-built to your organization. And every organization does the same. So Parquet; tons of HDF5 files — HDF5 does not work on the cloud; there's a proprietary kind of system for that, but the actual files and the open-source library don't work on the cloud — so HDF5 files; a lot of NetCDF for weather, tons and tons of NetCDF for weather, which is essentially HDF5 underneath, a wrapper on the HDF5 format. And then what else? For LiDAR, LAS or LAZ — tons, petabytes, and we're talking about hundreds of thousands of files. And then in genomics, VCF. Genomics is very interesting right now; it's a beautiful and very challenging problem. So VCF. So you see, I already enumerated like five, six of them, and there are so many others: GeoTIFFs, TIFFs, PNGs, and I can go on and on. Tons. Awesome, thanks. That was good. All right, a summary, guys, because I gave a lot of context, but here's a summary. We are working with some organizations — and I like to do that, even from the inception of TileDB, I like the use cases that kind of matter for humanity. So we're working, for example, with big hospitals, trying to save infant lives in the ICU. And they're completely lost in the noise, right? And this is completely justifiable. Those people are scientists; they are medical doctors and whatnot, right? Like geneticists. And they're trying to learn about data systems to solve their problems, and they're completely lost. So one of two things happens, and this is happening routinely, guys. Number one, they get consulting, or they research on their own, and they use, let's say, 10 systems together.
And they try to orchestrate them, because, you know, they have one for clinical data, one for genomics, one for imaging, one for something else, and so on and so forth. And they need a big data engineering team to make sense of all of those and put them together, which is very, very difficult, especially on the access control and security side, which is certainly important for those organizations. Or they build everything in-house. They say, you know, I'm not going to deal with that, I'm going to build everything on my own — the way I told you before, Parquet files, VCF files, files, and some kind of a system that they build in-house. And it takes them years and a lot of money to do so. And we saw that in multiple organizations. So they build the same system with kind of the same principles, instead of using an actual, you know, enterprise-grade system that does all of that. That was our observation. So tons of reinvention of the wheel, tons of it — tons around, you know, byte-range requests on S3, which they do on their own with the SDK, for example; metadata, catalogs, all of that stuff, they do it again and again. Because there is huge overlap, right? It's the same system; they want the same solution. And scientists spend most of their time as data engineers. And please trust me when I tell you that those people are brilliant and they can do it — they can learn this stuff — but they don't want to. So effectively what's happening is that organizations lose time and money, a lot of it, and most importantly to me, science is being slowed down. This noise, you know, creates obstacles to scientific discoveries. We should all be stressed about this, guys, very stressed. So this is what's happening. All right. So what are we suggesting? We're suggesting something very ambitious, obviously. Let's see how close we are to getting there. Here's the idea. We create a single system, but with efficient support — emphasis on efficient, guys, because otherwise I can tell you ways to pull it off if it doesn't have to be efficient — efficient support for all types of data, all of it, even machine learning; all types of metadata and catalogs; backends; compute hardware, you name it; APIs; integrations; all of it. And on top of this, because it's a database, guys — let's not forget about the database aspects — authentication, access control, logging. This stuff is important for organizations. It's usually not important in open source, because you want to do the sexy thing and you don't want to care about these tedious things. But these are the tedious things that get you into Fortune 500 companies. Without those, you're just a tool. And infinite computational scaling — by infinite, of course, we mean that it scales with the number of machines that you have. Whatever you can pay for on Amazon, or if you build a powerful data cluster and you extend that cluster, it should scale. That's what I mean here. And by the way, it should scale beyond SQL. SQL, for sure, but also machine learning, linear algebra, which is at the core of a lot of advanced computations, and even custom computations — big emphasis on custom, because you don't know what these scientists want to do. Usually they know what they want to do today, but in the future they have other ambitions and they want to extend the system. So custom is very important here. You should let them do whatever they want, not just group-by and join. These are important, but more than that. And finally, sharing and collaboration — super important right now. Even monetization.
And by the way, I'm saying monetization here because it is derived from global sharing. Once you figure out how to do global-scale sharing, then monetization is just a good integration with something like Stripe or another payment system. And this is doable, so it's just a nice side effect of sharing across organizations. What are the benefits? I'm just going to enumerate them real quick. First of all, a single platform. One platform for the genomics data, the clinical data, the X-rays and so on. Or the same for, I don't know, maybe AIS data, which is ship locations, along with SAR data from satellite imaging, in order to do training, in order to do dark-ship detection. There are a lot of cool use cases that can be served by one system. A single data platform for authentication, access control, auditing, all of that — you don't need to orchestrate 10 different things. Global-scale collaboration: this should not be an afterthought. This should be at the forefront of the next generation of databases — how to collaborate, and to collaborate on runnable code, not just code on GitHub that you clone and try to deploy. You need to automate that stuff. The infrastructure should be automated; it should be just a function that I call. This is what I mean by runnable. Then, superb extensibility. It should work with any API. Why only SQL? It should be Python, and efficiently — not through ODBC connectors, more efficiently than that. So R, Julia, whatever — the tools that the practitioners are using. Then this helps with modularity and API standardization, so that, sure, there can be multiple universal databases. As long as they agree on the API, then let them compete. But let them compete not at the expense of the user. The user should continue to use one API that everybody agrees upon, or at least similar APIs — very similar to what we did with SQL — and let everybody compete. That's okay. There can be multiple universal databases. Then it should facilitate the creativity of the users, guys. I'm going to get back to that, I'm going to return to that. But the users should be at the forefront, at the center of it — not the database. It's the user and what the user can do with it. I'm going to get back to that. Finally, and, at least for me, most importantly, future-proofness. The database itself should be extensible if another data type comes into the organization. We're working, for example, with a reinsurance company. With insurance companies you may think, okay, they have tables. Well, they also have images. And they also monitor weather for catastrophic events and stuff like that. So what are you going to do? If you go with the data warehouse, which works with tables, what are you going to do with the images? What are you going to do with the weather? You're going to build another database; you're going to buy from another vendor. So this can be very future-proof if you can handle all the data and then you just go from schema to schema to schema in order to accommodate all of those use cases. And it should also extend to multiple backends. For example, I don't ever want to become obsolete only because we shifted from cloud object stores to something else, like NVMe or whatever. It should be built modularly, so that if another backend comes up, we don't become obsolete. There are databases that are becoming obsolete right now because they were slow to adopt the object stores. So future-proofness, super important. And finally, no more noise.
Hopefully, if people agree on those APIs, I'm going to say, okay, that's how I use it, that's how I customize it to my use case, and then I don't have to learn a thousand different things in order to pull off my use case. So, quite bold. Let's talk a little bit about why no one has built it, right? Or why no one is building it. So let's talk a little bit about other universal databases, because people can claim that this has been done before. There are ways to point to past systems and say, yeah, there were some efforts. So let me tell you two, which are kind of the obvious ones. First of all, anybody can bring up the old object-relational model, like Informix, for example, because anybody can say, yeah, I can model pretty much anything, like in C++, object-oriented programming, and I can have a very versatile programming model and data model. Sure, but this could get unwieldy. If you look at the complaints about the object-relational model in the past, it can become very, very unwieldy. Sure, a lot of flexibility, but the sky is the limit: you can do anything, and there is no coherency. This could get very unwieldy. And there was not enough focus on performance — a lot of focus on the flexibility part, but nothing around locality of results on the disk, vectorization, stuff that matters universally for all the data types, for all the classes that you're building in this model. So a lot on the flexibility, but not so much on the performance; and also not so much on interoperability, because this was done like 30 years ago. There was no Python at the time — people were not using Python, R and all of those languages as they do today, so there was no need for that. And also no focus on the backends, no focus on object stores, only because, again, that was not on the table. You had the monolith and that's it; they would use the best possible file system and that's it. So there is a reason why they didn't do that. So for anybody who has this in mind, that's not exactly universal, at least not the way we define it. And the other one, of course, is NoSQL, right? Anybody can say, okay, with NoSQL I can model pretty much anything: an image is a document in MongoDB or something else. Sure. But again, some images are a terabyte in size and you need to slice them. How are you going to do that? If it's a blob, it's the same as a file; you're just creating a catalog. It's not that you have an analysis-ready format for the images, the LiDAR and so on and so forth. Yes, you can do it, but not efficiently. So everything I'm going to explain today about TileDB is on the efficiency front, how we do this efficiently. But there can be a claim that, yeah, somebody else did it in the past with flexible models. Okay. You're going to see the difference soon. So why is no one building it now, though, right? In the general sense of universality. Think about it a little bit. We are completely stuck in an echo chamber. Why? Because take a look at all the cloud vendor marketing campaigns, guys. They talk about a thousand different purpose-built systems that they're selling to you. There's no incentive. There's no chat about it. Nobody is discussing building something universal so that this noise eventually stops and we help all those domains. They go and create one system for every single thing. So that's the trend right now. The other thing is that some purpose-built systems had success, and that was inevitable — if you build the very first purpose-built system, which crushes everything, of course it's going to succeed.
But then you create the second, then the third, then the fourth, then the thousandth, and this is when the problem gets created. With one system, it's easy; with a thousand, it's not. And there is tons of funding around. There's a lot of capital in the VC world right now and it gets constantly poured into incremental solutions. You know, I joke sometimes, and I don't hesitate to do it even here: you know the recipe. If you have a good alma mater, you come from a top school, and your system is on GitHub with many stars and on Hacker News, that's it — go create a startup, and that's going to get you $10 million today. This happens. So if we encourage this behavior, then the mess that I explained gets created — and prove me wrong, this is exactly what is happening today. Look at all the latest financings. So anyway, the other reason is that universality intuitively seems like a lot of work. You must be crazy to do it. I mean, when I was building the company — because I was trying to read as much as I could on the business side — I went completely against every single book that I read in terms of the go-to-market, the positioning and all of that. Usually you just target one niche: you go after genomics, that's it; you target LiDAR and that's it. You never position yourself as horizontally as we did. Our sales are vertical, as I mentioned in the beginning, but the positioning is: this is universal. Why should I hide it? It is, right? So universality seems like a lot of work, and there is no incentive for a founder to pitch it as universal, because it's going to be very difficult to raise money. The VCs want something much more targeted, with a go-to-market strategy, a specific budget assignment, and so on and so forth. So I did it the wrong slash hard way. It's paying off now, but at the time I didn't know. And finally, the most promising data structure got overlooked. What we're doing is using multi-dimensional arrays, and multi-dimensional arrays have been used before. But the problem is that they weren't used by solutions that got traction — and that doesn't mean that multi-dimensional arrays failed, it means that those solutions did not get traction. And also the arrays were never used to their fullest potential. Most of the use cases, and most of the storage engines or databases built on arrays, were on dense arrays — and I'm going to explain what that is in a minute. So, not the full potential of arrays. All right. So what are arrays? How were they used before, how do we use them, and why are we going to make a difference? So what is an array? An array is a multi-dimensional structure like the ones that I'm showing here in the figure. It comes in two flavors, dense and sparse. In the dense case on the left, the dimensions have to be integers, so the domains are finite. And you can have metadata. You have the cells, and the cells can have any value — it could be an integer, a string, a float, it doesn't matter. But the most important thing is that all the cells have values. This is very important: all the cells have values, and we never materialize the coordinates. We don't need to — I'm going to show you in a bit why we don't need to materialize those. All right. Then the sparse case is very similar to dense, but it comes with a couple of differences which make the engineering part very difficult. And this is also the differentiator.
This is the thing that the previous databases and engines did not have: they never entirely unified this model, both dense and sparse. So first of all, the majority of the cells can be empty. They can be empty for many, many reasons, but also because the dimensions can be heterogeneous — you can have one integer and one something else, a character or string or whatever. And also the dimensions can be non-integer: they can be strings, they can be floats. In other words, the space can be infinite. So how can you make it dense? How can you put a value in every single cell if the domain is infinite? Right? So this is important. Then there are some other details, like potential multiplicities and stuff like that, but the most important things are heterogeneous and infinite dimensions, and the fact that we don't materialize the empty cells. We don't put nulls, because the storage would explode. So this is what creates the difficulties in storing and processing and indexing and all of that, but this is what was missing. Because if you have both — by the way, these arrays give you very fast slicing in multiple dimensions. That's the important thing: you can slice fast. I'm going to tell you why this is very important. But this is the basic operator. The other operators you can build on top with nice engineering, but if you call yourself an array engine, multi-dimensional slicing is the primitive operation that you need to get extremely fast. This is what makes this very, very powerful. All right. So why arrays? Forget about your application, remove the jargon, and just think of the data that you have to store. Now try to reduce this problem, because at the end of the day you need to store this data in some kind of serialized form on a one-dimensional medium — byte-addressable, in the general case. Right? Also, think of any computation — a SQL query, a linear algebra query, it doesn't matter. You can always define it as a task graph (and I'm going to get back to that in a bit), where every task slices. Every task slices. So at the end of the day, you end up with bytes, and you need to slice fast, regardless of the application. It doesn't matter if it's genomics, it doesn't matter if it's LiDAR, it doesn't matter if it's tables. That's the bottom line, and you start from there. And if you get this very, very fast, then you can, again, start engineering the rest of the stack and get that part fast, and so on and so forth. But we're talking about the very basic operation, starting from I/O — always starting from I/O. So here's some truth: performance is absolutely dictated by the locality of the result of that slice on the one-dimensional medium. This one seems quite okay: you have two regions — could be one, could be more — but kind of self-contained and contiguous, so there is high locality of the result. We're talking about the result. Whereas this one is bad, right? You get one chance, even with replicas — one chance per replica — and if you lay out the data in a bad way for the majority of your workloads, performance will be bad. Even if you vectorize, even if you use GPUs, it doesn't matter: if I/O is impacted massively, you will never get performance, ever. Even if you parallelize, you will just be spending money. That's all. So result locality is what dictates the majority of the cost. And what are we saying about arrays?
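(Before getting to that, here is a minimal sketch of the two array flavors just described, using the open-source TileDB Python package, tiledb-py. The array names, dimension names, and domains are made up for illustration, and exact arguments may differ slightly across versions.)

```python
import numpy as np
import tiledb

# Dense flavor: integer dimensions over a finite domain; every cell has a
# value and coordinates are never materialized.
dense_schema = tiledb.ArraySchema(
    domain=tiledb.Domain(
        tiledb.Dim(name="row", domain=(0, 3), tile=2, dtype=np.int64),
        tiledb.Dim(name="col", domain=(0, 3), tile=2, dtype=np.int64),
    ),
    sparse=False,
    attrs=[tiledb.Attr(name="value", dtype=np.float64)],
)

# Sparse flavor: dimensions may be heterogeneous (here a string and a float),
# the domain is effectively unbounded, and only non-empty cells are stored.
sparse_schema = tiledb.ArraySchema(
    domain=tiledb.Domain(
        tiledb.Dim(name="name", domain=(None, None), tile=None, dtype=np.bytes_),
        tiledb.Dim(name="position", domain=(0.0, 1e9), tile=1e6, dtype=np.float64),
    ),
    sparse=True,
    attrs=[tiledb.Attr(name="value", dtype=np.float64)],
)

tiledb.Array.create("example_dense", dense_schema)
tiledb.Array.create("example_sparse", sparse_schema)
```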
Well, arrays give you the most flexible way — at least until somebody disputes that — to play around and lay out your data in so many different ways that at least one is going to fit your application, without building another database. We built one engine on arrays with different configuration parameters, and then it's a matter of choosing the parameters. But the system works; you don't need to overfit it to tables or overfit it to images. It's going to work for both, depending on the configuration parameters. So there's a lot of stuff here that I'm going to go through quickly, and then I'm going to show you some figures. You can give importance to dimensions — which dimension is more important than the others? In some applications, some dimensions should be more important than others. Then choosing whether dimension coordinates should be materialized or not — dense or sparse — building indices, considering compression, all of that, because arrays give you different chunking and different compressors, and you can abstract the engineering on top because the API is very clean, very intuitive: you have dimensions and attributes, that's it, for all the applications you can think of. Of course, it unifies the data model, it unifies the storage engine, and then you can unify access control, logging, compute. The whole stack can be built once you build the storage engine this way. That's why we're moving towards this universality idea. It's not clear to me: for all the things you listed here, are you expecting the end user to know how to set these things, or are you doing it as the vendor? Excellent question. So far, I'm saying that you have the flexibility, by building one system, to have a couple of configuration parameters which are intuitive, like tiling — I'm going to show you those in a bit. It's very similar to building a schema for a database, very, very similar. But it gives you a lot of layouts, and one is going to work for you. But for specific applications — genomics, LiDAR, SAR — we build our own ingestors. We always build an ingestor that takes the standard format, LAS for example, and builds a three-dimensional sparse array, for example, for LiDAR. For those, we have some default parameters that have proven to work beautifully. And more often than not, there is one configuration that works empirically. Of course you can tune it, but empirically this default is going to take you a long way. So for genomics it's column-major on one of the dimensions, for example; in LiDAR it's the Hilbert order; and I can go on and on. Okay, thanks. That's a great question. All right. So, by the way, an array can model data frames, right? Why? Because suppose you have this data frame with four columns; there are two ways to model this as an array, depending on what you want to do. You can model it as a one-dimensional dense vector where the implicit indices are the row indices. Why would you do that? Because you may have no idea about your workloads, and they may be very random, and more often than not you need to scan — more often than not, there is nothing that will give you something better than scanning. In that case, even in that case, you want to distribute your scans. So even in that case, you need to slice rows 1 to 1,000, 1,000 to 2,000, and dispatch them to different workers. Even that needs to be very fast: the i-th batch of a thousand rows has to be retrieved instantly by the worker that scans it. So that's the degenerate case, but it works. And this resembles Parquet very much.
Actually, this case is almost identical to Parquet — almost. Or you may say, wait a minute: from those four columns, I have stock, for example, and time — two of the columns that are used more frequently than anything else. Well, in that case, you'd better make them dimensions, because 99% of the time you're going to be slicing on those. So make them dimensions, make a sparse array, because every combination of their values is a sparse point in an infinite domain again. And then what you achieve with that is extremely fast slicing (there's a small sketch of this modeling below). And then you can continue distributing and filtering further; you can do all this magic that we know from columnar systems and warehouses and other systems. But at least this very first touch of the data is going to be rapid. This is what this gives you — like a faster index, kind of. So what else can we model as arrays? Again, I'm just going to enumerate; if anybody has any interest, I can get into details. LiDAR, SAR, population genomics, single-cell genomics — we have use cases and customers for those. Any kind of time series, obviously, right? Because if you sort on time, perfect, you can slice on time extremely fast. Weather. Graphs are two-dimensional: we model them as two-dimensional matrices, which are adjacency matrices, so we can use mathematics like linear algebra to express graph algorithms. You can be very flexible with that. Video, key-values, even flat files — the flat files are just one-dimensional blobs that you can slice and compress and so on and so forth, and I'm going to tell you a couple of use cases for this; we use it in the product. So I hope this paints a good picture of why, at least for us, arrays are not ad hoc. There is a reason why we use them: they're foundational, and they give us a lot of flexibility to order the data and slice the data. And then, if we build a powerful engine on this, then on top we can unify a lot of stuff — we don't need to rebuild it. I see a question; somebody raised their hand. This is Shubham. I have just one question: can you tell me whether there is any redundancy added to optimize the I/O of a slice, because of the slicing needs? It's a great question. Not natively, but there are use cases where we do it manually. We have a use case, for example, for AIS, for ship locations: we store the data twice with different dimensions, because there are two dominant workloads — not a thousand, two. And then, depending on the query, we direct the query to the appropriate array. This is a good question. It's manual; you need to build it upon ingestion, it's not native, so it's not configurable. This is something we want to work on — in fact, I'm meeting with Sam next week about something like that. But this is something people are looking into, and we'll be happy to see the solutions and try to integrate them into TileDB. It's a good point. The good news is that we compress everything quite a lot, so, okay, storing the data twice is not as bad as you might think. We compress by a factor of 10 — you see, for those use cases, versus what they were doing before, we're compressing by a factor of 10. So even storing the data twice, they're happy; they're going to be spending less on the cloud anyway. Okay, a little bit about the format. I'm going to go through those details a little bit fast, only because those details exist in the webinars we have. If you're interested in learning more, you can find everything that I'm covering here in much more detail.
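(Here is the small sketch promised above: a data frame with a stock symbol, a time, and two value columns, modeled as a 2-D sparse array with the two frequently-sliced columns as dimensions. Again this uses tiledb-py; the URI, symbols, and numbers are hypothetical.)

```python
import numpy as np
import tiledb

uri = "example_ticks"
schema = tiledb.ArraySchema(
    domain=tiledb.Domain(
        tiledb.Dim(name="symbol", domain=(None, None), tile=None, dtype=np.bytes_),
        tiledb.Dim(name="time", domain=(0, 2**62), tile=86_400, dtype=np.int64),
    ),
    sparse=True,
    attrs=[
        tiledb.Attr(name="price", dtype=np.float64),
        tiledb.Attr(name="volume", dtype=np.int64),
    ],
)
tiledb.Array.create(uri, schema)

# Each row of the data frame becomes a non-empty cell at (symbol, time).
with tiledb.open(uri, mode="w") as A:
    A[["AAPL", "AAPL", "MSFT"], [1_000, 2_000, 1_000]] = {
        "price": np.array([130.1, 130.4, 240.0]),
        "volume": np.array([500, 300, 200]),
    }

# Multi-dimensional slicing is the fast primitive: one symbol, a time range.
with tiledb.open(uri, mode="r") as A:
    result = A.multi_index["AAPL", 0:1_500]   # inclusive ranges per dimension
    print(result["time"], result["price"])
```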
I'm just going to be respectful of the time so that we don't exceed it. So this is a dense array — very quickly, to take a look at the format and see how we reason about it, why it's cloud-optimized and so on and so forth. It looks like this. It's a hierarchical format — it's not a single file. There's a folder with the array name, and then you have the schema here in this folder. The schema is timestamped, because we have schema evolution, so you can time travel; this is milliseconds since the epoch, for example. Then we have the fragments. A fragment is a batched write: every time you write, it creates a subfolder, which is also timestamped. Again, this is how we do versioning, this is how we do time traveling. Then you have some metadata here about the array and whatnot — small pieces of information. And then there is the data. The data is columnar, so for every attribute there is a different file, tiled, compressed, and so on and so forth. And the sparse case is very similar. The difference is here: we store the coordinates of the non-empty cells — we don't store the empty ones — in the coordinate (COO) format. So you have one file for the first dimension, one for the second, again columnar, because it leads to better compression and you can subselect — everything you know about columnar databases. There's nothing new here, we just adopt it. So very, very simple. It's cloud-optimized because everything is immutable. We never overwrite anything; everything is a new batched write. And we do a lot of optimizations around list requests, around, you know, disqualifying fragments based on some minimal index information — a lot of sophistication, but this is the main idea. All right. Also tiling: the space tile, which I'm going to show you, forms the atomic unit of I/O. Think about it like this: the array may be two terabytes, and of course we need to chunk it somehow; the chunk is called the tile. For the dense case, it looks like this, and here is just the motivation — very simple things. For example, if you say, okay, my tile is going to be four by four — and this is a configuration parameter of the schema; it's part of the schema, by the way — then if you slice these four cells, obviously you will have to fetch the whole tile. Whereas if your space tile is two by two and you happen to slice this, you're going to fetch only this piece of information. Very simple indexing information, but it works very, very well. And each tile gets compressed and filtered and so on and so forth separately. The layout is the most important thing. As I mentioned before, there are a lot of different orders. Again, I'm not going to get into the details — there is a webinar for that — but it looks something like this, guys, with three configuration parameters. Look at these space tiles: two-by-two tiles, and then an order within a tile and an order across tiles. Take all the combinations and you find yourself in a situation where you can lay out the data pretty much any way you want in the multi-dimensional space. That's the important thing. You can always do it with Parquet by sorting first on one column, then another, then another, but you're always going to be preferring the first column in that case; you're not going to be efficient on the others. But there are applications where this has to be mixed, or it has to be even. So this allows you to lay out your data much more flexibly. And in the sparse case, because we're not storing the empty cells, we do something similar. So we have an order; the order is defined with the same parameters.
But then we define a capacity — for example, here it's two and here it's four — and this creates the actual data tile. So you see, we're effectively storing only the blue cells here; we're not storing the white ones. And as indexing information, we have these minimum bounding rectangles, or MBRs. All right? So we always store the non-empty cells, but again, the layout is important because you want to retain the multi-dimensional locality on the one-dimensional medium. And we have some more complicated stuff like the Hilbert order. Again, please watch the webinar that we had a couple of weeks ago and you will see. Yeah, another question. Can you go to the last slide? Yeah. So here is just one question: let's say the user writes a byte block — do you rewrite certain blocks again? I'm not sure I understand the question, but let me explain how you write. Effectively, you give all those values to TileDB in three vectors — in this case, the row dimension, the column dimension, the coordinates, right? And then the actual values; there could be more values, so no worries about that. TileDB will sort very fast based on this order, which you have already pre-specified. You don't need to provide the data in this order — you just set the configuration parameters at array creation, and then TileDB takes care of everything. TileDB is going to sort, compress, and store the files very fast, with parallel I/O, multi-part uploads for S3 and all of that stuff. All of it is handled by TileDB. Okay, so my question is a little bit different. Let's say version one of the file is something like the left side of the slide, and then version two of the file changes, say, one white cell to blue or something. Oh, I see, I see. As I mentioned before, everything is immutable, right? So effectively what you're going to do — sorry, now I understand — I'm going back: you see this fragment here? The first write is going to create this. The second write is going to create a new one under this timestamp. And TileDB is smart enough to compose the result. And then we have consolidation mechanisms, very, very similar to LSM trees, log-structured merge trees. So everything is like a log. So if there are frequent writes, each will just be a diff which is stored in the subsequent fragments, right? Correct. If you have a lot of writes, you're going to end up with a lot of those; you need to consolidate faster, or buffer. So you need to make sure that you handle the velocity of the data one level higher than the storage engine: you buffer, you batch, you answer queries. This is going to come in the future. So far we don't have many high-velocity use cases; we don't do a lot of streaming. We can — again, foundationally, we can — it's just that we're not paying too much attention to streams right now; we're quite busy with the rest of the stuff. But the way I would do it: I would buffer internally, I would handle queries internally, and then, once in a while, do a batched write back to disk. Okay, thank you. All right. Now if you think about tile filters — again, this is common knowledge, we've done this in databases very, very well, especially the columnar databases, and we adopt it. For example, we chunk each tile a little bit further; each chunk fits in the L1 cache. Then we dispatch each chunk to a different core. And if there is a pipeline — because usually it's something like compression and encryption — then we run it as a filter pipeline, so that we increase the locality between the L1 cache and the CPU.
And we don't get a lot of L1 cache misses and all of that stuff. So all the goodness that you know from columnar databases gets adopted by TileDB. And one more thing about indexing — and this is quite important. In the dense case, we don't need extra indices. For example, if we know from the schema that we have two-by-two tiles — forget about the order for now, two-by-two tiles — and a slice comes in like this, then through simple arithmetic, no indices, fast arithmetic, I can tell that this slice includes two tiles, this upper-right one and this lower-right one. I know it. So I can go directly to the actual file, grab the tile, decompress, and so on and so forth. So it's arithmetic indexing, implicit indexing, rather than an actual index for the dense case. That's why this is super fast. So if you have an image in TileDB, it's rapid. Don't store the pixels one by one as rows in a database — that's going to be very, very slow. But in the sparse case, remember, each data tile has a minimum bounding rectangle, and we store that. Then, bottom-up, we create an R-tree. R-trees, although they don't have asymptotic guarantees, are extremely fast in practice — we use them all the time. They're bulk-loaded; we don't have insertions, deletions and whatnot, only because we write as a log, right? We write a new batch at every timestamp. So these are immutable and they work perfectly. By indexing, you mean to go find tile 12? Not like finding the tile that has this value in it? That is correct. Usually the query is a slice; it's a range query. And based on this range query, we can use the R-tree to navigate to the appropriate tiles that have candidate results. Then, of course, we need to bring the tile back, decompress, filter further — and this is what we do, with vectorization and other kinds of stuff. So it's a filter query, effectively, but you have an index, so it's faster. All right. Looking a little bit at the time — Andy, I have this idea: I can go through the features very quickly, because there are not many more technical details there, it's mostly feature stuff, and then I can spend just one or two minutes on the future of data management in my mind — again, a couple of thought-provoking arguments there — and then we can wrap it up. All right. So this is how we build this, so that you can also see the products, open source versus closed source, and so on and so forth. These are the backends; this is what we support today, and it will extend indefinitely — if there is another one that somebody uses, we will add it. We have seven APIs, all super optimized. We have integrations with Spark, Dask, Presto, you name it, MariaDB — we have a lot of integrations, efficient integrations. And we have TileDB Embedded, which is the open-source storage engine, and it will always be open source. Then we have TileDB Cloud, which is the serverless platform. It has the access control, it has the database stuff. This is what you would use for enterprise-grade data management, or if you want to share stuff in the cloud — and the machines, all of that, are taken care of by us. You can find TileDB Embedded on GitHub here. This is what does all the slicing, all the compression, the schema creation, all of that. It works on the cloud and all of that stuff. It's columnar, fully parallelized, built in C++. It has rapid updates and versioning. It's lock-free — that's important — and eventually consistent, which is more than enough for the use cases we are targeting for now. Extreme interoperability, as I mentioned before, and it's very optimized for cloud object stores.
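(A rough sketch of the space tiles, the immutable timestamped fragments, and time travel described above, again with tiledb-py. The URI, the values, and the way the timestamp is captured are only illustrative.)

```python
import time
import numpy as np
import tiledb

uri = "example_image"
schema = tiledb.ArraySchema(
    domain=tiledb.Domain(
        tiledb.Dim(name="row", domain=(0, 3), tile=2, dtype=np.int64),
        tiledb.Dim(name="col", domain=(0, 3), tile=2, dtype=np.int64),
    ),
    sparse=False,
    attrs=[tiledb.Attr(name="pixel", dtype=np.uint8)],
)
tiledb.Array.create(uri, schema)

# First write: one immutable, timestamped fragment under the array folder.
with tiledb.open(uri, mode="w") as A:
    A[0:4, 0:4] = np.zeros((4, 4), dtype=np.uint8)
t_first = int(time.time() * 1000)  # ms since epoch, roughly after write 1

# Second write: a brand new fragment; nothing is overwritten in place.
with tiledb.open(uri, mode="w") as A:
    A[0:2, 0:2] = np.full((2, 2), 255, dtype=np.uint8)

# A normal read composes both fragments; the 2x2 slice touches a single
# space tile, located by arithmetic rather than an explicit index.
with tiledb.open(uri, mode="r") as A:
    print(A[0:2, 0:2]["pixel"])          # the 255s from the second write

# Time travel: open the array as of (roughly) the first write only.
with tiledb.open(uri, mode="r", timestamp=t_first) as A:
    print(A[0:2, 0:2]["pixel"])          # still all zeros

# Consolidate fragments (LSM-style merge) and clean up the old ones.
tiledb.consolidate(uri)
tiledb.vacuum(uri)
```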
So this is the open-source part. And TileDB Cloud can work anywhere — of course, right now it's on Amazon, but you can deploy it on-prem if you want to. You can go here to check it out. It's completely serverless: UDFs, slicing, SQL, you name it. We have a lot of sophistication around computations. And we log everything, we share everything. You can launch Jupyter notebooks so that you can do analysis a little bit more easily. It's geo-aware: we have multiple regions, and if a request comes for an array in Singapore, we send the compute to Singapore. So we have all of this stuff automated. It's secure — we are under penetration tests, compliance, all of that stuff. But here's the cool thing about it: everything in TileDB Cloud is an array. So all the data, but also machine learning models, Jupyter notebooks, dashboards, UDFs, code — code is data in TileDB Cloud; you store it in an array. And what this gives you is that everything becomes shareable, everything becomes logged, everything becomes versioned — everything in there. And everything becomes monetizable, because we have this integration with Stripe and everything. So you can sell a notebook, you can sell a UDF, you can sell data, of course, and so on and so forth. This kind of recursive architecture helped us a lot. That's why I said even blobs, even flat files, we can store as arrays — and then you inherit everything we build on top. And regarding the compute model — again, very quickly, I don't want to spend too much time — we are building on task graphs. So we're not constrained to operator trees for SQL; operator trees are task graphs. But we also expose those — that's the important thing. We expose the task graph building and execution so that you can build anything. You can build your own group-by query entirely, and it's going to work. We take care of everything — the logging, the warnings, whatever. You can share those. It's completely flexible: you can build anything you want, entirely, but everything is treated as data for a database. That's the big difference from everything else out there. It's very extensible. It's built in C++, but as I said, there are multiple APIs. But the important thing — and this will pave the way to my next slides — is that we realized that in organizations, although they will need group-bys, they will need linear algebra, they will need some specific operations, it's really unpredictable what they want to build. They want to build statistical models, they want to build their own machine learning stuff, they want to build things such that you cannot accommodate them with a single system. Again, the database is universal in terms of versatility and what it can handle, but you cannot customize it for every single use case, because it's completely unpredictable. But if you give them the platform and you expose everything, you log everything, you provide the security, the sharing and all of that, then they can build their own stuff, they can collaborate, they can share, and this can grow exponentially. When you say a good bet is to build as much as possible in C++, is that how TileDB is doing this, or do you assume other people are going to do this? This is about the recipe for building a universal database. This is what we did: a good bet that we placed was to build everything in C++, only because it's super extensible. You don't assume data scientists are going to be able to pop open GDB and debug this stuff? No, no, absolutely not. This is about building the database.
We're saying that for the core — and it's not only us; we're not advocating a monopoly — if you want to build a database, a universal database, start with C++. It's a good bet because of the interoperability. But then expose Python, R, Julia — expose all of that — because for the users it's exactly the opposite: you can't expect those people to build in C++. They don't have time for this. They can do it, but they don't want to. So, seeing how — I know you want to get to the good stuff — but seeing how Rust has matured in the last couple of years, would you still make this argument that C++ is the way to go? This is quite a personal opinion; I don't want to bias things too much. It's just that it's a good bet if you want to be very extensible. But if there are other languages like Rust — sure, or even Go — that are getting a lot of traction and can interoperate, it's all good. Performance and interoperability — anything that gives you performance and interoperability. No, I'm not going to just advocate C++. C++ is a good bet, but anything with performance and interoperability works. Okay, keep going. All right, so what is the recipe? And this will close this chapter. For us, the way we build it is: the array model, plus generic task graphs fully exposed so that the users can do whatever they want, plus extreme interoperability in order for it to be future-proof — like, future-proof. That's the recipe we follow. So you can see now that this is very different from the object-relational model or NoSQL. This is our recipe; that's what we do. All right, so there are tons of use cases I'm going to skip. Again, these are covered, and they're going to be covered in future webinars: tables, machine learning, a lot of geospatial, a lot of genomics, marketplaces, and collaboration communities for all of those use cases. But the last slides that I'm going to show are about, you know, how I see the future — or at least how I want it to be. This is not what's going to happen; this is what I would like to happen, okay, so that we're clear. So one of the predictions I have is that if universal databases are proven to work, if TileDB is successful, then they will subsume warehouses and lakehouses. This doesn't mean that the warehouses and lakehouses are going to die; it means that the warehouses and lakehouses will become universal databases. That's what I mean, so that we're clear. They will adapt to the new game and they will try to do the same thing, because it's proven to be successful — right now, we're trying to prove it to be successful. And the reason is that universal databases can do everything that the warehouses and lakehouses can do. It's as simple as that. Everything. There is a lot of engineering, please don't get me wrong — tons of work to be done; we're building the team, a very capable team, features every day. It's a lot of work. But foundationally, there is nothing that the warehouse or the lakehouse can do that the universal database, the way I defined it, cannot. That's what I'm saying here. But the way I view the future — or at least the way I want to view it — is as follows. First of all, I don't want us to start building a new system for every single twist. Right now, you know, you invent a new hash table, you create a new database. This is very incremental; we're just creating this mess that I mentioned before. So no more — not a new system for every new index or for every new twist on the database that you have.
What you should do instead is build components of a universal database — build the universal database and then build components, monetize them if you want, license them. That's fine. I'm okay with licensing, I'm okay with open source too, but you don't need to build another and another system for every single incremental idea. Eventually, I'm really hoping that even the term universal is going to disappear. I find it quite obnoxious, if you ask me, and very audacious, but I'm trying to make a point. So I'm hoping that in the future all databases will be universal by default. That's a database. You're not going to say universal database; you're going to say database, and that's it — it's going to be able to cover all of that stuff. And I'm really hoping that all of us will spend our energy on the science part or the analytics part and not on unnecessary engineering. Engineering is great, but not unnecessary engineering. That's my hope. And then the other hope is that we will build a massive collaborative data community. By collaborative, I mean that you should be able to share the data, you should be able to share runnable code, you should be able to share insights — everything should be collaborative. Up until now, nothing was collaborative. We had the monoliths, and each organization was building its own environment for this monolith, and that's it. There are a couple of initiatives now — you can see a couple of warehouses that are trying to open up — but still, you have a deployment that needs to happen by an organization, which then maintains this deployment for the entire world, and this doesn't scale. So the deployment needs to be pushed to the database company. The database company should manage the architecture for the whole world to collaborate. You cannot expect a big hospital or a telecom or whatever it is to build an infrastructure to support millions of users, to share data, to share insights. So: build a massive collaborative community, and also enable brilliance. So far — Andy, correct me if I'm wrong, you've been in the space forever, super successful, but correct me if I'm wrong — the focus of databases, even when we write papers, is on us: how brilliant we are to build another join, another group-by, another OLTP or OLAP engine. It's all on us. The focus should shift to the users. We need to allow the users to do great stuff — to create the new machine learning models, to create new insights, to share insights. The focus should be the users. So if you ask me, the future of data management is the user. And that's what we're building. We're not building the whole database with a focus on the database operators or the linear algebra operators, although we will build them. The focus is on user-defined functions, on exposing task graphs, on logging everything, on sharing everything. We're building a platform to enable the users to shine. So, these are some more resources: follow us on Twitter, check out the website, docs and blogs, and you can find pretty much everything. And with that, thank you very much. We're a little bit over time, but I think we did a good job here. Okay, thank you so much, from me and everyone else. So we have time for maybe one or two questions from the audience, depending on how fast they are. If you have a question — Demetrius, you have your hand up, go for it. Hi, thank you so much for a really nice talk. Just a question regarding the versioning.
So when you are versioning, do you preserve a copy of the previous array and then recreate the new array? Or do you find the differences and update the metadata associated with the array? How does it really work in that respect? Yeah, this is a great question. Thank you, because I'm going to touch upon something that we don't do at this point. It's on the roadmap, but this needs to be said. So, the way I showed the format and everything, what TileDB does is it writes the differences. It never updates in place. In-place updates are not supported, although we will do them; it's not a fundamental issue. For the majority of our use cases, the writes are kind of appends. Not appends at the end, but more data being written to the array, even simultaneously, no problem, but always new data. It's upserts, effectively. It's actually insertions, insertions that either override or create duplicates. This is what it is. So effectively the user writes just the new data; it's not a diff, it's new data. Then when a read comes, it takes into account all the fragments, all of them. It's going to either give you duplicates or it's going to override based on the time and based on the schema, but effectively everything is considered. And then you can time travel: you can say, okay, I want to see only what happened last, or what happened in the middle. You have a lot of flexibility. Updates and deletions are not supported just yet, only because we wanted to get a lock-free architecture right on the cloud with eventual consistency. That is the biggest requirement for our use cases. They do massive ingestions, massive, and nothing should get corrupted. So we're all for atomicity, we're all for recovery, we're all for eventual consistency, because that's exactly what the use case is about. You're going to see updates and deletions very soon, but you're going to see the update implemented as a deletion followed by an insertion, very similar to LSM-tree-like systems. Thank you. Thank you very much. Okay, I have a lot of questions; I don't think we have time for them all. I guess: what does the execution engine look like in TileDB? How do you support different hardware? How do you actually execute queries? Are you using SIMD, or are you doing query compilation? Is your thread scheduling using Intel TBB, or whatever it's called now? What does the engine itself look like? So right now, let's split it into two parts. We have the core C++ library, which essentially has its own execution engine, but only for slicing: slicing, filtering, decompressing, all of that stuff. It's a query, but it's a slicing query. And then we have the task graph infrastructure for TileDB Cloud. So one part is in the open-source storage engine, TileDB Embedded. And then we have the task scheduling, which gets spot instances from Amazon, deploys Kubernetes pods, has retries, all this infrastructure. So it's mostly management of tasks, and a task could be anything, mostly Python and R user-defined functions. That's what we do on the cloud. Now, on the engine, we used to use TBB threads; we don't anymore. Now we use STL threads. We had some issues with TBB. Now we have new members joining the team advocating for TBB threads again. I remember there were some technical difficulties and we made a decision to switch away from TBB threads.
I'm not going to be surprised if we bring them back, to be honest. And that's why I don't want to overemphasize this; we may go back to it, and we have some people advocating for it. Vectorization happens at compile time at this point; we're trying to help the compiler vectorize the vectorizable code. We don't do anything hardware-specific at the moment. Everything is multi-threaded. We use threads everywhere: compression, filtering, result estimation, you name it, a lot of different stuff. So it's quite basic on the storage front, only because we have limited computation capabilities there; we delegate to MariaDB, to PrestoDB, to the engines we integrate with. And then on the platform, it's all about task graphs, task graph scheduling, resources, elasticity, more DevOps kinds of things, which are quite important at this point, and less about building the actual operators on TileDB Cloud. That is what's coming up very soon. So you said you delegate things to Presto and MariaDB. What does that look like? You have a plug-in connector to talk to TileDB, and TileDB is just serving data and doing some basic predicate pushdown? Exactly. We push down as much as possible. Currently, we push down filters on dimensions and attributes. Dimensions are great, they have always been great, because they massively shrink the space. But very recently we started pushing down attribute filtering as well, and we're seeing some massive speedups. Again, because there is less data copying, it's as vectorized as possible, things like that. So we saw massive speedups. So it's not that you support SQL directly; you support SQL through Presto and MariaDB, so you're relying on their query optimizer, their planner, to make all those decisions. For now, for now. They expose the tree, the AST, to us, and then we carve out whatever we want. And we will take a bottom-up approach. We're not building a SQL engine, only because we are doing multiple different things in other domains as well; we wouldn't have the resources to just focus on SQL. So what we do is use MariaDB or others, and then we start pushing down more and more primitive computations, because this is where we see the speedup. So you're going to see group-bys, you're going to see some joins, you're going to see some stuff happening in the engine. If you're doing joins, then you need a cost model, right? You will see. You will see. Not yet, but yes, soon, of course. And the reason we can do it is the cost model, because we have all the statistics; we have past statistics and a lot of other information, so we can play a little bit with their query optimizer there. Actually, my last question then: it's so hard to do cost model estimation for queries on a relational database, which is two-dimensional. Is it harder when it's multi-dimensional, or is it the same kind of problem? Same kind of problem, unless I'm proven vastly wrong, only because we use an index: the dimensions, whether there is one of them or four, form an index. And based on that index, we already do this internally for certain things. We have estimators and they're pretty good. For example, we can estimate how many tiles we're going to touch. That alone is massive. That piece of information, how many fragments we're going to touch, how many tiles, is computed in memory; we don't need to go to S3. That alone is such a big piece of information that it allows us to gain a lot of performance.
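Going back to the versioning answer above, here is a hedged sketch of how the write-new-fragments and time-travel behavior looks from the TileDB Python API; the URI and the timestamps are illustrative assumptions, not something shown in the talk.

```python
# Sketch: every write creates a new immutable fragment; reads merge all fragments,
# and opening at an earlier timestamp "time travels" to before later writes.
import numpy as np
import tiledb

uri = "versioned_example"  # hypothetical URI
dom = tiledb.Domain(tiledb.Dim(name="d", domain=(0, 9), tile=5, dtype=np.int32))
schema = tiledb.ArraySchema(
    domain=dom, sparse=False, attrs=[tiledb.Attr(name="a", dtype=np.int64)]
)
tiledb.Array.create(uri, schema)

# First write: one fragment at (logical) timestamp 1.
with tiledb.open(uri, mode="w", timestamp=1) as A:
    A[:] = np.zeros(10, dtype=np.int64)

# Second write: another fragment at timestamp 2 overlapping the first five cells.
with tiledb.open(uri, mode="w", timestamp=2) as A:
    A[0:5] = np.arange(5, dtype=np.int64)

with tiledb.open(uri) as A:               # a read considers both fragments; newer cells win
    print(A[:]["a"])                      # [0 1 2 3 4 0 0 0 0 0]

with tiledb.open(uri, timestamp=1) as A:  # time travel to before the second write
    print(A[:]["a"])                      # all zeros
```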
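On the task-graph side, the talk only describes the idea (generic graphs of Python and R user-defined functions, scheduled with retries on cloud infrastructure), so the toy below is deliberately not TileDB Cloud's API; it just illustrates the shape of a two-node graph of Python UDFs, with hypothetical function names, using only the standard library.

```python
# Toy illustration of a two-node task graph of Python UDFs (NOT TileDB Cloud's API):
# an "ingest" node feeds a "summarize" node, executed here with a local thread pool.
from concurrent.futures import ThreadPoolExecutor

def ingest(values):
    # Hypothetical UDF: pretend this slices an array and transforms it.
    return [v * 2 for v in values]

def summarize(values):
    # Hypothetical downstream UDF that consumes ingest()'s output.
    return sum(values) / len(values)

with ThreadPoolExecutor() as pool:
    ingest_future = pool.submit(ingest, [1, 2, 3, 4])
    # The dependency edge: summarize runs on ingest's result.
    summary_future = pool.submit(summarize, ingest_future.result())
    print(summary_future.result())  # 5.0
```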
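Finally, to illustrate the two kinds of pushdown mentioned in the last answer, here is a rough sketch with the TileDB Python API: a dimension slice (which lets the index prune tiles) and an attribute filter evaluated inside the engine rather than in Python. The query-condition string syntax assumes a recent version of the Python API, and the array name and data are made up.

```python
# Sketch of dimension vs. attribute pushdown on a sparse 1D array.
import numpy as np
import tiledb

uri = "pushdown_example"  # hypothetical URI
dom = tiledb.Domain(
    tiledb.Dim(name="pos", domain=(0, 999), tile=100, dtype=np.int64)
)
schema = tiledb.ArraySchema(
    domain=dom, sparse=True, attrs=[tiledb.Attr(name="score", dtype=np.float64)]
)
tiledb.Array.create(uri, schema)

with tiledb.open(uri, mode="w") as A:
    coords = np.arange(0, 1000, 10, dtype=np.int64)   # 100 cells
    A[coords] = {"score": (coords % 100) / 100.0}     # scores cycle 0.0 .. 0.9

with tiledb.open(uri) as A:
    # Dimension pushdown: only tiles overlapping [100, 200] are touched.
    dim_slice = A.multi_index[100:200]
    # Attribute pushdown: the filter runs inside the engine, not in Python.
    filtered = A.query(cond="score > 0.5")[:]
    print(dim_slice["score"].size, filtered["score"].size)
```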