The Carnegie Mellon Quarantine Database Talks are made possible by the Stephen Moy Foundation for Keeping It Real and by contributions from viewers like you. Thank you. Hi guys, the pandemic rages on, but the election is over, so that's super awesome. Today we're excited to have Todd Pearson from ARADB. Todd is the CEO and co-founder of ARADB, where he's been working on this since 2019. Prior to that he was the CTO and co-founder of InfluxData, the creators of InfluxDB. So as always, if you have any questions for Todd while the talk is going on, please unmute yourself and say who you are and where you're coming from. And as always, we want to thank the Stephen Moy Foundation for Keeping It Real for sponsoring this event. Okay Todd, the floor is yours, go for it.

Awesome, thank you for having me. So this talk is called Designing Systems for Cardinality and Dimensionality. As Andy mentioned, I'm Todd, co-founder and CEO of ARADB. Happy to be here. Let me get PowerPoint up. There we go. Quick agenda: I'll go into a little bit of history and background, talk about some of the stuff that I saw, learned, and did at Influx, how it informed some of the stuff we're building now, and some of the things that we think about as ideal database properties that we want to be focusing on. Then I'll talk about what ARADB is a little bit towards the end. I don't want to make it too focused on pitching the product; it's more about the philosophies that we've evolved over the last year and a half. And then I'll talk just a little bit about future goals, things that we're still working on, and stuff that we want to try and tackle as we grow.

So, a little bit of history. I previously co-founded InfluxData, and when we started the company, we weren't really planning to build a time series database. That wasn't the goal. We were actually building a SaaS tool for observability and dashboarding, and we built a time series database as part of that SaaS platform. So InfluxDB actually came out of a lot of the work that we did to build a product that was going to need to leverage a time series database, because we found issues with a lot of the things that we tried looking at. This was an era when most people were still using Graphite for observability. There was no Prometheus yet. Datadog was being built around the same time we were building this. So there was kind of a lack of tools in the space. We essentially pivoted away from the SaaS product and started working on InfluxDB itself in 2013. Initially it was a bunch of Scala web services on top of Cassandra; that's kind of where we had started. By the time we open sourced InfluxDB we'd rewritten it in Go, and we built the storage engine on top of LevelDB as the starting point, which gave us some room to experiment and figure out what we needed. We ran into some growing pains with LevelDB and switched to RocksDB. And, you know, it got us relatively far, but we realized after a little while that we were going to need to build our own storage engine. And Andy, I think my previous co-founder Paul Dix did a talk about the Influx storage engine here, probably in 2017, but it was a while ago. So we started building the TSM storage engine in 2015.
And it's gone through a lot of development, a lot of productionization, but the design of that storage engine is largely still pretty similar to the way it was in 2015. It hasn't really evolved a ton since then. So, a little bit about that storage engine, because seeing the development of it, the problems that we ran into, the things that did well, and the things that did badly informed a lot of what we're looking at now and what we want to do with ARADB.

The way that the storage engine works is it's kind of like an LSM tree. Each individual time series is a column on disk. There are blocks of those, but each time series is essentially identified by a set of tags, a set of string key-value pairs, that represent a unique series which can be targeted with a precise set of tags. And then for each field that you specify, and I don't want to get too much into the syntax of InfluxDB, but essentially you can write data that's either a boolean, a float, an integer, or a string as a field that's identified by a given tag set. Those are stored in compressed blocks that are basically a column-oriented data store. That works pretty well and gets decent performance. The query language allows you to specify sets of series to either query on or do aggregations on by using an inverted index. So essentially each tag has a set of values that have been written, and for a given value of a given tag, that inverted index tells you which series match. As you're running queries, that inverted index tells you which columns need to be fetched from which blocks for whatever time ranges, and then it can pull those back. It performs relatively well, but we saw a few things as we started to get to larger and larger data sets that made it a little bit tricky for certain use cases. Incidentally, if anyone's looked at the storage engine for Prometheus: originally Prometheus was in-memory, but the current Prometheus storage engine is very similar. They took a lot of the same design principles but simplified it so that it only stores floats. There are no bools or strings in the Prometheus storage engine. But if you've looked at that, it's very similar in the way it's structured on disk.

So, some of the things we saw with InfluxDB as people started to use it at larger and larger scale. The ingest performance was good, but it wasn't great. There were actually a handful of design choices that made it difficult to scale out. One of those was that in the middle of the write path there was a write-ahead log that fsyncs to disk for durability. But there was essentially a hash map where each series ID, or each series key, the set of tag key-value pairs, was an entry in that in-memory WAL. That in-memory portion was used for querying, but it created a hotspot for writes, so there was actually a lot of contention around it. At a certain size those entries would get flushed to disk, and then they'd get compacted into the TSM storage engine.
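Just to make that inverted index concrete, here's a rough sketch of the kind of structure I mean. This isn't InfluxDB's actual implementation or ours, just a minimal illustration assuming series are identified by integer IDs and tags are string key-value pairs:

```rust
use std::collections::{HashMap, HashSet};

/// Maps (tag key, tag value) -> the set of series IDs that carry that pair.
#[derive(Default)]
struct InvertedIndex {
    postings: HashMap<(String, String), HashSet<u64>>,
}

impl InvertedIndex {
    /// Register a series and the tag set that identifies it.
    fn add_series(&mut self, series_id: u64, tags: &[(String, String)]) {
        for (k, v) in tags {
            self.postings
                .entry((k.clone(), v.clone()))
                .or_default()
                .insert(series_id);
        }
    }

    /// Series matching a single tag=value predicate.
    fn lookup(&self, key: &str, value: &str) -> HashSet<u64> {
        self.postings
            .get(&(key.to_string(), value.to_string()))
            .cloned()
            .unwrap_or_default()
    }

    /// Series matching all predicates (intersection), i.e. which columns/blocks to fetch.
    fn lookup_all(&self, preds: &[(&str, &str)]) -> HashSet<u64> {
        let mut sets = preds.iter().map(|(k, v)| self.lookup(k, v));
        match sets.next() {
            Some(first) => sets.fold(first, |acc, s| &acc & &s),
            None => HashSet::new(),
        }
    }
}

fn main() {
    let mut idx = InvertedIndex::default();
    idx.add_series(1, &[("host".into(), "server-01".into()), ("region".into(), "us-west".into())]);
    idx.add_series(2, &[("host".into(), "server-02".into()), ("region".into(), "us-west".into())]);
    // "Which series (columns on disk) match region=us-west AND host=server-02?"
    println!("{:?}", idx.lookup_all(&[("region", "us-west"), ("host", "server-02")]));
}
```

For a query like region=us-west AND host=server-02, the index intersects the posting sets and hands back exactly the series whose column blocks need to be fetched.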
But it actually made things really difficult, because that per-series entry in the write path created a lot of contention when you had either a lot of series or a few series that were getting written heavily. And then, as I mentioned earlier, we wrote InfluxDB in Go, and garbage collection and some other memory overhead characteristics of the language created quite a bit of difficulty when we started to get to larger numbers of series and larger write volumes.

Another issue we ran into comes from the way the concept of tags works. It was a relatively simple rule: if you write something as a tag, it becomes part of the series ID, essentially, and that's how data is indexed. So you have a measurement and a set of tags that are associated with it, and if you write an attribute as a tag, it's part of the global index; if you don't, it's a field name that represents a column stored on disk. That was decent and worked well for a lot of use cases, but having only a single way of indexing data was actually a big limitation. There was no way of either building secondary indexes or offloading things from the primary index when you couldn't fit all the index contents in memory. Another problem is that as you started to tag the series you were writing, you kind of had to decide at write time: is this attribute truly a tag that needs to be indexed, which is going to add to the size of that global index, or is it a field that I want to store in a column-oriented format that I can query back? You can still filter against a field, you just can't filter against it with an index. So as an end user, as you're writing data, you have to make that decision: you either write it as a tag or you write it as a field. And it's a pretty significant difference in terms of performance. If it's a tag, indexing is fast and easy. If it's a field, you essentially have to scan through those compressed blocks when you want to filter on it with queries.

And the ultimate downside there was that as you make these decisions and say, okay, this is a tag, it's going to be indexed, then if you change your mind, it's very painful. This is permanently baked into the TSM files as you start to write the data. There's no easy way to say, oh, this thing that's a field, I actually want it to be a tag, or vice versa. And if you write something as an int and it later becomes a float, you end up with essentially this split-brain problem where in some places it's an int and in some places it's a float, and you can have a lot of type collisions. I think there were even some things we had to add to the query language to say, I sort of accidentally wrote this in two different ways, and I actually want to select the field that is a float, not the one that's an int. So you had to work around some of these problems. But essentially the underlying problem here is that the data that you want to store and compress, and the metadata, the tags about the series that you want to be able to filter on, are very highly coupled: they coexist in the storage format, and there's no way to modify that.
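Just to make that tag-versus-field decision concrete, in InfluxDB's line protocol the same attribute written both ways looks roughly like this (the names and values are made up). The first line writes status as a tag, so it becomes part of the series key and the global index; the second writes it as an integer field, so it lives in the compressed column blocks and can only be filtered by scanning:

```
http_requests,host=server-01,status=200 latency_ms=12.3 1600000000000000000
http_requests,host=server-01 status=200i,latency_ms=12.3 1600000000000000000
```

Same data either way, but once it's written, moving between the two means rewriting the files on disk.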
And the biggest negative is that in a lot of cases, because these are indexed as a single global index, if you happen, as a user or even as an automated system, to start writing data for a particular tag that has a ton of values, those values all contribute to the size of that global index. Maybe you write a tag that has 10 million unique values. Well, now you've written a bunch of data that takes up space in memory, and you have no way of getting rid of it. So you can actually blow up the indexes on an InfluxDB machine and then make it very difficult to get it started back up again. There's not really any easy way to work with that data once you've gotten to the point of having done something bad that you wish you hadn't done. So anyway, there are a lot of these things, and they all have their solutions and workarounds, and you can introduce some tooling, but none of these were easy. They all added up to be a painful user experience for a lot of use cases.

So, kind of summarizing all those things: what did we learn from that? What are the things that I took away as I left and was thinking about what I wish we had done a little bit differently? High cardinality data is hard to work with. I'll talk a little bit more about cardinality and dimensionality, but essentially those tags that I was describing, where you could potentially have an attribute that has a lot of unique values: to put a rough estimate on it, once you have a tag with more than about a million unique values, you're definitely looking at a high-cardinality attribute. And that becomes very difficult with InfluxDB, and a lot of other systems as well, because it's trying to index something that has a lot of possible values. Metadata management is not easy. In a lot of systems you end up with the metadata and the data coupled, stored together in the same system, and that can be very difficult when you want to extract the metadata away from the data and be able to manage it or work with it without having to change the underlying data. In InfluxDB's case, being able to convert between tags and fields was basically impossible without doing complete rewrites of the files on disk. And InfluxDB sort of bills itself as being schemaless, because you don't have to define a schema, but you have this situation where once you write data, a schema is sort of decided for you; it's a write-time schema. And there are a lot of unintended restrictions that come along with that, especially in systems where there's no human in the path of the data, no human looking at the data that's going in and deciding whether this is the right type or should be a different type. More and more systems collect data automatically; attributes can get added without users or developers realizing it, and you end up with schema decisions being made for you without realizing that they're being made. And as an extension of that, there's working with data when you don't know the shape of it or the nature of it, and this applies to a lot of systems that deal with observability and things like that.
You might be collecting data, whether it's through scraping or just through ingesting data that's coming in, and you're not necessarily looking at it all. So being able to store that data and do meaningful things with it can be very difficult when you don't really know what it looks like. And I think that's happening more and more as data volumes and data complexity increase. And then, even looking at the basic hash map that InfluxDB uses in the middle of the write path: at a certain size, traditional index structures can struggle to either perform or fit within the constraints of a given memory profile. That obviously depends a lot on how you implement them and what language you're using, but there were a lot of times where doing things that should have been relatively simple became hard, either because of Go's garbage collector or because of the implementation decisions that had to go along with it. So at a certain scale, not everything works as well as it does when you're looking at things in a narrower scope.

So basically, as I left (I mentioned I went back to Pivotal for a while, working on some observability systems there), I started thinking a little bit more about how to take some of these problems we saw with InfluxDB and solve them better. Obviously there are things like sharding and strategies like that that can work for dividing up some of these problems of scale, but I wanted to find better ways to solve them in a bigger way, not just dividing the problem up, but actually trying to find some novel solutions. And that's what led to us starting ARADB. So my co-founder Robert and I started talking about what properties we want to find and identify in a database. What are the things that we wish were just solved? And then, on top of that, you can build for a lot of different use cases.

We essentially said, looking at the way that users approach high cardinality data, where maybe they're not even aware that it is high cardinality data or that the number of values for a given tag is going to cause difficulty in their system, we felt that should be something that just gets handled natively. It shouldn't require user intervention to manage well, or require users to reduce the complexity of the data or think about whether something should be a tag or should be a field. There should be a way to store all that data and still make it possible to query it well. And then, as a parallel to that, we thought it should also be able to handle high-dimensionality data. I'll talk a little bit more about this on the next slide, but essentially data that has a lot of columns or attributes, not just 100 but thousands or tens of thousands or hundreds of thousands, should be something that users can work with effectively as well. And we wanted to try for something that was truly schemaless, where even if you wrote data of two different types for a given field, maybe the data evolved, maybe it was a mistake, maybe there was just some temporary period of time where data was written in a different format.
We wanted to let that data actually be collected without the write being rejected, but also make it possible for that data to be effectively queried and explored. Just because you have data whose type changes over time doesn't necessarily mean that it was wrong or bad, so there should be some way to work with it. Then there's being able to scale to large data sizes: terabytes just aren't that big anymore, and we wanted something that we could see scaling to petabyte-sized data sets. And then, fundamentally, having seen a lot of systems out in the wild, we wanted something that could be relatively straightforward for users to deploy and manage. We've talked to a lot of people running things like Elasticsearch at scale, like 40 or 50 nodes, and just seeing the amount of time and resources required to keep those systems online was something that we felt should be a solvable problem.

Getting back into the first two a little bit, I wanted to come back to these two things. They're a dual of each other: cardinality and dimensionality. Cardinality, just to recap, is the number of unique values for a given dimension. Say we're collecting metrics for a bunch of different VMs. Take the IP address: if you choose to index that, and you have 100 different IP addresses for a given deployment, that would have a cardinality of 100. Then as you start to add other tags, maybe there are three services running on each of those IP addresses. Maybe there's MySQL, a mail server, and, I don't know, some sort of disk metric or something like that. So if the service dimension has a cardinality of three for each IP address, now your cardinality is 300, because each one of the services will be paired with an IP address for the data that gets reported. And you can see that as you start to add more dimensions and more individual systems, say you have 32 CPU cores and each one of those gets reported independently, well, now just for your CPU you've multiplied the cardinality by 32. That happens a lot in observability systems, and in other systems in general, whether it's industrial IoT or even distributed tracing or logs: the attributes added to this data grow and grow. So we see a lot of systems where, without really realizing it, you've gone from what several years ago would have been a relatively constrained cardinality dataset, and now, as you get the multiplicative effect of all these different attributes, you realize that you're reporting millions or tens of millions of unique time series under the hood, if that's the kind of data you're collecting. And it becomes a larger problem because, depending on how you index this and how you want to query it, you end up with a very large set of possibilities for the data that you want to be able to filter and explore on. That's cardinality in a nutshell. And then dimensionality, as I mentioned earlier, is the number of distinct dimensions. In my other example, if it's just IP address and service, maybe you only have two dimensions. But more and more, we're seeing higher dimensionality datasets as well. And higher dimensionality can lead to higher cardinality.
It doesn't necessarily have to, but we see these two things continuing to grow both on their own, and they also have an interesting interplay that I'll talk about a little bit more in just a bit.

So, talking about high cardinality: some ways we can think about it, some ways we can solve it, how it feeds into our problem set. One of the things that became a problem, and this is less to do with the pure question of how you store an index for this, how much memory that index takes up, or what the inverted index looks like: fundamentally, when you have a high cardinality dataset, think about InfluxDB's storage format. Like I said, it's a column-oriented format. Say you have this inverted index and you can use it to find the series that you want to look at or filter on. If it's 100 series, that's cool. You can go find your 100 compressed blocks, pull them back, and do whatever aggregation or other operation you want to do on them. But as the number of series you pull back starts to grow, it becomes harder and harder to actually retrieve them. As a thought experiment, let's say the cardinality of your system is, I don't know, maybe 100 million, and you want to pull back 1% of those series, so a million distinct series. Well, each one of those exists as an independent column on disk. Some of them may be in the same blocks, but within those blocks you have to jump to each of those series to find them. So let's say each of those series requires one seek for retrieval. And with SSDs, I know there's not literally a seek, but there's still an access time. So let's say for an SSD you can do it in a tenth of a millisecond per series and jump to each one to pull it back. That's still 100 seconds of bare minimum retrieval time. And as you start to think about just naively placing these columns in blocks on disk, it actually becomes performance limiting. You really get stuck retrieving the data because you have so many of these unique columns on disk. And with the way InfluxDB's storage engine is designed, there's really no way around this. So even if you have data that's indexed, as you start to pull back more and more series, it still just becomes very slow.

So one of the things we started looking at is: in the worst case, there's really nothing you can do about this if you're just using a generic column store. But what if we think about some ways that we can actually group data together? We started looking at using some machine learning classification and saying, what if we find patterns in the data that make some of these series or data groupings similar? Can we co-locate data so that maybe you're pulling back a set of a thousand related entries at the same time? It's still early and still something that we're looking at, but this is one way that we wanted to approach cardinality: what can we actually do to make some of these retrieval problems a little bit easier, or a little bit more tractable, with the hardware that we're using? And so, kind of jumping off of cardinality, I want to talk about... yeah?
Before you get away from that one: so the idea is, you're assuming that people are running dashboards where the queries are repeated, and maybe just the parameters that are used to look for different ranges vary. So do you then find clusters of columns that are used together and pack them together into the same set of pages?

Yep, that's right. And some of it can be just purely looking at the data, like what data looks similar, and some of it can be looking at the queries: can you build patterns on the queries that users are running? So there are two sides to it. It's still an area of active exploration for us, but we found some benefits that can be had by grouping data together and making it easier to pull back a larger chunk of data rather than a bunch of smaller chunks that are less tightly coupled.

So I buy that: okay, I can co-locate columns that are used together in the same pages. That's a known thing to do. What do you mean by the data itself looking similar? Is it because you can then compress it better, or what are you trying to do with that?

So, looking at data that has similar attributes, like the same set of repeated attributes, or attributes that are clustered together in the same value space, and building models based on that.

But to do what? Just do summarizations, or actually compress it better?

Compress it better, yeah. And locate it in the same file. And some of this stuff, to step out a little bit: we're not necessarily dealing with all time series data here. That's my background because of InfluxDB, but we're not necessarily saying that all this data is even going to be put on a dashboard or things like that. But yeah, better compression is kind of the goal.

But you used the word model there, and that means something specific in the machine learning world. That's just for the classifier and not like a summarization.

And then, stepping into the high dimensionality problem a little bit, we wanted to think about how we actually conceive of the data inside of ARADB. As a really simple example, here are two rows of data with four different fields; this is what you'd get from a relational database, just selecting two rows. But what if we take it and explode it into its most highly dimensional form? So for each field and value, we have a single dimension that's either true or false. With this representation, we can start to think about the data a little bit differently. In dealing with this data, when we think about being able to work with some of these dimensions, there are some tricks that we can start to employ to manage highly dimensional data more effectively. So we essentially said: what if we're able to handle the highly dimensional case, which is potentially harder and more complicated than handling a purely high cardinality space? What can we do? What can we do differently? Let me jump to the next slide and look at this a little bit. We said basically the exploded form of the data is more flexible, and you now have values that are all mutually exclusive.
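And just to sketch what that exploded representation looks like (hypothetical types here, not our actual internals): every distinct column=value pair becomes its own true/false dimension, and a row is just the set of dimensions that are true for it.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// A row as it arrives: column name -> value.
type Row = BTreeMap<String, String>;

/// Explode rows so every distinct "column=value" pair is its own dimension.
/// Each row is then represented by the set of dimensions that are true for it.
fn explode(rows: &[Row]) -> (BTreeSet<String>, Vec<BTreeSet<String>>) {
    let mut dimensions = BTreeSet::new();
    let mut exploded = Vec::new();
    for row in rows {
        let mut present = BTreeSet::new();
        for (col, val) in row {
            let dim = format!("{col}={val}");
            dimensions.insert(dim.clone());
            present.insert(dim);
        }
        exploded.push(present);
    }
    (dimensions, exploded)
}

fn main() {
    let rows = vec![
        BTreeMap::from([("host".to_string(), "server-01".to_string()),
                        ("region".to_string(), "us-west".to_string())]),
        BTreeMap::from([("host".to_string(), "server-02".to_string()),
                        ("region".to_string(), "us-west".to_string())]),
    ];
    let (dims, exploded) = explode(&rows);
    // Three distinct column=value pairs become dimensions here;
    // row 0 is "true" for host=server-01 and region=us-west, false everywhere else.
    println!("{} dimensions, row 0 = {:?}", dims.len(), exploded[0]);
}
```

Two rows with a couple of columns already produce three distinct dimensions; with many columns and many values, the dimension count explodes, which is exactly the regime the rest of this is about.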
And so by building a database that can work with this, we've sort of encapsulated the high cardinality problem. High dimensionality, if we think about each one of those possible values as its own dimension, lets us not have to think about any one particular dimension as high cardinality. Each one just has this boolean representation now. And we can use some machine learning techniques that work on dimensionality reduction to start to look at this highly dimensional representation as a space of values that we can work with in different ways, rather than just thinking about it as a pure storage representation.

Jumping off of that, we started looking at this paper, and I'm sure a lot of you have seen it, from 2017, from a bunch of folks at Google. Jeff Dean was one of the authors. Basically what they said is: let's look at traditional indexes like a B-tree, and let's see if there are better ways to model this data as an index that better reflects the patterns in the data and could potentially outperform traditional index structures. They looked at a bunch of different index types. B-trees were the ones they spent the most time on, but they looked at range indexes and hash maps and point indexes. They took individual indexes and multi-stage indexes, and one of the problems they talked about initially was that they started looking at this stuff just with Python and TensorFlow and found that the invocation overhead was too high. So they built this thing they called the learning index framework, where they can build these models and turn them into C++ code that they can run directly and ask: how does the performance actually compare? So they built a bunch of cool tooling around this stuff and looked at a bunch of different use cases. Here's one of the breakdowns from the paper. They basically found that if you compare some of these learned indexes that they're building, and obviously there are different kinds of search algorithms and things like that you can use within this, there are ways to not only create an index where the lookup is faster, but to build indexes where the overall memory requirement is smaller. So you kind of win on both fronts. And they looked at, I think in this one... yeah, let me jump to the next one. So here's a chart. They had some web log data, a log-normal data set, and another one that was document IDs from something like a web index. They looked at a bunch of different data types and found that depending on the parameters you tune with, you can actually get pretty decent performance out of this. It was obviously a very early stage paper, some of these were synthetic data sets, and there are obviously limitations to how far you can take some of this stuff. But we looked at it and said, there's some interesting stuff here, there's some promise here, and they took it in an interesting direction.

So why does any of this actually matter? Well, we saw some places where learned indexes can actually outperform traditional indexes, and that's interesting.
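To give a very simplified picture of the core idea, here's a single-stage toy version, not the recursive model index from the paper and not what we ship: fit a model that maps a key to its approximate position in the sorted data, remember the worst prediction error seen during training, and then only search inside that error window instead of walking a B-tree.

```rust
/// A single-stage "learned index" over a sorted slice of keys:
/// a least-squares line predicts position from key, plus the max error
/// observed during training, which bounds the window we have to search.
struct LearnedIndex {
    slope: f64,
    intercept: f64,
    max_err: usize,
}

impl LearnedIndex {
    fn train(keys: &[u64]) -> Self {
        let n = keys.len() as f64;
        let (mut sx, mut sy, mut sxx, mut sxy) = (0.0, 0.0, 0.0, 0.0);
        for (pos, &k) in keys.iter().enumerate() {
            let (x, y) = (k as f64, pos as f64);
            sx += x; sy += y; sxx += x * x; sxy += x * y;
        }
        let denom = n * sxx - sx * sx;
        let slope = if denom == 0.0 { 0.0 } else { (n * sxy - sx * sy) / denom };
        let intercept = (sy - slope * sx) / n;
        // Worst-case prediction error over the training keys.
        let max_err = keys.iter().enumerate()
            .map(|(pos, &k)| {
                let pred = (slope * k as f64 + intercept).round() as i64;
                (pred - pos as i64).unsigned_abs() as usize
            })
            .max().unwrap_or(0);
        Self { slope, intercept, max_err }
    }

    /// Look up a key: predict its position, then binary-search only the error window.
    fn find(&self, keys: &[u64], key: u64) -> Option<usize> {
        let pred = (self.slope * key as f64 + self.intercept).round().max(0.0) as usize;
        let lo = pred.saturating_sub(self.max_err).min(keys.len());
        let hi = (pred + self.max_err + 1).min(keys.len());
        keys[lo..hi].binary_search(&key).ok().map(|i| lo + i)
    }
}

fn main() {
    let keys: Vec<u64> = (0..1_000_000).map(|i| i * 3 + 7).collect(); // sorted, roughly linear
    let idx = LearnedIndex::train(&keys);
    assert_eq!(idx.find(&keys, 3 * 42 + 7), Some(42));
    println!("error window: +/- {}", idx.max_err);
}
```

The point the paper makes is that when the model fits the data's distribution well, the error window, and therefore the memory the structure needs, can be much smaller than a comparable B-tree.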
We've already been thinking about machine learning as a way to figure out how we store data or retrieve data more efficiently. But going back to some of the things I saw with the indexes at InfluxDB, being able to significantly reduce the size of indexes is actually a big deal. I think I mentioned earlier that one of the things we saw is that once your index gets to a certain size, you actually see boxes running out of memory and crashing. If you were able to apply some of these techniques to high cardinality data to reduce the index size, you could end up with some pretty interesting outcomes. And it goes in this direction of workload personalization: if you think about a generic database with generic indexes, there's an argument to be made that by adapting to different workloads, you can make the database and its index structures more performant. One of the bigger things they pointed out is that, through building and training these models, you can ride hardware trends: NVIDIA is projecting a thousand-x increase in GPU processing power by 2025. So if that's actually happening and we're able to leverage it, being able to build and use models for these kinds of indexes gets more interesting as GPUs start to outperform CPUs more significantly. But the big takeaway is that they showed some use cases, some sample data structures, and some ways to pair these things together, and then said: there's a lot of additional research that can be done here, this is just the beginning.

So we took a lot of that and started thinking about ways we could use it inside of ARADB. And this gets to your question, Andy, about super indexing and what it is. For us it's a family of techniques that build on some of these things that other folks have come upon. One of the things we started looking at is: in data sets like the ones we see in InfluxDB, you have to decide between a tag and a field. Well, what if some of those fields actually still had some other sort of index on them? Is there a way you can use a machine-learned index to improve on having to do pure table scans? And can we use machine learning to build membership structures for highly dimensional data? Similar to the way you might use a Bloom filter in an LSM tree to figure out which blocks are relevant, is there a way we can use these membership structures for this highly dimensional data to identify which attributes, which dimensions, matter, which ones are relevant to a particular query? And as I mentioned earlier, we can use dimensionality reduction techniques to bring down the space that we need to look at. But at the end of the day, there's no free lunch. While we see that learned index structures are interesting, there are definitely places where they underperform traditional index structures. And I think one of the things they pointed out in the paper was that you can still use these structures but fall back to a traditional index structure if you find that you're underperforming it, if you find that it's not a good fit.
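As a rough illustration of the membership-structure idea, here's an ordinary Bloom filter over exploded "dimension=value" strings, standing in for whatever learned structure would actually be used: each storage block keeps a small filter, and a query can skip any block whose filter says a dimension definitely isn't present.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// A tiny Bloom filter: k probe positions derived from two hashes (double hashing).
struct BloomFilter {
    bits: Vec<bool>,
    k: usize,
}

impl BloomFilter {
    fn new(num_bits: usize, k: usize) -> Self {
        Self { bits: vec![false; num_bits], k }
    }

    fn probes(&self, item: &str) -> impl Iterator<Item = usize> + '_ {
        let mut h1 = DefaultHasher::new();
        item.hash(&mut h1);
        let a = h1.finish();
        let mut h2 = DefaultHasher::new();
        (item, 0xdead_beef_u64).hash(&mut h2);
        let b = h2.finish() | 1; // keep the step odd
        let m = self.bits.len() as u64;
        (0..self.k as u64).map(move |i| (a.wrapping_add(i.wrapping_mul(b)) % m) as usize)
    }

    fn insert(&mut self, item: &str) {
        let idx: Vec<usize> = self.probes(item).collect();
        for i in idx {
            self.bits[i] = true;
        }
    }

    /// False means "definitely not in this block"; true means "might be".
    fn may_contain(&self, item: &str) -> bool {
        self.probes(item).all(|i| self.bits[i])
    }
}

fn main() {
    // One filter per storage block, populated with the block's dimensions.
    let mut block_filter = BloomFilter::new(8_192, 4);
    block_filter.insert("host=server-01");
    block_filter.insert("region=us-west");

    // Query planning: skip the block entirely if a dimension can't be present.
    assert!(block_filter.may_contain("host=server-01"));
    println!("host=server-99 possible? {}", block_filter.may_contain("host=server-99"));
}
```

The learned-index angle is essentially asking whether a model over the dimension space can play this same pruning role with less memory or fewer false positives; the Bloom filter here is just the familiar baseline.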
So what we wanted to look at is whether we can use this not as our only index structure, but as an extension to other structures that may be more performant in certain use cases, so that this concept of learned index structures can help us be more performant where it fits.

As a test bed for this, we started working on a prototype, and we just wanted to get to a place where we could see some progress with a minimal time investment. So we started experimenting with Postgres, because it has the index access method extension framework: you can build external indexes that Postgres can reach out to. We built this thing that we call the foreign index wrapper. Essentially, as writes come in, they get written into Postgres, but they also get written to this service that we wrote. Through hooking into the write-ahead log inside of Postgres, all of the writes, all the tuples, get sent over to this external API that uses this learned-index-structure kind of idea to build indexes against all this data. And rather than making it something where you have to specify an index, we just took all the data and indexed every dimension that came through. There was actually another project, whose name I can't remember, that we found while we were working on this that had a similar concept for how they did the foreign index wrapper. If anyone's interested in talking more about how we built this and how we hooked in, feel free to hit me up afterwards, because it was kind of a fun project. It's not something I think we're going to turn into a product or anything like that, but it was a cool way to leverage Postgres.

Was this an academic project or a commercial thing?

No, it was academic. I think we just ran into it on GitHub. It might have been on Hacker News at one point, but I think it was relatively academic; I don't think it was a commercial product. But if you're interested, I'll find it and send you a link.

Yeah, I'd be curious. So going back to this: they have newer papers on the learned index where you can adapt them to updates, but the original paper was read-only, it was static, right? I haven't looked at the newer papers that maintain them online, but in your case, you'd still have to have some kind of sharding for time ranges, because you can't be updating the one that you've already trained. Is that the basic approach you plan on taking: within a time range, here's the learned index data structure that I can use for that time range? Is that essentially what you're proposing?

That's one thing we looked at, but we actually do have a design that's closer to the ones you're describing. I haven't read those newer papers either, but it's more along the lines of something that's constantly being updated, so it's essentially continually trained.

But then also, your previous slide said you're trying to do approximate set membership. Or actually the slide before this one. Oh. For this. Yeah.
Like, you don't want them to point you to the actual location of the data; you just want approximate set membership?

Well, yeah. And some of this is where I'll show some of my weakness on the implementation side, because my co-founder is a little bit deeper in this stuff than I am. But for some of this, I think there's an intractability at a certain scale, and approximate membership gets us a trade-off between models or indexes that are too large versus data that we want to be able to pull back and then look through at query time or at decision time.

Would you want a Bloom filter that does range queries?

Maybe. And I mean, I think that's along the lines of the stuff that we've been building.

Yeah, so we have one.

Okay. Awesome. I'll need to look at that. Is that a paper, or is that just...?

Yeah, it's a paper; we won the Best Paper Award for it at SIGMOD.

Of course you did.

It's the only time I've ever won a Best Paper Award. No, the student who did it is awesome. He did it.

Okay. Yeah, definitely, I'd love to read that. Okay, sure. All right. Sorry, keep going.

Yeah. And so, you know, this was a prototype. We didn't go nuts trying to make it something that was going to be... like I said, we're not going to turn it into a product, but we wanted to see what it looked like with some real world data. And we found that, yeah, we were able to make this thing work; this thing, sort of a foreign index, worked. Essentially what we did is say: if Postgres has an index that the query engine finds matches well, it can use that. But if it doesn't, instead of having to fall back to a table scan, it can use this external index. And if we index everything, which in the real world is probably not ideal, but let's just say for the sake of the prototype that we index everything, we found that we could do a pretty good job of improving on Postgres's pure table scans. This was a smallish data set and it was relatively restricted, but we were able to find some speedups depending on the type of data we were using. So it gave us some optimism that there are ways we could incorporate this work from the learned index structures into ARADB, in a way that allows it to be not the sole index for everything, but an addition for some of this data, whether it's high cardinality or high dimensionality, so we can actually store and index that data and make use of it. And like I said, I think there's a lot we could do there, and if anyone is interested in talking more about how we did it, just let me know, because I thought it was kind of a cool project. But there obviously are limitations: Postgres has its own limitations, and it's not something you can use on AWS, because you can't install your own extensions. So it's got limited utility as a product, but I think it was cool for us.

So, moving on: talking a little bit more about schemalessness and how we want to think about ARADB as supporting data without having to have a schema.
So, one thing... yeah, trying to decide on the order of these next two slides. Essentially, we started thinking about how we get rid of the schema. Going back to InfluxDB and Elasticsearch and systems like that, it's sort of air-quotes schemaless, but there's still a first-write-wins kind of scenario, where you don't have to decide in advance, but once you write the data, that type is fixed. There are some ways to work around it, but it becomes kind of tricky. And then you have this consistency problem where you have to make sure that if a new series is being created, is it the right type? There are some race conditions around series creation and the first time you've seen a column, and it logically makes the system a little bit more complicated to deal with. So we basically said: what if any syntactically valid write was fine? It doesn't matter whether the column existed or whether it had a particular type. What if we essentially said anything can be written as long as it's syntactically valid, and then thought about what that opens up for us and how we deal with it if there are type conflicts?

So we came up with this idea where essentially every column, air-quotes column, can have multiple types. If it sees multiple types, those get stored separately. But each of those can be looked at independently at query time, and you can use some type coercion. So maybe the string "1" can be coerced to being the same thing as boolean true or integer 1, and you bake some things into the query engine that are a little bit more flexible with how types are handled at query time, without making there be a conflict when there doesn't have to be one. Obviously there are some things at query time that are not going to be coercible into each other, but most of the time it's possible. So we built into the query engine the ability to handle these multiple types, even if it's the same column. That's the direction we've been pushing towards, and we came up with this, I don't know, newer taxonomy of schema. At one end is the explicitly schemaful: relational databases, Postgres, where you have to define a schema and everything has to fit within that schema, otherwise it gets rejected. Then there's the implicitly schemaful, where you don't have to define it upfront, but there is a decision at some point about what the type of that column is. I mentioned that for InfluxDB there is a way to write multiple types, but you have to be very specific with how you work with them, and in general it's not actually very easy to interact with. And then explicitly schemaless is where we want to be, at the far end of the spectrum, where you can really write literally anything of any type into any field, and we'll accept the write, and at query time we will handle that natively as part of the query engine. We've started looking at that as an important part of our developer experience, or user experience, and it has helped us think a little bit more about ways that we can extend the system and design use cases around this.
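A minimal sketch of that query-time coercion idea (the types and rules here are hypothetical, not our actual type system): a single logical column holds whatever physical types were written, and a query asks for a coerced view of it.

```rust
/// One logical column can contain values of different physical types.
#[derive(Debug, Clone)]
enum Value {
    Bool(bool),
    Int(i64),
    Float(f64),
    Str(String),
}

/// Query-time coercion: try to view any stored value as a float.
/// Writes are never rejected for type reasons; conflicts resolve here instead.
fn as_float(v: &Value) -> Option<f64> {
    match v {
        Value::Float(f) => Some(*f),
        Value::Int(i) => Some(*i as f64),
        Value::Bool(b) => Some(if *b { 1.0 } else { 0.0 }),
        Value::Str(s) => s.trim().parse::<f64>().ok(), // "1" or "3.5" coerce, "abc" does not
    }
}

fn main() {
    // The same column, written with different types over time.
    let latency_ms = vec![
        Value::Int(12),
        Value::Float(12.5),
        Value::Str("13.1".to_string()),
        Value::Str("n/a".to_string()), // cannot be coerced; skipped by this query
    ];

    // "SELECT avg(latency_ms)" style aggregation over whatever coerces to float.
    let coerced: Vec<f64> = latency_ms.iter().filter_map(as_float).collect();
    let avg = coerced.iter().sum::<f64>() / coerced.len() as f64;
    println!("avg over {} coercible values = {:.2}", coerced.len(), avg);
}
```

The write path never has to check previously written types; a value like "n/a" just doesn't participate in a numeric aggregation.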
This is something that we can support. And that also feeds a little bit into scalability, because by not having coordination around types, or having to make sure that writes are valid based on previously written data, it's actually helped us simplify the write path in a way that makes it as coordination-free as possible. We can essentially say that any write, as long as it's valid, can be accepted, and the resolution of any conflicts can happen later at query time. In general, as I've said, we've designed it so that queries will coerce types to make queries possible. So we've made it easier for the write path to be scaled out, and writes will really only fail if there's some sort of catastrophic failure, like an entire part of the system being down, but not because of conflicts on type or anything else that another user could do. That's been an important part of being able to scale this out to handle higher write throughput and build a system that's easier for us to build and deploy.

And through this, as we started looking at use cases and talking to potential users, we've worked really hard to try and decouple storage and compute. Obviously we're not the first people to do that by any stretch. But looking at use cases where we want to deploy a lot of this stuff into a cloud native environment, having the storage tier separate from the compute tier makes it possible for us to build the separate tiers of the database independently and make them independently scalable. So that has also become sort of a guiding design principle: making these pieces independently functioning. And that also feeds into the next point, which is...

Is this really a good idea? Do people really want... I guess you're letting people shoot themselves in the foot, right? If they start putting garbage into a field, you'll take it, right? You're not doing any integrity checks. Long term, is that a good idea? Now, if it's metric or sensor data, yeah, maybe who cares, because you're just feeding in this data from these different sensors and it's programmatic. But long term, this is what burned a lot of people with the NoSQL stuff, and it's what burned people on CODASYL back in the 70s. The relational model is thriving and sustainable for that reason: yes, you have to declare a schema, but at the very least you could throw an error and say, well, you gave me a million floats for this column, but now you're giving me a varchar. I don't think I want that, and reject it.

Yeah. I think your point is valid, and I think there is a class of problems where data integrity is the most important part. But I think there is a separate class of problems where having that flexibility is actually more important. Let's talk about structured logs: you're dealing with a system where whatever your subsystem is, it's generating structured JSON logs, and you just want to be able to store that stuff. A developer makes a change that, like I said, changes a type or something like that. It goes from an int to a float.
Should you start rejecting that? I don't know. I think there's an argument to be made that, yes, garbage in, garbage out, but at the same time, if you assume there's some rigor around whatever comes in and you want to make it possible to store and work with that data, there is a class of problems where that's valuable. And certainly, yeah, you can let people shoot themselves in the foot. But I've seen people shoot themselves in the foot with the intermediate schema type stuff also, the implicitly schemaful systems. And I think for most of those cases, the end result would have been better if the system had handled it gracefully, rather than causing some sort of problem in the middle or starting to reject writes because somebody deployed a change. So yeah, your point is fair. It's certainly not the right solution for every problem, but I do think there are problems for which it is a good and interesting choice.

You're essentially robbing Peter to pay Paul: the writer is not paying the penalty, the querier will. That's a fair design choice. I understand.

Okay, thank you. I appreciate your blessing.

Do you ever surface these conflicts to the user if they're potentially doing it by accident? Do you ever give feedback to the user?

Currently no, not right now. It's an interesting choice, or I guess one of those questions of: is there a likelihood that there's even going to be a user looking for these? Do we just dump it to a log? Is there just a conflict log or something like that where it at least gets recorded? We don't currently do that, but it is an interesting question. I don't think it would be particularly difficult, as long as it didn't have to fall on the write path. Being able to avoid dealing with those conflicts at write time is useful, but I think there is potentially a system where users could get alerted to those. But yeah, we don't do that right now. Good question.

So, moving on to the final topic. I only have five minutes left, so I'll try and get through this. Operational simplicity is something that we've talked a lot about, and talked to users about. There are a lot of systems that are horizontally scalable, but largely they come with significant operational burdens. I've heard people complain about managing Elasticsearch at scale so many times. It's great that it works the way that it does and gives people the ability to scale out, but you end up having to do a lot of work to keep it running, keep it online, manage node health, things like that. So one of the things we tried, again, to push for as a design goal is: can we actually move to a world where we use something like object storage as our persistent storage? Every write is guaranteed to be written to object storage. There's obviously going to be intermediate state, whether it's caching or materialized views or something like that, but what if you can make all of that intermediate stuff stateless? Obviously there would be a performance penalty if you just tore the entire cluster down and brought it back up, but can you make it possible for this entire system to be stateless?
And in so doing, make it a little bit more, I don't know, Kubernetes friendly, for as much value as that has right now. Being able to run databases in Kubernetes is not great, but largely that's because you're trying to do a lot of persistent storage on nodes or pods or whatever that want to be stateless. So we've been doing a lot of work to look at ways we can leverage object storage as that persistent tier. We built a bunch of caching technology, and we built this service that we call the treasurer, basically, which looks at how frequently you're writing to object storage: do you want it to be more or less frequent to hit certain cost thresholds? Being able to think about that as part of the database is something that we've taken on a bit more. There are going to be some performance tradeoffs, and how it all integrates with things like the machine-learned indexes is still part of what we're working on. But as a usability argument, I think it's appealing to users to have something that can work this way. So that's the final thing we've been pushing towards.

So if you wrap all that together, what is ARADB? Essentially: we talked a little bit about what super indexing is, and it's basically a set of novel data structures combined with machine learning to handle that high cardinality, high dimensionality data. There's the internal type system that I talked about, which basically lets you write schemaless data, data without having to supply a schema, and lets those conflicts be resolved natively at query time. We're working towards the goal of having all state stored in object storage, so that in the event that you need it to or want it to, that can be your sole source of truth and everything else can be reconstructed from it. We're using, as I said earlier, principles like CALM to try and get a coordination-free write path, with near-linear scaling as part of the design. And then, I didn't really talk about this too much, but we've tried to remain query language agnostic. We've done some work to build an Elasticsearch-like DSL, and we've done an integration with Drill to have an ANSI SQL front end, but essentially we're trying to provide a set of query primitives that let us work with a variety of query languages. At some point we might have our own query language, but I don't think the world needs another query language right now, so we're trying to stay neutral on that.

And then, yeah, you're welcome. With my last minute: what's left for us to work on? We're building a database that will end up in production for our customers, so production readiness: there are a lot of things, backup and restore tooling, even just deployment management, that all need to come, and that's stuff we're working on right now. Finding more ways to leverage machine learning: the learned index structures paper is an interesting direction, but there's still a lot more to be done in basically every possible thing you can imagine within databases, and I think that can be interesting. Cloud native storage: object storage has its perks, but obviously not all object stores are created equal, so finding ways to work with all of them in some sort of equal way, even if the APIs end up being lacking, is interesting.
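And since so much of this hinges on object storage being the durability point, here's a rough sketch of the write-batching tradeoff involved. The types are made up, not our actual write path or a real object store client: records get buffered and only acknowledged once their batch has been flushed as a single object, so durability holds but small writes trade latency for fewer, cheaper object store requests.

```rust
use std::time::{Duration, Instant};

/// Stand-in for an object store client (S3, GCS, ...); assumed, not a real API.
trait ObjectStore {
    fn put(&mut self, key: &str, bytes: &[u8]) -> Result<(), String>;
}

/// Buffers writes and flushes them as one object when the batch is big or old enough.
struct BatchingWriter<S: ObjectStore> {
    store: S,
    buffer: Vec<u8>,
    opened_at: Instant,
    max_bytes: usize,
    max_age: Duration,
    next_id: u64,
}

impl<S: ObjectStore> BatchingWriter<S> {
    fn new(store: S, max_bytes: usize, max_age: Duration) -> Self {
        Self { store, buffer: Vec::new(), opened_at: Instant::now(), max_bytes, max_age, next_id: 0 }
    }

    /// Returns true once the record is durable in object storage (its batch flushed).
    fn write(&mut self, record: &[u8]) -> Result<bool, String> {
        self.buffer.extend_from_slice(record);
        self.buffer.push(b'\n');
        if self.buffer.len() >= self.max_bytes || self.opened_at.elapsed() >= self.max_age {
            self.flush()?;
            return Ok(true);
        }
        Ok(false) // caller can hold the ack until a later flush() happens
    }

    fn flush(&mut self) -> Result<(), String> {
        if self.buffer.is_empty() { return Ok(()); }
        let key = format!("wal/batch-{:010}", self.next_id);
        self.store.put(&key, &self.buffer)?; // durability point: ack everything in the batch
        self.next_id += 1;
        self.buffer.clear();
        self.opened_at = Instant::now();
        Ok(())
    }
}

struct MemStore(Vec<(String, usize)>);
impl ObjectStore for MemStore {
    fn put(&mut self, key: &str, bytes: &[u8]) -> Result<(), String> {
        self.0.push((key.to_string(), bytes.len()));
        Ok(())
    }
}

fn main() -> Result<(), String> {
    let mut w = BatchingWriter::new(MemStore(Vec::new()), 1024, Duration::from_millis(500));
    for i in 0..200 {
        let durable = w.write(format!("record {i}").as_bytes())?;
        if durable { /* acknowledge everything buffered so far */ }
    }
    w.flush()?; // flush the tail on shutdown
    println!("objects written: {}", w.store.0.len());
    Ok(())
}
```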
And then seeing how things like learned indexes perform in the wild: as we see more and more user data, we'll start to get a better sense for where it shines, where it fails, and how we can do better. And then looking at new use cases. Obviously my background is time series, and we've done some stuff with that and some stuff with more traditional analytical workloads, but there's obviously a lot more data out there, and we're looking forward to tackling it all. And that is time.

Awesome. Six o'clock on that. All right. So I will applaud on behalf of everyone else. We have time for a few minutes of questions if anybody has any. I have another question. I'm assuming you have an API that people send requests to to write data. Do you guarantee that it is durable in object storage before you acknowledge the request?

Yes. So right now, and that's something that we're working on, because obviously there are a couple of problems with that. One, if you trickle writes in, if you write one record at a time, you're basically going to get severely punished by your cloud provider for the cost of doing that. So we've got some work right now where we have some batching thresholds and things like that. Obviously the downside there is that you have potentially higher write latencies. But right now, yes, we do guarantee that. And that's something where we were thinking about potentially being flexible for certain workloads, like do we want a write to be acknowledged as received into a sort of write buffer or something like that, and then batched behind the scenes. But yes, right now all writes are durable. Writes can potentially take a long time if you write small batches.

And related to that, do you ever compact small files in object storage if you get a lot of these small writes?

Yeah, good question. Yes, we do have some background compaction processes now. That's obviously still an area of active research for us: what's the optimal way to do that, and how do you decide what the optimal size is? It's going to be totally dependent on the data. But yes is the answer.

So what is the new system? The foreign data wrapper is a prototype; you're building a standalone system, that compute layer in front of S3 or whatever, using it as the object store. What is the new system written in, if not Go?

Rust.

Interesting. Okay. And then, it's still early, and it takes a while to build a data system, but does the current implementation of the Rust stuff have any learned index stuff in it right now?

It does. And it's something... so we've been looking at that being a modular choice. We've got some deployments where we're using it and some where we're not, so it's not something that is required or necessarily used for everything. But yes, it is being used in some places.

And then, are people coming to you because Influx is not scaling for them, or Timescale is too expensive, or what? What is the use case? Are you seeing greenfield deployments, or are you coming in and replacing something that exists already?

Yeah, so that's a good question. I said a little bit earlier that we're not necessarily targeting pure time series workloads. So if you look at the space that InfluxDB is in, or Timescale is in, it's a lot of observability data.
And I think a lot of those use cases have been shifting towards ethios and things like that that are a little bit more native in the Kubernetes ecosystem, or people are going to Datadog for things like that. So the stuff that we're seeing more of is less the observability metrics and more things like distributed traces and logs. It's stuff that's fundamentally time series, but it's not the same kind of stuff that you would necessarily see InfluxDB handle. So some of this is, I think, greenfield, and some of it, we're also seeing people who are using Elasticsearch for some of these use cases.

So it's more unstructured than what you would normally throw at Timescale?

Yeah. And so, to your question, I think it's a good mix. There's some greenfield, but a lot of it is people who've chosen other systems, whether it's Elasticsearch or something else, and if you look at large Elasticsearch deployments, I think there are a lot of places where we can deliver an improvement over the Elasticsearch experience.

Okay. And then I guess the last question would be: what is the highest dimensionality of the data sets you're looking at? Like millions?

Yeah, so we've been talking about dimensionalities in the millions. If you ask Robert, his eyes get all twinkly and he's thinking billions, but I think practically we're talking millions. The goal is essentially making it limitless. Whether it's cardinality or dimensionality, we want it to be something that is not a constraining factor, but practically, millions.

And it blows up because you do that explosion thing, where you turn the fields and tags and values into all these true/false attributes. That's why it explodes.

Yeah. And that's part of the stuff that falls under the super indexing label, and that's an active area of development for us: figuring out how we work with that data more effectively, how we actually improve that performance, and how we make it more of a practical, real-world tool that hits more of these use cases. But at the same time, we're able to deploy a lot of other types of indexes and things like that as well.

It's almost like an array database. So I wonder if there's anything from that world, like TileDB, or SciDB to a lesser extent, that maybe you could leverage for how they store things and compress things better than just a straight column store.

Yeah, that's interesting. I had looked at TileDB a while ago, but I'd be interested to hear more of your thoughts on that as well.

Yeah. Okay. Well, not now, because the baby's screaming, I gotta go deal with that. Okay. All right. Todd, thank you for doing this, thank you for spending time with us. This was very, very interesting. And I think what we should do is, after you've proven out whether the learned indexes work or don't work, we agree to invite you back to give another talk.

Yeah, that'd be great.

What did you learn from learned indexes? That's... I like it. That's a good topic.
But yeah, I'd love to come back and talk more. I think it's stuff that's super cool, and I'd love to share more of our findings. Okay. Awesome. All right. So again, thank you, Todd, for being here.