The Carnegie Mellon Vaccination Database Tech Talks are made possible by OtterTune. Learn how to automatically optimize your MySQL and Postgres configuration at ottertune.com. And by the Steven Moy Foundation for Keeping It Real; find out how best to keep it real at stevenmoyfoundation.org. Let's get started. Welcome to another talk in the Carnegie Mellon Database Seminar Series. We're very excited today to have Dr. Martin Bravenboer. He is the VP of Engineering at a semi-new database startup called RelationalAI. They've been in semi-stealth, and now they're coming out and talking about all the cool things they're doing, so we're very excited to have him here. As always, if you have any questions for Martin, please unmute yourself, say who you are and where you're coming from, and ask your question at any time. That way Martin is not talking to himself for an hour; we want this to be interactive. So Martin, thank you so much for being here. Go for it.

All right, thank you, Andy. Thanks for the invite, of course. It's great to be here. We are big fans of the talks here at CMU, so we're happy to talk about RelationalAI. Thanks, everybody, for joining us as well. Like Andy said, I'm the VP of Engineering at RelationalAI. I will be presenting, but of course it's all work by my team, and it's a great team — we're very happy, and many of them are on the call here as well.

Okay, so what do we do? RelationalAI is building a database system for intelligent data apps. Let's unpack what that means. Intelligent data apps are typically built on the modern data stack, and they include predictive and prescriptive analytics — machine learning and optimization. We think that intelligent data apps are best implemented on a relational knowledge graph, and there's really no database system out there that is designed from the ground up to manage knowledge graphs at scale. That's what we work on.

So let's first briefly talk about what the modern data stack is. The modern data stack captures the next step after the industry went to cloud computing. When the industry first moved, there was a lot of infrastructure that was really designed to be on-premise; it was just moved to the cloud and happened to be running there, but it was not really architected for it — systems like Postgres or Oracle. In the last six or seven years there's been a movement to make these systems cloud native, and then the cloud data platforms arrived, for example Snowflake. Large organizations typically have hundreds to thousands of applications, and each application often has its own database designed in an application-centric way. You can imagine what a nightmare that becomes in terms of data governance. The modern data stack addresses that by bringing the data from all these applications together into a cloud data platform. You end up with the data of all the applications in one system, and now you can join across and manipulate the data across those applications. You can use the data in this cloud data platform to serve things like BI, data apps, ML workflows, and so on.

All right, so let's fill in some logos. The data apps on this stack typically require you to go into something procedural or navigational. They are not using declarative technology the way the cloud data platforms are.
Like Snowflake and Google BigQuery and Databricks and all that. The procedural ones typically either use tensors, because of machine learning, or they are navigational because you use a graph, or they are procedural because there's nothing else available. And they are not designed to be relational or cloud native.

Modern databases are cloud native. Databases like BigQuery and Snowflake essentially have infinite storage capacity because they use cloud storage as the storage layer. They also have essentially infinite compute capacity because compute scales independently and you can provision however many CPUs you want. These databases also support versioning, time travel, and cloning. That means you can have one workload, say, loading data into the system while separately you're doing exploratory analysis as a data scientist, and all these activities are isolated from each other, which is very nice.

Current graph database systems like Neo4j and TigerGraph are not designed to be cloud native. They were originally designed to be on-premise and they don't enjoy the elastic properties of the cloud. In general, essentially any system that was originally designed to be on-premise has a harder time doing this, because you need a lot of storage capacity to implement these kinds of features. As you'll see next, we basically pair with a cloud data platform, and we don't really care which one — the capabilities among these systems are very similar, whether it's Snowflake or Databricks or Redshift. It's a very competitive space; the opportunities to differentiate are somewhat limited. They're fighting each other, benchmarking each other, and as part of that they're also pretty clear about what workloads they support and do not support.

This picture is from the Snowflake homepage. Snowflake indicates that you can use them for data engineering, some elements of data science, and some elements of data apps. They're also pretty clear about what they do not support. For example, if you want to do graph analytics, they typically encourage you to take your data elsewhere — they would point you at a navigational graph system like Neo4j or TigerGraph, or maybe you'd even just start programming procedurally. If you want to do any kind of reasoning — and we'll get into what we mean by reasoning later — you basically have to leave and go into Java or Python, or maybe you would use a rule engine or something like that, but in any case it is always external to the database engine. If you want to do relational machine learning — we'll also get into what that is — or sometimes even any kind of machine learning, you also have to leave the database paradigm and go into things like TensorFlow or H2O and so on. And if you want to do mathematical optimization, then again you have to leave the cloud-native environment. Every time you have to do that, you lose all these compelling features: the elasticity of storage and compute, the versioning, data sharing, and workload isolation.

So what we do is complete the modern data stack so that you do not have to leave the environment anymore for these data-app workloads. The way we do that is that we invented new algorithms and techniques that make it possible to support these reasoning, graph analytics, and optimization workloads relationally. That's what much of the talk is going to be about — how we do that.
So in this picture the modern data stack becomes a larger box that contains the cloud data platform — Snowflake, Databricks, and so on — as well as relational knowledge graphs. And we bring these data apps, which were not initially implemented to be cloud native and relational, into the modern data stack, and they become declarative.

There's an interesting movement going on — at least I really enjoy it — which is that all kinds of people are starting to realize that even if you bring all the data together in one database management system, it is still difficult to work with. It's physically in one place, but it still needs to be integrated and used together. The way people are addressing this is by building a semantic layer, so that they can manage the complexity of all these different schemas. A prime example is dbt — they have also given a talk here. The semantic layer basically lets you define concepts and relationships on top of all these various data sets, and then you talk to the semantic layer instead of to the low-level database. I'm very excited about this, because what tools like dbt are trying to do at the semantic level is really stuff that databases have never been extraordinarily good at, and I think this will really push what databases will be asked to do, and that's always good.

This is also, incidentally, exactly what Legend does. Legend was originally developed by Goldman Sachs and was recently open-sourced. It addresses the exact same data governance issues, as a platform for data apps. What's kind of interesting is that, as I understand it, they have 20,000 users internally at Goldman using and building data apps on Legend, which is huge. To give you an impression of what it is — because all these words might give you the wrong impression — it's basically a textual language, a beautiful modeling language. Here's a Legend model with concepts like person, firm, and big firm, and you're able to derive what a big firm is with a declarative query. This is how users define concepts and relationships, which is very reminiscent of what dbt and also LookML do.

So if you look at the overall landscape, you see tools arising like dbt and LookML and Legend and, to some degree, Microsoft Power Apps, and they're all trying to serve this data-apps market — apps that do something more intelligent with the data sitting in the cloud data warehouses. We think it is really important that data apps are relational and declarative, that the language is expressive — you'll see why — and that it is based on executable models.

All right, before we dive in I need to briefly show how we use the relational model, because it's a little different from what you may be used to. To start very basic, let's take a look at a simple graph. This is just a normal graph with binary edges. Every edge in this graph becomes a tuple in a relation; we name the relation edge in this case. So for example there's an edge from 3 to 2, and you see the tuple (3, 2) here.

All right, now let's take a look at labeled property graphs as you find them in Neo4j and Neptune, for example. They can perfectly well be represented as relations too.
Properties of nodes become binary relations — for example, the name property here says that the person with id 2 has this name. The labels become unary relations — for example, director here and actor there. And the edges, like directed and acted_in, become binary relations — say there's a directed tuple from 1 to 3 there. Labeled property graphs also allow properties on edges, like the role an actor played in a movie — actor 1 played this particular role in movie 3. Okay, so that is labeled property graphs as relations.

Now let's take a look at tables. If I asked you to take this table and make a graph out of it, I would guess you'd come up with something like this: you take the primary key and make a node out of it — order 1 and order 2 — and you link it with edges to the different attribute values. If you map that back to the pictures I was just showing, you basically get this: the customer of order 1 is 500, the customer of order 2 is 73, and similarly for the dates and the prices. This is exactly how we model SQL tables — we model them in a more graph-like way. In a sense, the way we see a SQL table is as a modularity construct: it groups a bunch of relations together that share the same primary key.

All right, now let's go to vectors and tensors. Vectors can be modeled as relations as well: the first argument in this example is the index, 1 to 3, and the second is the value. Similarly, matrices are ternary relations: you have two indexes now. This really highlights that the relational model is very universal — you can model virtually anything in it. What's left is the challenge that it does, of course, need to be implemented efficiently: the logic is not the problem, the implementation is.

Matrix multiplication on these tensors is actually very easy to define in relational systems. This is matrix multiplication — you're computing these cells; that's the math expression you see there. And this is the same thing in Rel — we are designing a new relational language called Rel. It stays very close to the math notation, so it looks a bit more compact. This is not to show that Rel is better than SQL; it's just that the design bias of Rel is in this direction, which is why it looks nice. Basically, it's almost a direct syntactic mapping of the mathematical tensor notation — hopefully that's clear. Now, if you do this for sparse matrices, our performance results are actually quite good. If you do it for dense matrices, it's not so amazing yet. But the challenge to relational systems is data independence: the logic is independent of the system implementation, and systems should innovate to implement these dense tensors efficiently. Which is exactly Codd's point.
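To make the idea concrete, here is a minimal sketch in plain SQL (not Rel) of matrices stored as ternary relations and of matrix multiplication written as a join plus an aggregation; the table and column names are just illustrative.

```sql
-- Matrices as ternary relations: one row per (non-zero) cell.
CREATE TABLE A (i INT, k INT, v DOUBLE);
CREATE TABLE B (k INT, j INT, v DOUBLE);

-- C[i, j] = sum over k of A[i, k] * B[k, j],
-- expressed as a join on the shared index k plus a group-by aggregation.
SELECT A.i, B.j, SUM(A.v * B.v) AS v
FROM A
JOIN B ON A.k = B.k
GROUP BY A.i, B.j;
```

For sparse matrices this representation carries no redundancy; for dense ones, as the talk notes, the burden shifts to the implementation to use dense data structures under the hood.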
Quick question: the Rel programming language — is that what you would expose to somebody using RelationalAI, or is it something you maintain internally with a higher-level language on top of it?

Yes — so we do expose the language; you can build a system in this language, and it compiles into an IR. But we also have projects ongoing where we layer other things on top of Rel, because we care more about the generality and the model of Rel than we are married to the syntax. So both, basically.

So Rel is the level that you reason about before you go down to the IR? Yes, yes.

All right. So like I said, this goes back to the original point of Codd, who is of course responsible for the relational model. There is a beautiful paragraph right in the paper where he says that the model should yield maximal independence between programs on the one hand and the machine representation and organization of data on the other. That's the goal, and I think it's a great goal. I think it's good to keep challenging it and see how far we can push it.

Okay, so let's talk a little bit about reasoning and explain how we see reasoning and what it means. It's not a very widely used term, so let's first take a look at a very familiar example, at least to people from the SQL database community. Imagine you want to develop an intelligent data app for an order database, like TPC-H for example, and you may want to include some analytical features, maybe tracking a metric like the average charge per customer. Interestingly, this is where the problem already starts, because the TPC-H schema does not actually define charge — it's only used as a term in the description of the queries; otherwise it's an informally defined concept. It's also interesting that the order table does have a total price, which turns out to be the charge if you sum up the charges of the line items. So it's a computed column sitting in the order table that you could actually compute otherwise, and which we think belongs in the semantic layer — that's the point we're driving at. This is a pretty common issue: we see all the time that data comes in partially computed, and you have to go figure out the inconsistencies and work out what is actual source data and what is not. And I kind of enjoy that even TPC-H, the simplest benchmark you can possibly pick in terms of schema size, already exposes real issues here. In particular, like I said, if you're designing a system and you don't define these terms somewhere, there is no way you can nicely build a BI tool or anything else that actually uses that terminology.

Okay, so let's just go define these concept relations. These are the Rel definitions — I'm not going to explain the syntax much, I'm just going to assume you can follow it; otherwise, let us know. We're defining concepts for charge, for being late, and for revenue, and they are defined in this case both for orders and for line items — the order-level ones aggregate over the line items of that order.
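As a rough illustration of those derived concepts in SQL terms (the talk's actual definitions are in Rel), the line-item charge and revenue follow the usual TPC-H formulas, and the per-order charge is a roll-up that should agree with o_totalprice:

```sql
-- Derived concepts over the TPC-H lineitem table (sketch).
CREATE VIEW lineitem_metrics AS
SELECT l_orderkey,
       l_linenumber,
       l_extendedprice * (1 - l_discount) * (1 + l_tax) AS charge,
       l_extendedprice * (1 - l_discount)               AS revenue,
       (l_receiptdate > l_commitdate)                   AS is_late
FROM lineitem;

-- The order-level charge aggregates the line items; in TPC-H this
-- should match the precomputed o_totalprice column on orders.
CREATE VIEW order_charge AS
SELECT l_orderkey, SUM(charge) AS charge
FROM lineitem_metrics
GROUP BY l_orderkey;
```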
These definitions can also be understood by going back to the graph view from the table discussion: you can think of the definitions as introducing new edges and node labels — maybe the purple ones, for example, are the late orders. These definitions form a dependency graph. This one is a simple one, but there are dependencies, and this is very similar to what you see in dbt or Legend or LookML.

The knowledge we just defined really plays an active role in the semantic layer, forming this relational knowledge graph, and it needs to be kept accurate and up to date. Say the underlying graph changes — an edge gets inserted — that might have effects on the knowledge defined on top, and you would expect that to be propagated: maybe this edge goes away and another one becomes purple.

That is basically what we do, and we do it at very large scale. Applications of our current customers have thousands of derived relations defined, and the dependencies in the logic are complicated. The foundation of the system is Datalog — that's basically the underlying IR — and we support mutual dependencies between these relations; we use recursion for that. The recursion also allows general aggregation and negation, which is very important, because almost anything on a graph uses aggregation.

Now, I can't really show the complexity of the logic itself — first of all, the logic is confidential to our customers, and second, it would be too much. So we decided to make a dependency graph of the application logic of one of our customers. This is an analysis of the application logic, not the actual data in the application — this is the code. In this dependency graph of the code, every dot is a node that is a relation, and the edges are the dependencies between the relations. This is kind of large; as a user you don't have to understand it — it's meant to convey the sheer size of the thing rather than something you need to comprehend. And by the way, this application replaces, for this customer, a few million lines of procedural code they were writing, so as big as it is, it is much smaller than what they were maintaining.

This is also a kind of cool example, zooming in on the previous graph: you find all kinds of strongly connected components of recursive definitions. Here you see a strongly connected component where every node is a view definition, and they're all recursively dependent on each other. And then you have really dense nodes as well — this is one node in the graph with a crazy number of outgoing edges; I don't know what it is actually, we just found it in the dependency graph. This really shows the machinery you need for evaluation: even at the data level, some graph systems would not support these kinds of nodes very well, let alone at the code level.

All right. So if you have these ginormous models, eagerly maintaining the model is not a good idea. If you have a few thousand derived relations and on every update you go check whether each of them needs updating, that's not going to work. So RelationalAI is entirely demand-driven.
Computations happen only when they are needed — the user asks for something, say a particular computation on the graph. The architecture for this is really cool: it's actually based on programming-language research on compilers that are designed to be used in IDEs. There has been a progression in compiler construction from batch compilers to responsive, incremental compilers that recompute things while you're typing in the IDE — which is exactly what a database also needs, so that's what we use. It's kind of like a build system: there is dependency tracking, memoization, and cache invalidation going on. The framework we use for this is called Salsa; we open-sourced it, so you can use it. There are a couple of talks about it — one is actually not from us, it's about the Rust predecessor of Salsa, and it's a great talk, I highly recommend it; the other one is by our team members. The links are on the slide.

I love that this incremental, demand-driven approach is facilitated by things like dbt or Legend: now, instead of getting one-off ad hoc queries from Tableau, people are declaring ahead of time exactly what they want, and therefore you can take advantage of that and compile things.

Yeah, exactly. There's a question in the chat — Aaron, if you want to unmute yourself.

You mentioned that it's demand-driven. How are you deciding when you should cache values versus when it's cheaper to just recalculate them each time you need them?

Right. The caching decisions are currently made by us — we decide at the code level what is cached, because we know the system. We could of course make the wrong decisions, and we do have profiling for this as well that helps us understand the overall performance. The way we envision this is that, in principle, it's kind of like deferred view maintenance: the views could be brought up to date lazily, say in the background outside of transactions, so we can already start updating everything in preparation for future transactions that may happen. You could even imagine a really crazy scenario where we use machine learning to predict what people are going to ask for at what point in time and bring those views up to date accordingly. But architecturally, right now the decisions are made in our code; it would be interesting to do it differently.

I've got a question — can you hear me? Okay, really two questions. One is that I looked at your declaration of the data, and it looks very similar to RDF. Am I wrong about that?

No, you're not. It's similar but also different. RDF is just triples, and only triples; we do arbitrary-arity relations, so you can do arity one, two, three, four, five, and so on. That's a bit more flexible — say, if you have temporal data in particular, that is much easier to model in our system than in RDF. It is also somewhat similar in the sense that RDF allows very convenient schema-level queries — you can do generic queries over an RDF graph — and we support that as well. So there are some conceptual similarities, but we are somewhat more general in the data model.

Okay. Then you showed the matrix data type. Are you really using this data model for matrices, at the application-logic level? Yes.
That was the point I was trying to make: if you go back here and look at this, it's certainly not convenient the way it's written here, but there is no redundant information in there — you really need all of it. Maybe you'd want to check that no index is missing or something like that, but that's really it. Now, this is not necessarily how you want to implement it — you might want dense data structures in your runtime to actually support this efficiently — but yes, we are trying to do that. That's the AI in RelationalAI.

So if you compare it with a system like SystemML, which is focused on exactly this and is also declarative — do you think you can beat that?

I don't know. I've never used SystemML nor benchmarked against it. I've read about it and know a little about it, but not too much, so I don't really want to say too much. But we are probably a more general system, in the sense that we also try to support other workloads, so I imagine SystemML could do a better job than we do where they're specialized.

Right, that's the point — you're talking about large scale and very efficient computation with automatic optimization, all relationally, and that's what SystemML does.

Yeah, I'm sure it's a great system; we just also try to do other workloads. Thank you.

All right, where was I? I was here — incrementality. Okay. So briefly, to summarize the reasoning part and how we position ourselves, or at least what our vision is: from our perspective, reasoning really subsumes — it captures — the application logic that is currently still written procedurally in languages like Python and Java. We've all been conditioned to build our applications in a split-brain architecture: part of the application is defined in the database, usually a SQL database, and the rest of the application is written client-side in Java or C# or something. What the app is trying to do is not in one place, there's no good overview, and you cannot optimize across those parts. The database doesn't really understand what's happening — to the database it's like that dog cartoon, it just hears "blah, blah, blah, read this." So we think it's about time that the app logic is actually expressed relationally and executed by the database. Bringing the app logic into the database makes it possible for one system to manage the semantics, the integrity, and the resources that the application needs. We've been working towards this kind of solution since before RelationalAI, and it's getting more and more successful, I think. In particular, in the latest iterations we have seen some of our applications get something like a ten-to-one reduction in the complexity of the code base.

All right, okay, so let's get a little more technical. We're going to dive into object storage and how we store data. At a high level, the storage management is very similar to other cloud database systems — this is not intended to be surprising in any way. We use cloud object storage for durable storage, and we use ephemeral disks and RAM for caching.
We support workloads that are larger than memory as well as larger than disk; we just evict to object storage and load the data again later when needed. So we are kind of a blend of, let's say, Snowflake-like ideas and Umbra/LeanStore-like ideas — Umbra and LeanStore are systems designed to have in-memory performance while still working out of core, and we try very hard to be at that level.

Now, our databases are immutable, which means that data is never changed in place. The key difference from other systems is that the entire database is immutable, including the catalog, and it is versioned in the same way. If you look at Iceberg or Snowflake, they version individual tables, but they don't version the overall catalog itself in the same way. What you see here is the catalog — that's what this picture is supposed to show — with data for relations A, B, and C, and there is a highly available, strongly consistent key-value store that holds a pointer to the root of the database.

Now, if we execute a transaction on C and it changes something in C, then of course we need to incorporate that change into C. But C is an immutable data structure — it is write-optimized, though, and that is why the changes to C land in a buffer somewhere in the tree; we use B-epsilon trees, and there are papers written about them, they're pretty good. What also happens is that in the key-value store the root pointer for this database now points here. Note that the previous database version is not modified, and read-only transactions that were running on that version can just continue without any concern — it still exists in its entirety.

Next, maybe the user decides to take a snapshot of the current database and keep that version around under some other name. In our system this is strictly a metadata operation, because the only thing you're copying is the pointer. And it supports concurrent schema changes in full generality, which is very important to us because knowledge graphs are very dynamic — schema changes happen all the time — so this needs to be supported really well. If a previous database version is no longer used, then at some point it may be garbage collected, and data that was only reachable from that version will be deleted. That's why you see that C is gone here but A and B are still there — they are still visible from live versions. And then another transaction happens on the main database, which points to the new state, while the snapshot of course still points to the previous version.

So, while immutable tables are pretty common, the key distinction here is that our entire catalog is immutable and versioned. This level of immutability basically gives you strict serializability trivially, and we think that is critical for data apps. There is good work on this, in particular by Peter Bailis, who has looked into isolation levels and the potential violations and problems that arise in a split-brain setting, and we think it's very important to keep that isolation level. Read workloads do not need to take any locks — the data is immutable, so why would you lock anything — and there's also no coordination whatsoever needed with any other engine that may be running. So we scale really well for read workloads.
You can basically spin up however many CPUs you want and run the read workload on them. And as I said, cloning is an O(1) operation, which is great.

The next thing — and we'll get back to why — is that our entire database is indexed, which makes it really important to have write-optimized data structures, because indexed basically means all your data is sorted, and random insertions into a fully sorted database perform very poorly, particularly in a cloud object storage setting. So we use immutable, write-optimized data structures on cloud object storage, and that's a really good fit: write optimization limits how much gets written to object storage, and cloud object storage is something you don't really want to mutate in place anyway. So this is actually a very good combination, we think. Also, we don't have a transaction log, because every write transaction creates a new pointer to a new database version, which is atomically updated in the key-value store; anything that goes wrong before that simply doesn't matter — nothing got changed — so there's no need for a transaction log or recovery procedures.

Here are some papers that influenced our thinking; hopefully this is being recorded, so if you're interested you can read more about exactly what we do there.

All right, the next section is join algorithms — that's one of the exciting ones, because this is where things are pretty different. Currently, most SQL systems use binary join plans. There are variations, but generally they use binary joins: they join two tables at a time. Now, in a knowledge graph application you're often joining many relations, and the intermediate results are often very large if you only join two at a time. This is a triangle query — the graph pattern here describes a director who has a child who acted in the same movie the director directed; you're trying to find conflicts of interest, maybe. If you look at the options, any binary join plan here is just too large: there's no pair of these edge relations whose join is reasonably small. Most directors have children, every director has movies, every movie has directors and actors, and every actor has parents — so there is no pair you can pick that gives a reasonably small intermediate result.

I think this is maybe one of the reasons why practitioners often complain that joins are bad, or that you shouldn't use joins — which I of course don't like at all. There are multiple reasons for that sentiment, and this is probably one of them. So we use worst-case optimal join algorithms. It's a new class of algorithms whose properties we're honestly still figuring out ourselves — we keep discovering new, interesting things you can do with them. I'm going to show it from a few angles, which hopefully will make the point clearer. The first angle is the multiway search itself — I think that's how it's usually presented. Then we're going to look at correlated subqueries, and the third angle is index selection. All right, so first the search view — I'm not going to spend too much time on this because I want to get to the other parts.
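For reference, the triangle pattern described above could be written in SQL roughly as follows; the table and column names (directed, acted_in, child_of) are hypothetical, chosen only to make the shape of the query visible:

```sql
-- "A director whose child acted in a movie that the director directed."
SELECT d.person AS director,
       c.child  AS actor,
       d.movie  AS movie
FROM directed  d
JOIN child_of  c ON c.parent = d.person
JOIN acted_in  a ON a.person = c.child
                AND a.movie  = d.movie;
```

With binary join plans, whichever two of the three relations are joined first produce a large intermediate result; a worst-case optimal join instead narrows the search using all three relations at once, variable by variable.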
But basically, worst-case optimal joins use all the relations you're joining at the same time to narrow down the search. If you're looking for a female Asian director who is an Oscar winner, there is exactly one — she happened to win the Oscar last year — and because these relations have very different sparsity patterns, this algorithm can find that person very quickly. In particular, in this picture these are supposed to be tuples, identified by ID, these are the relations, and the dots are the facts that are present. In this region there's no Oscar winner, so it immediately jumps past it; and there's no director here, which is why it immediately jumps further. This is pretty well documented in papers, and this property of narrowing using multiple join arguments is pretty well understood, I think.

Okay, let's move on, though, because what's interesting here is that for the unary case, worst-case optimal joins are essentially similar to merge joins — so at the unary level this doesn't seem very new. The interesting innovation is that this kind of narrowing of the search happens continuously, at every level. To go back to the triangle example: we have to pick a variable ordering in these algorithms, so let's say we start with the director, then the actor, then the movie. You first start looking for directors, the d's. Then you're going to join the child relation and the directed relation, but we're not yet looking for specific actors and movies — we're only joining on d. So we're looking for directors who have directed some movie and have some child; that's a fairly efficient query, because it's a subset of all directors, which is not that huge. Now we have a d. Next we move to finding children a who acted in some movie. Now a occurs in the child relation as well as in the acted_in relation, and that's interesting, because it's actually fairly narrow: you need a director who has a child, and that child acted in a movie. That's probably not the most common thing ever — there are some, but it's not a multi-billion-row population — so that goes pretty well. Finally, at the last step, we take the m's, narrow it down further, and find only the ones that have a conflict of interest, let's say. So that's roughly how this plays out.

Now, while I was explaining this, you might already have gotten the idea that this looks kind of similar to correlated subqueries: once you have a d, the work you then do is kind of like a subquery — which is exactly the case. In worst-case optimal joins, basically every next variable you join on acts like a correlated subquery. Here's an example everybody can follow — a SQL correlated subquery, a query within a query. It is counting the posts of users in a certain country, and the inner count is correlated with the outer user. You can evaluate this in various ways; what SQL systems try to do is decorrelate: you analyze the query — which is very hard — and then you find a way to execute it that is not a nested subquery.
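Here is a small illustration of that kind of query and one common decorrelated rewrite, using hypothetical users and post tables (the actual example in the talk may differ in its details):

```sql
-- Correlated form: for each user in a given country, count their posts.
SELECT u.id,
       (SELECT COUNT(*) FROM post p WHERE p.user_id = u.id) AS n_posts
FROM users u
WHERE u.country = 'NL';

-- A typical decorrelated rewrite: aggregate once, then join.
SELECT u.id, COALESCE(c.n_posts, 0) AS n_posts
FROM users u
LEFT JOIN (
    SELECT user_id, COUNT(*) AS n_posts
    FROM post
    GROUP BY user_id
) c ON c.user_id = u.id
WHERE u.country = 'NL';
```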
Basically, that's what you're supposed to do: you're trying to avoid a nested-loop evaluation, where you go over the outer query and for every tuple you run the inner query — that is typically expensive. You could also first evaluate the subquery for everything and then join, but that risks doing a lot of unnecessary computation.

In RelationalAI we use two complementary methods to handle this. First, if the subquery is not correlated, a completely different part of the system — the semantic optimizer — will already optimize it away and do something entirely different, so let's ignore that case. For the correlated queries it's really interesting, because worst-case optimal joins really are correlated-join devices — that's how I like to think about it. They are good at correlated queries. And the next point is that they're not only correlated-join devices, they're also indexing devices, and together that makes a great solution for correlated subqueries. So let's go into the indexing, because that, combined with the other part, explains it.

For SQL systems — and many other systems, of course, not only SQL — selecting the right indexes is a very hard problem. It's not really solved: it's hard, users need to understand their workloads, and the tools that exist usually play an advisory role, so you still have to make your own decisions. But at our scale — with the dependency graph I just showed you — there's no way our users are going to manually select indexes; that's just not going to work. We need something different and more robust for writing large quantities of logic. So the way we recast the problem is that we automatically create an index on every relation in the system — that's why I said earlier that the database is entirely indexed. And because the schemas are graph-like, even when they actually were tables, the relations are very narrow — typically only arity one to three, much like an RDF triple. RDF systems do the same thing: they typically index the entire database on all the combinations of the triple. Given these indexes, the worst-case optimal join is essentially a device for building an arbitrary composite index on the fly: we make all these little building-block indexes, and on top of them you can make any composite index, which is of course very powerful.

That's exactly why worst-case optimal joins do so well on graph patterns, because that's exactly what graph patterns are: you're selecting nodes by some combination of properties and edges that meet your criteria. As an example, this is a small graph with information about cars: you have two cars, one is a Jeep and one is a Ford Escape. Given this schema, I would make these brand indexes and model indexes — two per item — and those are the building blocks for the composite indexes. All the other indexes are created on the fly, for free, when necessary, and they exist for all the combinations of the properties.
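In SQL terms, the idea is roughly that each narrow property relation carries its own index, and a multi-property lookup is answered by intersecting them rather than by a precomputed composite index (hypothetical schema):

```sql
-- One narrow, indexed relation per property.
CREATE TABLE brand (item INT, brand TEXT);
CREATE TABLE model (item INT, model TEXT);
CREATE INDEX brand_idx ON brand (brand, item);
CREATE INDEX model_idx ON model (model, item);

-- A lookup on (brand, model) becomes an intersection of two index scans,
-- playing the role of an on-the-fly composite index.
SELECT b.item
FROM brand b
JOIN model m ON m.item = b.item
WHERE b.brand = 'Ford' AND m.model = 'Escape';
```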
This is not yet an overwhelming picture in terms of how many of these indexes there are, but add one or two more kinds of properties and the number of combinations explodes — and they are all effectively available to the query evaluator.

All right, I think that's the key point of worst-case optimal joins. Now, how do we implement all this? This is the classic story about query execution: there are roughly three variations. You can build a classic tuple-at-a-time interpreter, which has low latency but a high per-tuple interpretation overhead. You can build a compiler, which has high latency but good per-tuple performance. Or you can build a vectorized interpreter, which amortizes the cost of interpretation and gets sort of the best of both worlds. These are great — however, nobody has yet figured out how to vectorize worst-case optimal joins. There are people working on it, but it hasn't been that successful. So what we do is have a compiler and a vectorized interpreter, and they work together. They're implemented in Julia, which is an interesting language because it's very high level but also allows systems programming. That helps with the maintenance concerns that typically exist when you have multiple backends, because Julia is very good at compiling, inlining, and optimizing things, so we can implement this at a fairly high level and actually share much of the infrastructure behind the two backends. What's also interesting is that we designed this so they can be nested: the compiler can be invoked from within the vectorized interpreter, so you can have a query plan that is vectorized at the outermost level but uses compiled code in the innermost join parts.

Aaron has a question about the Julia part: do you end up with an explosion of types, because the Julia compiler is trying to compile a new type for every schema you could possibly get, and blow out your instruction cache all the time?

Right — clearly you know Julia. We do exploit the type system to some degree, but we also compile in small pieces that don't use a lot of type-level machinery. I'm not sure what numbers you're looking for, but it's not excessive. This is not like systems that simply generate C++ and invoke a C++ compiler, with a very high-latency experience; this is significantly better, because we compile small pieces and put them together.

Okay, one more innovation here — I'm really running short on time — and that is our new join compiler. This is work in progress; we would really like to write a paper about it. It's a new join algorithm we invented: based on our experience with worst-case optimal joins, we found that the first generation of algorithms was really designed with interpretation in mind, and therefore carried some runtime bookkeeping, which costs something. This variation of the algorithm is designed specifically for compilation and compiles that overhead away into the generated code. It uses a state-machine approach and it is very fast.
We're very eager to start talking about it more, but I don't have much more detail on it today. Here are some papers. Let me move on to semantic optimization.

Semantic optimization is the high-level optimizer. The idea is very simple: you have a model — Rel application logic — plus some knowledge about it; it goes into the semantic optimizer, and an optimized model comes out that can answer faster. The kind of knowledge we specifically exploit here is axioms — the algebraic properties of combinations like plus and multiplication, min and plus, and so on. I'll show examples. Say you take a min aggregation of f(i) + g(j), where i and j are independent: with our semantic analysis we can conclude that these are independent, and you can separately compute min f plus min g. Now, if you take a min of f(i) + g(i), where there actually is a dependency — it has to be the same i — then this is not valid: you cannot split them apart and take the lowest of each, so this one cannot be optimized that way. And then there's the count example: if you count a Cartesian product, it's intuitively clear that you can count the individual relations and multiply the counts. What's surprising is that SQL systems still often do not do this, and I find it hard to understand why not.

Now, at first glance you might say, okay, cool, but that seems very syntactic — I could do that too. It's actually fairly deep: we really understand the semantics of the logic. Here's an example where I slightly refine that count of three independent relations by adding a condition that relates them, so you can't just do the simple thing anymore. The optimizer still understands the structure of the problem and decomposes it into an A-B problem and a B-C problem, multiplies those, and sums them up. So it really understands your math and optimizes accordingly.

Then, if you have recursive definitions, we can push aggregations into them as well. This is a recursive path definition for computing the length of a path. It computes all the paths in a graph, which is of course a lot — potentially infinite if the graph is cyclic — and only at the very end do we say we want the shortest one, as a min aggregation over that. Our optimizer understands the algebraic properties to the degree that it can push the min aggregation into the recursion. It then becomes a shortest-path computation and will only compute shortest paths — which to some degree is actually Dijkstra's algorithm, so we sort of built an algorithm inventor here, which is kind of cool. And it goes further: this is actually all-pairs shortest paths, which is nice but doesn't scale — if you have a graph with a million nodes, there are a lot of pairs. Typically you actually want more specific paths, and we support that too, with a demand transformation. The example we use here is degrees of Kevin Bacon: Kevin Bacon plays in many movies, and the degree of Kevin Bacon is how far away you are from him.
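To make the aggregation-pushdown idea concrete in SQL terms (not the Rel shown in the talk): a naive formulation enumerates path lengths with a recursive query and only aggregates at the end, which is the form their optimizer rewrites so that only shortest distances are carried through the recursion. A rough sketch, assuming an edge(src, dst) table and a length bound to keep the naive version finite on cyclic graphs:

```sql
-- Naive form: enumerate path lengths, take MIN only at the end.
WITH RECURSIVE path(src, dst, len) AS (
    SELECT src, dst, 1 FROM edge
    UNION ALL
    SELECT p.src, e.dst, p.len + 1
    FROM path p
    JOIN edge e ON e.src = p.dst
    WHERE p.len < 20          -- bound so the example terminates on cycles
)
SELECT src, dst, MIN(len) AS shortest
FROM path
GROUP BY src, dst;
```

Pushing the MIN into the recursion means keeping only the best distance per (src, dst) at each step, which is essentially what a dedicated shortest-path algorithm does; the demand transformation then further restricts the computation to the start node actually asked for, such as Kevin Bacon.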
If you invoke shortest paths but ask in the result only for Kevin Bacon, our system is able to specialize the shortest-path computation to include the fact that every path starts at Kevin Bacon. That's what you see there. Let me skip ahead — there is more interesting stuff coming up. Some cool papers: we have another paper this year that won a best-paper award, particularly on the recursive computation; it's very interesting.

All right, the language. The language is called Rel. I'm not going to explain the structure of the language; I want to explain some of the design ideas and why we created it. In our past we worked with Datalog, and Datalog is great, but it's also very first-order — it just does not support abstraction in any way. We realized we needed to scale up to what is necessary for the semantic layer: we have generic algorithms, and you want libraries and reusable code and all that, so we needed to design a language for that. We want abstractions for statistics, say, so that in the language you can just write standard deviation, for example, and it figures out how to compute it efficiently. We also want to abstract over schema — say you want to do machine learning applications over data frames; a data frame is really a collection of relations, and we want to write generic code over that. And we want the whole thing to be lightweight and easy: if you have a JSON file, you want it imported without having to worry about declaring a schema.

All right, some quick examples from the standard library. There's a file with about 6,000 lines of Rel that defines all these abstractions: linear algebra abstractions, statistics, and graph properties. I'll move on, because there's more interesting stuff later. This is the graph analytics library. It is defined entirely as a library that you can just instantiate from your application — not a template; you don't have to copy it or change it. You give it a graph and you get all your typical graph algorithms — this is of course a subset of what will eventually be in there.

And then I really like this one. Like I mentioned with the data frame example, 'features' here is really a set of relations, and it's really interesting to watch data scientists work and see what their needs are in their workflow. You really want to see statistics about all the relations in your data frame. That is what describe does — there's no SQL system that really does that: give it a whole bunch of tables and it gives you statistics about them. Rel supports this kind of abstraction, so describe is actually defined in the language, in the library. It does some metaprogramming over the relations that come in and generates all the right aggregations for you. That's what's happening here with the penguins data set: you get all the statistics — apparently there are more males than females, and most of them are on the island Biscoe. There are higher-order abstractions as well — they're very powerful — but I want to show other things, so I'm going to move on.

We also have incremental computation. I'm going to cover that very quickly too: basically, all these concept relations that we defined would otherwise need to be re-evaluated from scratch all the time.
We use incremental methods for that. We're big fans of Frank McSherry's work, so we use differential dataflow for the general case, and then we optimize special cases. That's basically what we do, and I can skip the rest of this.

Okay, I do want to do this part in particular: relational machine learning. Currently, if you do machine learning, what essentially happens is that you might have a beautiful relational schema where you did your very best to avoid any redundancy — and then you have to go into a machine learning tool that works with matrices, and to be specific, one matrix. So your entire schema has to be transformed into it: you join it all together and you get the ultimate denormalization of your entire database schema in that design matrix. To show concretely what that means — you don't have to understand all of it — say you have a sales data set with SKU, store, and date, and you're going to predict how much you're selling. Then you have to join all the properties of the SKUs and the stores into it, and the matrix gets really wide, with massive redundancy: what you see, for example, is that the price of the SKU, $5.14, is repeated over and over.

So we have devised methods for implementing machine learning relationally. With our research network, notably, we have developed methods that do not create a design matrix and instead operate directly on the relational structure. For that we needed to invent a bunch of things: we need to be able to write the models generically, we need to differentiate them — take their derivatives — and we need to optimize them. You probably already get the idea from earlier that we're really good at describing generic machine learning models relationally; that's what you see here. The interesting part is what happens next: for linear regression, you typically compute a covariance matrix, and you do that as aggregations over the design matrix — but the design matrix has an incredible amount of redundancy, so you're doing a lot of unnecessary work. So what would happen if you didn't do that? Here I show the Rel definition of the design-matrix aggregation: it's a generic thing with the features as an argument — say, the columns you're dealing with — and it aggregates them together. Imagine you specialize that to the price of the SKU and the size of the store. You can imagine these j's being replaced by those, and you immediately see that this is kind of weird, because the price of a SKU is completely independent of the store and the date in the data set. So why are you multiplying these numbers when they're all the same? Our semantic optimizer understands that structure — it understands functional dependencies — and can optimize it into a computation where we take the price only once from the original relational structure and just multiply by the counts of the other things. So that is, again, semantic optimization at work.

All right, papers. I should probably check — how am I doing on time, Andy? About a minute? Okay, cool. Then I can do this quickly. We also do mathematical optimization; I'll briefly explain what that is.
Maybe you haven't seen it in a while. What I just showed you is unconstrained optimization: there's an objective function, which is the cost function of the model; typically you take the derivative, do something like gradient descent, and you find a solution. What's interesting in particular is that all solutions are acceptable — you just want to find the cheapest one. Constrained optimization — mathematical optimization — is different in that there's a cost function but also constraints associated with it, and it's a completely different complexity class of computation, which I find fascinating. So you go to solvers for that, LP and MIP solvers and that kind of thing — Gurobi is a well-known one, CPLEX, and Xpress. The way you normally code these is via APIs or via something like AMPL, and those are actually beautiful high-level modeling languages.

I have a couple of examples here. This is the textbook manufacturing example, and this is the model in JuMP, the Julia library, which is great — a very high-level specification of optimization problems. If you look at it quickly, you can see that the objective is to maximize the sum of the profits; here it's the same — you maximize the profit — and it's pretty declarative, it's basically math. Well, I just showed you a lot of modeling in Rel, and you can certainly write this in Rel as well, and that's what we do. Rel supports expressing these objective functions and constraints, and then they have to be grounded against the data you're dealing with — which happens in the database, and that's something databases are good at. Then we hand the cost function and the constraints to the solver, the solver finds a solution for us, and it creates a relation — in this case it populates the decision relation for us. The design of this is very similar to JuMP — JuMP is a great Julia system, and the way we did this is very similar: the code is symbolically evaluated, and instead of querying, it creates the solver specification. The cool thing is that this all lives in the dependency graph: the inputs to the optimizer can come from your semantic layer, and its output can also be used in the semantic layer and can feed into a machine learning problem or something like that. So it's completely integrated into the semantic layer of the system.

Finally, we use DuckDB. DuckDB, if you don't know it, is an embeddable SQL OLAP system. It's great — very high quality, fast, and getting very popular. We use DuckDB to get SQL support. What we do is use Rel to model SQL tables, because, as you've probably gathered, a Rel relation is not a table — a table is a collection of relations. You can define those in Rel like this, and externally it looks like a table. So this is the mapping we do, and then we use DuckDB entirely for the SQL query evaluation. What's kind of cool about this is that, because you define it this way, you can actually compute an individual column. In SQL you're stuck with tables or views — you can't put one computed view column inside a base table. But because we decompose the table into these separate relations, we can do that: you can compute individual columns into a table alongside stored data.
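A rough way to picture the table-as-a-collection-of-relations mapping in SQL terms (hypothetical per-column relations keyed by the order id; the Rel mechanism itself is different):

```sql
-- One narrow relation per column, all sharing the same key:
--   order_customer(id, customer), order_date(id, odate), order_price(id, price)
CREATE VIEW orders AS
SELECT c.id, c.customer, d.odate, p.price
FROM order_customer c
JOIN order_date     d ON d.id = c.id
JOIN order_price    p ON p.id = c.id;
```

Because each column is its own relation, a single derived column can be defined or recomputed on its own and still appear next to the stored columns when the table view is queried.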
So we did look at some other options, and I'm happy to talk about our decision process, but I'm running out of time, of course. We took DuckDB as the foundation, we are members of the DuckDB Foundation, and we really like working with these guys.

I had a recap, but I'm really low on time, so let me not spend much time on it. Basically, you learned about reasoning, immutable databases, vectorization, worst-case optimal joins, semantic optimization, and incrementality. We have a language, Rel, relational machine learning, mathematical optimization, and SQL. And like Andy already asked at the beginning, we have 125 people. It's really cool to see how many different companies they come from; it's really interesting to work with people with so many different experiences. We have great investors; Bob Muglia is also an investor, and I think he may be on the call. And we have a research network that is really great. We love working with the academic community, and we have some great people, like Peter Boncz and others, in our network that we work with and regularly meet with to collaborate and do research together. That's it.

This is a lot to take in, very fascinating. So yes, we have time for one or two questions from the audience. You want to go first?

Yeah, I share your sentiment: there was a lot in this talk, and I have a whole bunch of questions, but let me ask a main one. I think Martin already answered my question in the chat, but I was curious how the different pieces here fit together architecturally. You mentioned Julia for compilation and DuckDB for querying, and earlier you mentioned the query execution. What does the architecture of the system overall look like?

Yeah, I can give you a picture of something like that. Basically, our system is more than 95% Julia, so it's almost entirely Julia. There are very small pieces in C or C++; Julia interfaces with C and C++ really well. The team is working on a Julia library, and that is what we use, basically; it's also separately available to people who use Julia. Of course, DuckDB itself is written in C++, but there is a C interface for that, which works perfectly fine. And that's basically it: it's a single process. We have a buffer pool, because buffer pools across processes are not amazing, and we have a pager that manages the memory. So that's the architecture. We're working on distribution; it's in the lab, so that's ongoing. Thank you for your question.

Yes, my question: I was wondering how you partition the data across nodes, particularly when you have a correlated subquery. Every node has to reach the other nodes because you're kind of navigating, so it's an n-squared problem.

Yeah, well, that's a good question. We don't have enough experience yet to really know how our system is going to perform for some use cases. It is true, of course, that graph workloads are challenging to partition and distribute. There are very specialized systems that do that extraordinarily well; for example, there are specialized triangle-counting algorithms that distribute really well, right? So we will probably continue to innovate here. I'm going to go back to Codd and say we can keep improving the system separately from your logic. But realistically, I do think there will be workloads that perform better or worse in a scale-out setting.
So the partitioning is based on physical design choices, and the system can have awareness of this, so I do think we can handle some correlated subqueries pretty well. And because the data is immutable, you can have multiple copies, right? There's not one node that owns anything or something like that, so you can replicate the data. If it is only a replication problem, then that is not an issue.

So how big is your database here? Is it a terabyte, a hundred terabytes, a petabyte? What is that?

Well, it still remains to be seen what we're aiming for, really. I think we're definitely aiming for terabytes. I'd have no problem with a petabyte database if there are videos and images in there, which is commonly the case, of course. Again, in principle we only care about the data that's actively being used, because it's all sitting in cloud storage and indexed. So to some degree the sky is the limit, but I think we're still in the early days of scaling up to very large data sets, so we have to see what we can handle, really.

Got it, thank you. My last question is, again, where I think the architecture diagram would be helpful: are you guys trying to be the data warehouse of record, or are you pulling things from Snowflake or Redshift? How do we think about what it is?

There was one little picture on that; that was the pairing thing. We're trying to complement the warehouses right now, right? Because they do a great job at holding your data and retrieving it. So we replicate, via CDC, into our system, and then also go back. And we are focusing right now on workloads that they do not support.

And a quick example: what's something you guys can support that LogicBlox couldn't support?

That's a good question. Well, LogicBlox was not cloud native in a sense, because it was built for on-premise, so it couldn't scale out. But really the biggest difference is probably the language: the language is very expressive, with generics and libraries and all that kind of stuff. It's a completely different programming experience, really. So the language and cloud native, I would probably say.