The Carnegie Mellon vaccination database talks are made possible by Autotune. Learn how to automatically optimize your MySQL and Postgres configuration at autotune.com. And by the Steven Moy Foundation for Keeping It Real; find out how best to keep it real at stevenmoyfoundation.org.

Welcome to another vaccination database seminar talk. We're excited today to have Pavel Kunov. He is the vice president of research and design at Stardog, a graph database that's been around for a while. He's here to talk about query optimization, which is something I care a lot about. Pavel is in Germany, so we appreciate him staying up late with us to talk about databases. As always, as Pavel gives the talk, if you have any questions for him, please unmute your mic, say who you are, and fire away. Feel free to do this at any time, because otherwise he's talking to a blank screen, and that's boring and not fun. So Pavel, thank you so much for being here. The floor is yours. Go for it.

Okay, thanks a lot for having me. We will talk about the Stardog database system, particularly the query optimizer, and particularly the join ordering and cardinality estimation parts. Just a quick overview of what's going to happen next. I will give a general overview of Stardog, and then I will have to talk at least a little bit about the data model and the query language. If I were talking about a SQL system I would completely skip that part and go straight into the internals, but since this is a bit of an unorthodox setup for many of you, I will at least explain how the data is modeled and what the query language looks like, so you understand what we are dealing with. Then we go into the details of the Stardog query optimizer and how it makes queries efficient. A lot of that is about join ordering and trying to select the best way to join graph patterns, and that depends heavily on statistics: Stardog has a cost-based optimizer, and the cost model is based on statistics, so that is a key part of the system.

But before that, I want to say a couple of words about the company and myself. We are not the most well-known company in the space, at least in the database space, because Stardog is actually not marketed as a database company. We are more of an enterprise data integration company. We are US-based. We started in 2006, a long time ago, but Stardog as a product started in 2010, when we realized the market for this data is pretty big and we didn't have a good product in the graph database space to use, so we started our own. The first release was in 2011; the current version is 7.9. It's a pretty mature system, used by a bunch of big companies, so it's not really a research prototype; it does make us money and is used quite a bit.

As for myself, I do have a fancy title, but what I am is actually an engineer. I lead the query engine group at Stardog and have been doing that since 2011, straight from my PhD. Our group is responsible for everything related to query performance, at least in the core part of the system. Every slow query that a customer cannot fix themselves goes to support and, in many cases, to us. So we get to see a lot of slow queries and spend a lot of time trying to figure out what happens and how to improve the query optimizer. As Andy already said, my PhD was in automated reasoning, in the UK.
And before that, I was a software engineer at CERN, the nuclear lab in Europe. The job itself was a lot less exciting than the environment, but it was still time well spent.

So what is Stardog? If you go to the web page, stardog.com, and try to figure out what it is, that's not going to be extremely simple to do. The way the system is presented is as a platform for enterprise data integration, aimed at bigger companies with lots of data sources and lots of databases but without a single unified view of the data. That's where Stardog comes in. But this is not what I'm going to talk about. At the core of the system there is a database with some well-defined components. To be precise, at the storage level it's RocksDB: a key-value store that Facebook forked at some point from LevelDB, the Google system. It's a key-value store, LSM-tree based, and Stardog has used it since version 7.0. Before that we had homegrown indices, and it's actually pretty amazing how much mileage we got from those; we went through six major versions of the product with B-trees we built ourselves in 2010.

The data model is basically a graph, and I will talk a bit more about what kind of graph it is. The query language is called SPARQL, a W3C standard. The system is transactional with full ACID guarantees; snapshot isolation is implemented via multi-version concurrency control, so writes are cheap and reads have to filter out uncommitted transactions and things like that. I will only talk about the single-node case in this talk. Stardog does have a cluster, which is a high-availability cluster for resilience: every replica has a complete copy of the data, and transactions are strongly consistent. And then there's a bunch of stuff for data integration and data virtualization, to help Stardog achieve that goal of unifying enterprise data, which is again completely outside the scope of this talk. What is important is that the need to support a bunch of external data sources does affect the design of the query engine quite a bit, even though I'm not going to talk about it. This is one reason we love the so-called Volcano model for query evaluation: it's super extensible. It's a bit old and a little out of fashion these days, but the extensibility is one reason we're still using it. In terms of programming languages it's a hybrid Java and C++ system. The storage layer is C++: RocksDB and a layer directly on top of it to efficiently move data in and out of RocksDB. The rest of the system is written in Java, including the query engine.

OK, so RDF. I don't expect many of you to know a lot about RDF. What RDF is, basically, is a graph data model. If you Google it and go to the W3C spec, you'll see a bunch of stuff about the web and IRIs and blank nodes and, you know, XSD. None of that matters a lot for this talk. You'll also find the formal semantics for the language, like what it means for two graphs to be equivalent, for example. That stuff scares a lot of people, but it doesn't matter too much for usage, and it certainly doesn't matter today for understanding how queries are optimized. What RDF actually is is a bunch of graph edges, written as triples with three components: the subject, the predicate, and the object. For example, in the first triple here you have John as the subject, knows as the predicate, and Mary as the object. You can draw it as a graph edge. And that's pretty much it.
Or you can store it as a single large table with three columns: subject, predicate, object. It really is as simple as that. Every node can be either an IRI, for the web, or a literal like a number, a string, or a date, any of the XML Schema data types.

One thing I want to say explicitly up front is that RDF is really all about relations. It is relational data. It's not stored in tables, but that doesn't make it somehow non-relational. It makes me really uncomfortable when people try to explain that there is relational data and then graph data is somehow not relational; that is usually not helpful. We are dealing with relational structures here which just aren't stored in tables, or we can store them in tables and still expose a graph query language on top of that. For example, the triples here, John knows Mary, John likes beer, John has a certain age: it's a bunch of facts, and you can write them as relational assertions in whatever relational schema you want.

Another thing people say a lot when they talk about graphs, or RDF in particular, is that RDF data has no schema. That is incredibly confusing and misleading in a bunch of ways, especially given that if you Google a little further you'll find that RDF actually has a standard schema language called RDFS, RDF Schema. You can describe things like the domain of a certain relation, say that the domain of knows is a person: if you know something, you are a person. Or the range of another relation, say that the range of worksAt is an organization: if you work somewhere, that somewhere is an organization. What we mean instead, at least what I'm going to mean today by no schema, is that there is no rigid schema, so there is no need to define relations before adding the data. The way people generate RDF data is not like in the SQL world, where you first need to define tables and then fill them up with data. Typically you just generate some data and then massage it into some kind of structure. There might be data which conforms to the schema and data which doesn't, and you use the same syntax to describe both the data and the schema; there's no separate DDL, so to speak. The query language semantics, the SPARQL semantics, does not make any use of the schema and does not assume that the data fits the schema in any way. For example, you can have something meaningless in your data like Bob worksAt 42. 42 is clearly not an organization, just a number, but that does not make the query results somehow ill-defined; the evaluation semantics covers that case as well. What this means for the query engine is that you cannot take shortcuts just because the data is somehow messy. You still need to execute the query according to the spec.

Now a couple of words about SPARQL. Basically what SPARQL is, first of all, is conjunctive queries over RDF. That's the core of it. On top of that there's recursion and some other stuff: filters, distinct, projections, whatever. But the core part is conjunctive queries over RDF. For example, this query here, from one of the standard benchmarks for SPARQL, asks for all the creators of both journal articles and proceedings papers. The first two so-called triple patterns select articles; the second two select proceedings papers; and they have the same creator, the same variable is used for the creator. And then we just get the name of the creator. You could write it as a Datalog program, and that would have essentially the same semantics.
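To make the triple table and the conjunctive query idea concrete, here is a minimal, self-contained sketch in Python with made-up data and names. It illustrates the semantics just described, storing RDF as (subject, predicate, object) tuples and evaluating the benchmark-style pattern with naive nested-loop joins; it is not Stardog code.

```python
# Illustration only: RDF as one three-column table of (subject, predicate, object)
# tuples, and "creators of both an article and a proceedings paper, plus their
# name" evaluated as a plain conjunctive query, i.e., a series of joins.
triples = [
    ("paper1", "type", "Article"),        ("paper1", "creator", "mary"),
    ("inproc1", "type", "Inproceedings"), ("inproc1", "creator", "mary"),
    ("mary", "name", "Mary"),
    ("paper2", "type", "Article"),        ("paper2", "creator", "john"),
]

def match(pattern):
    """Yield variable bindings for one triple pattern; '?x' marks a variable."""
    for s, p, o in triples:
        binding, ok = {}, True
        for term, value in zip(pattern, (s, p, o)):
            if term.startswith("?"):
                binding[term] = value
            elif term != value:
                ok = False
                break
        if ok:
            yield binding

def join(left, right):
    """Naive nested-loop join of two sets of bindings on their shared variables."""
    right = list(right)                     # materialize so it can be re-scanned
    for l in left:
        for r in right:
            if all(l[k] == r[k] for k in l.keys() & r.keys()):
                yield {**l, **r}

bgp = [("?a", "type", "Article"), ("?a", "creator", "?c"),
       ("?i", "type", "Inproceedings"), ("?i", "creator", "?c"),
       ("?c", "name", "?n")]

results = [{}]
for tp in bgp:
    results = list(join(results, match(tp)))
print([b["?n"] for b in results])           # -> ['Mary']
```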
The SPARQL semantics is well defined via the so-called SPARQL algebra, in much the same way SQL is defined via relational algebra. For every query, the spec defines a canonical algebraic expression. In this case there would be the basic graph pattern, then a projection of two variables, and then a distinct on top of that. Evaluation is bottom-up, inside out: the deepest nested sub-expressions are evaluated first. For the rest of the talk I will really be talking about the BGP part, the basic graph patterns. The rest is processed pretty much like in SQL; there are small differences here and there, around outer joins for example, but not much.

All right, so basic graph patterns. Let's consider the graph pattern we had previously: we have articles, we have proceedings publications, they need to be created by the same person, and we want the name of that person. How does the SPARQL specification define what this actually means? The job of the spec is to define what the query results should be for every query on every instance of the database. Informally, you represent the query as a graph: you write every variable, everything which starts with a question mark, as a node, and every triple pattern as an edge, so we have one edge for every line here. You draw that as a graph, and then you essentially compute the mappings of variables to real nodes in the data graph. The set of all those instantiations is the result of the query. For example, this part of the query selects articles, the type should be Article, so that's a constant node in the graph; the article itself is a variable and the creator is a variable. The other part of the query selects the proceedings publications, and they should share the same creator node; that's the join condition. And then we should have a name for that person. A bit more formally, it's defined as subgraph homomorphism. Importantly, the spec does not say anything about how you compute that homomorphism, that substitution of variables by values. That's completely up to the query engine. Some query systems do traversals: they decide to start at some node in the graph and then traverse the graph in various directions. What Stardog does is map every triple pattern to an index scan and join the results; the root of the join tree then represents the result of the basic graph pattern.

What this means for the design of the system is that it's a lot of joins, like in SQL. I don't know what's considered a lot of joins in the SQL world these days, five tables, ten tables? I have never heard. I think it was SAP, 1500. 1500 joins, 1500 tables joined together in a single query. Okay, that's big. That's big by SPARQL standards as well. I just asked because I sat through some SQL query optimizer talks where they would say things like, if you have more than five tables in the query, we switch from a cost-based optimizer to a heuristic optimizer, that kind of stuff. That's the same thing: we switch from the cost-based tree search to a genetic algorithm after 13 tables. Okay, yes, that kind of stuff. We have that too.
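To put rough numbers on why optimizers fall back to heuristics or genetic search at some point, here is a standard back-of-the-envelope count of join orderings. The snippet and the relation counts are illustrative assumptions, not figures from the talk.

```python
# Rough count of join orderings, to illustrate why exhaustive search stops
# scaling: n! left-deep orders, and n! * Catalan(n-1) if bushy trees are
# allowed, before even multiplying by the choice of join algorithm per join.
from math import factorial, comb

def catalan(k):
    return comb(2 * k, k) // (k + 1)

for n in (5, 10, 13, 20):
    left_deep = factorial(n)
    bushy = factorial(n) * catalan(n - 1)
    print(f"{n} relations: {left_deep:.3e} left-deep orders, {bushy:.3e} bushy trees")
```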
But the thing I'm trying to highlight here is the number of those little triple patterns in a query. That can be very big; hundreds is not too uncommon. They don't necessarily have to be triple patterns, but there is that number of things to join. I'm not talking about the size of the search space for the join order yet; it's just that the number of base patterns to join can grow very fast. That imposes certain restrictions on the design of the query optimizer: there is very little chance of a systematic exploration of the search tree. You need to cut.

You were talking about how there is the RDF schema, but you load RDF data without the schema, right? So my question is, what percentage of Stardog customers actually come with a schema? Because the Mongo guys added a schema, but from what I've heard nobody actually uses it. That's a point I will make later, which is that most graph data sets do have a schema. And the query optimizer is allowed to use the schema for optimization purposes. You can look at the schema information and decide, okay, this pattern is not going to match a lot of results, so let's evaluate it first. That's fine. What it cannot do is assume that, just because the range of a predicate is an IRI, there cannot be numbers there. That part you cannot do. But a lot of what our system does, and some other graph systems do, when computing statistics is basically trying to figure out the schema: whether there is some schema, maybe not explicitly represented, and whether the data actually conforms to it, and to what degree.

So the join order is critical. For example, if we add a little triple pattern here saying we only want people who work at Stardog, well, our engineers do publish papers, but not that many, so that would be a selective pattern to evaluate first, somewhere deep in the join tree. As I said, join order optimization in Stardog is a cost-based optimization problem. Every join order corresponds to some algebraic expression, or query plan; there's very little difference in Stardog between those two concepts. Every query plan has an associated cost, and the join order optimizer tries to find the plan with the least cost. That, I believe, is a very standard setup for this problem. In practice, however, the search space is not just big because of the space of all possible join orders, even though that alone is already factorial. You also need to pick the join algorithm for every join, and there can be multiple: in Stardog there are merge joins, hash joins, nested loop joins, bind joins, and some others. So that blows the search space up a little bit. And there are also constraints on when you cannot use certain join algorithms; for example, the merge join obviously requires the inputs to be sorted on the join key. So that makes the search space bigger but also imposes some restrictions on it. Here, for example, on the left we have one join tree with a merge join which computes all the articles and their creators, and only then joins that with the people who work at Stardog. This is a bad plan because the number of all articles you're going to iterate over is going to be very large in most reasonable databases,
and only after that do you get to all the employees of Stardog, which is a small set. On the right you have a better plan, where you would first get all the documents created by Stardog employees. That's going to be a small set; we can sort it in memory, which takes milliseconds or even less. And then we essentially do the type check, make sure all of those nodes are articles, and output the result. And when I talk about the join nodes here, these are logical plans, not executable operators. All the optimization is done on logical query plans.

This is my experience, but I think it generalizes pretty well, at least in the graph space, as far as the reasons for a poor join order go. And by the way, there are certainly ways for a query to be slow even with the perfect join order: you can get that part right and the query is still slow because you didn't push the filters down, or something else. But conversely, if you screw up the join order, in most cases there is no coming back from that. You're done; the query is not going to run fast. So that part absolutely needs to be done sufficiently well. So, assuming you have a query which is slow and the join order is bad, the reasons would be: first, the search space is big and you simply fail to find the right join order in that very large space. Second, your cost model is bad; it assigns the wrong cost to the plan, so you have a plan which appears more expensive but is faster than the one which appears cheaper. And third, the cost model is okay, but the input to the cost model, which is the cardinality estimation, is bad. By the way, I know some people present this a little differently; they might consider statistics a part of the cost model. For us, the cost model is a function which takes the query plan and outputs a number. In our setup the cost model does not depend on the data at all: it takes the statistical summary of the data and the query plan and outputs a number. So if the cardinality estimates are bad and as a result the cost is bad, we don't consider that a problem with the cost model; we consider it a problem with the statistics. That helps to separate those two things and makes it a bit more manageable.

The first reason is kind of inevitable: it's an NP-hard problem after all, so in the worst case you will always hit it eventually. But it is rare; the search algorithms are pretty good at exploring large search spaces. The second problem, a bug in the cost model which is not related to statistics, is usually an easy fix. Nothing is easy in the database world, but usually it does not take too much effort to fix, and as your system matures it happens less and less often. I remember at some point it was happening quite often, but now we have regression tests for all the cases where the cost model got it wrong, and it doesn't happen a lot. But the cardinality mis-estimations, where the estimates are wrong and the cost model basically waves its hands and says, okay, that's the best cost estimate I can give you given what you tell me about your data: that is unfortunately where I spend most of my time, and that's what keeps me up at night, because it's a major research problem. It's also what I'm going to talk about for the rest of the talk.
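As a toy illustration of that setup, the sketch below feeds made-up per-pattern cardinality estimates into a trivial cost model and greedily picks a join order. The numbers, the flat join selectivity, and the greedy strategy are all assumptions for illustration, not how Stardog's optimizer actually searches.

```python
# Toy illustration: a cost model that only looks at estimated cardinalities,
# and a greedy search that always joins in the pattern that keeps the
# intermediate result smallest. Numbers are invented for the publications example.
est = {                       # estimated cardinality of each triple pattern
    "?a type Article":        100_000,
    "?a creator ?c":          150_000,
    "?p worksAt Stardog":          50,
    "?p name ?n":             200_000,
}
sel = 0.0001                  # crude, assumed selectivity for every join edge

def greedy_join_order(patterns):
    remaining = dict(patterns)
    order = [min(remaining, key=remaining.get)]   # start from the most selective pattern
    card = remaining.pop(order[0])
    total_cost = 0.0
    while remaining:
        nxt = min(remaining, key=lambda p: card * remaining[p] * sel)
        card = card * remaining.pop(nxt) * sel    # estimated output cardinality
        total_cost += card                        # cost ~ size of intermediate results
        order.append(nxt)
    return order, total_cost

print(greedy_join_order(est))
# The selective "worksAt Stardog" pattern ends up first, as in the good plan above.
```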
Okay, so here's the basic problem. We have a graph pattern, and maybe we're trying to find the best join tree or something else. The basic question is: how many results is that graph pattern going to match when we evaluate it on our data? You need to answer that question somehow to tell the optimizer how to optimize your query. But there are some obvious constraints. You cannot just run the pattern against the data and come out with the perfect estimate, because in the worst case there may be a non-polynomial number of cardinality estimations during the join planning phase. So each estimation must be really, really fast. And some of them repeat, so you need to cache them, and so on. It's clear that you need to do something.

Can I just ask for clarification, please? Sure. When you say cardinality estimate, are you including things like whether different attributes are correlated? So it's not just the cardinality of the raw data, it's the cardinality of any intermediate result as well? Yeah. When I say cardinality estimate I mean that we need to know how many results a certain pattern produces. But that pattern can be a join of other patterns, like A and B, and then of course we need to know the correlation of A and B, how often they occur together. If the database system doesn't have anything in its statistics about correlation and just assumes independence all the time, then, at least in the SPARQL space, there's not much you can do. Right.

Okay, so it's clear that we need to pre-compute something. But you don't know the schema; in some cases there is no explicit schema and you need to figure it out from the data. The query workload may not be known in advance. The data can be large, like billions of edges: 10, 15, 50 billion edges is what I usually mean by large for a single node, and I'm talking about a single node here. And it can change. When Stardog started, nobody cared too much about velocity, how quickly the data changes. Stardog was motivated by the system called RDF-3X, Thomas Neumann's RDF implementation, and back in 2011 the world was: okay, you load a bunch of graph data, you index everything, and then you run a bunch of complex queries and the data doesn't change. If it needs to change, you just wipe it and load the new data and do it again. Under those conditions your life is a lot easier, because there's a lot you can pre-compute. These days, and this is one reason we changed the backend at some point, this is no longer the world: the data does change, and statistics need to adapt.

Okay, so what we actually pre-compute and how we use it, that's what comes next. One very common pattern which occurs in pretty much every SPARQL query is the so-called star-shaped subgraph: you have a node in the middle and a bunch of outgoing edges, like when you're querying all the information about an article, its type, its creator, the year it was published, the proceedings volume, stuff like that. Query shapes like that are super common. Why are they common? Because that's how people tend to model business objects, a node with a bunch of attributes, and also because these structures are essentially direct reflections of tables.
You can model this data as a single table for publications with a bunch of columns, every column corresponding to an outgoing edge; the publication here would be a table with four columns. A lot of graph data was tabular data at some point, and then for whatever reason people decided to move it to a graph.

So how do we compute those star-shaped subgraphs, or SSGs? We can do it in just one pass over an index where every edge is sorted by the subject node; it's one of the indexes Stardog has. If you do a single pass over that index, you can compute all those SSGs. And the really interesting thing here, which was discovered by Neumann and which I've seen confirmed multiple times in my own experience, is that the number of distinct star-shaped patterns is extremely low even when the data is extremely large. You can have a graph data set of billions of edges and have 50 or 100, maybe 150, different patterns like that, where the difference is in the predicate labels. It's very interesting to think about why that is the case, because you can always construct a situation where it is not the case, and then a lot of this starts falling apart and you need to do something else. But it is the case, because most data sets do have some schema, even though the data may not always conform to it. Essentially what is happening is that you're re-discovering the tables which were originally created for that data, when the data was first modeled and then moved to the graph; a lot of graph data sets were, as I said, tables at some point. And the truth is that in real life most schemas are compact even when the data is large. Since those patterns, the SSGs, are few, they can be maintained in memory in very efficient data structures like tries, essentially prefix trees, so we can query them during cardinality estimation very efficiently. That idea again goes back to Thomas Neumann, who has a paper on this for RDF.

I might name Thomas many times during this talk, by the way. My relationship to his work is really complicated: I'm both fascinated and a little bit annoyed. Fascinated because we managed to take advantage of a lot of his stuff, and annoyed because when you think of something smart, either you find it in some Thomas paper or it doesn't work. That doesn't always happen, but it's sometimes the impression I get. Thomas is, in my opinion, one of the world's greatest database researchers right now. The Germans do amazing work; it's a little intimidating, I find. But also, there's a lot of stuff: he writes a lot of code, he writes a lot of papers, and he's got three kids. Yeah, it's crazy. A lot of what he does we also couldn't really implement easily. But anyway, I digress.

If we look at our query, we'll see that, okay, we have those star-shaped subgraphs, but there's a complication here. One of the attributes is not a variable but is actually a concrete value, a so-called bound object. We're not just asking for all things which have some type; they need to have the type Article. We need to account for that in cardinality estimation. We know the number of subgraphs of this shape in our data set, but we don't know how many of them are articles. What Neumann says is basically: use the selectivity of this constant given the label of the predicate, and assume independence with creator. That's the kind of standard assumption in the cardinality estimation area.
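For reference, here is a minimal sketch of that one-pass computation over subject-sorted triples: group each subject's outgoing predicates into a set and count how many subjects share each distinct set. The data and names are made up, and this illustrates the idea rather than Stardog's implementation.

```python
# Illustration of computing star-shaped patterns / characteristic sets in one
# pass over triples sorted by subject.
from itertools import groupby
from collections import Counter

def characteristic_sets(triples_sorted_by_subject):
    counts = Counter()
    for _subject, edges in groupby(triples_sorted_by_subject, key=lambda t: t[0]):
        predicate_set = frozenset(p for _s, p, _o in edges)
        counts[predicate_set] += 1          # how many subjects have exactly this star shape
    return counts

triples = sorted([
    ("paper1", "type", "Article"), ("paper1", "creator", "mary"), ("paper1", "year", "2020"),
    ("paper2", "type", "Article"), ("paper2", "creator", "john"), ("paper2", "year", "2021"),
    ("mary", "name", "Mary"),
])
print(characteristic_sets(triples))
# roughly: {frozenset({'type', 'creator', 'year'}): 2, frozenset({'name'}): 1}
```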
If you don't know anything about the correlation, assume independence. That works pretty well, but this is one area where we can do better. I sincerely believe we do better here than the characteristic sets paper written by Neumann, and I'm genuinely proud of that. This is where count-min sketches enter the picture. I don't know if I need to say a lot about what those are; I feel you must have covered this stuff before me, so I will be very brief. If you know how Bloom filters work, they are similar in a certain way: Bloom filters are probabilistic sets, and count-min sketches are probabilistic multisets. They give you the approximate frequency of an object in a data stream. Bloom filters are sort of one-dimensional; count-min sketches are two-dimensional. You have multiple hash functions, so when you consume the stream you apply the hash functions, say four different hash functions in this example, and you increment four counters for every object in the stream. And then when you query, when you want to know how many times a certain object occurred in the stream, you just take the minimum over those four rows.

The fascinating thing here is that it's an extremely compact data structure, more compact than a Bloom filter: Bloom filters are linear and this one is sub-linear. I can talk later about why that is the case. Most count-min sketches in Stardog use four hash functions and rows that are 1024 (1K) counters long, 1024 64-bit integers, which means the total memory footprint is about 32KB. And I told you before that the number of those patterns we pre-compute is small, so you have maybe 50 or 100 of them times 32KB, which is pretty much nothing; you can keep all of that in memory for your estimations. That image is credited to Hetron Coila.

The key properties: count-min sketches can overestimate but never underestimate, and they are great at detecting the so-called heavy hitters, the objects which occur many, many times in the data. What they cannot do for you is distinguish between objects which occur rarely and objects which do not occur at all. But that is fine; we can live with that. The reason we can live with it is that when we're trying to estimate the cardinality of a pattern and we cannot distinguish between the pattern matching nothing in the data and matching 10 or up to 1000 results, we don't care. In either case that's going to be a very fast pattern to evaluate, and you can do whatever you want with its results, like sort them in memory; it will be fast. So the weakness of the count-min sketch by and large doesn't matter.

So what happens is we augment every subject-star pattern which we pre-compute with a sketch. That's the improvement over the characteristic sets approach. The interesting part is that those bound objects can occur at the end of any predicate, but we only compute one sketch per star-shaped pattern. Why does that work? It's maybe not obvious, but the reason is that in real life most property ranges are disjoint. If you have a business object with a bunch of attributes, yes, you can always imagine a case where attributes have overlapping ranges, but in most cases they are more or less disjoint. So we can get away with only one sketch per SSG.
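Here is a compact count-min sketch along the lines just described, four rows of 1024 counters with a minimum taken at query time. The hashing scheme and the toy data are my own illustrative choices, not what Stardog actually uses.

```python
# Minimal count-min sketch: update increments one counter per row, query takes
# the minimum across rows. Never underestimates; may overestimate on collisions.
import hashlib

class CountMinSketch:
    def __init__(self, depth=4, width=1024):
        self.depth, self.width = depth, width
        self.rows = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        for i in range(self.depth):
            h = hashlib.blake2b(f"{i}:{item}".encode(), digest_size=8).digest()
            yield i, int.from_bytes(h, "big") % self.width

    def add(self, item, count=1):
        for i, b in self._buckets(item):
            self.rows[i][b] += count

    def estimate(self, item):
        return min(self.rows[i][b] for i, b in self._buckets(item))

cms = CountMinSketch()
for value in ["Article"] * 5000 + ["Inproceedings"] * 300 + ["Book"] * 2:
    cms.add(value)
print(cms.estimate("Article"), cms.estimate("Book"), cms.estimate("Thesis"))
# e.g. 5000 300 0: heavy hitters are accurate; rare vs. absent may blur together
```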
Then, okay, we're not done yet. We can estimate the star shape, but then there is a join on the object variable: we're trying to find articles and proceedings papers written by the same person. So there is a join here, and we need to know how many results it matches. Of course we could do the same thing again and pre-compute star-shaped patterns not for outgoing edges but for incoming edges. I've thought about that a number of times, but it seems like overkill, because you spend just as much time doing it and those patterns are substantially less common. So what we do instead is use the average in-degree for the creator predicate, and that seems to work reasonably well; it also takes less time to compute.

The most problematic part is the so-called frequent chain estimation. Here it becomes a little similar to all those data mining approaches, mining frequent patterns, that kind of stuff. The problem is that to estimate how many of those publication creators actually have a name, we need to know how frequently this chain occurs in the data: creator and then name, that pair of predicates. And at this point we have to assume independence between the estimates for the star-shaped subgraphs and the chain estimates. So the question is how many creator/name chains are in the data, and it essentially means we're trying to figure out the foreign key relationships in the graph data. In the SQL world that would be a foreign key relationship, but without it being explicit in a schema, we have to figure it out from the data. In this case, name is functional in most reasonable data sets: every person has a single name. The statistics will figure that out, and we will use it for estimation.

Just as with the star-shaped patterns, there are lots of binary chains in the data but not so many frequent ones. So the interesting thing is that you need to analyze a lot of data, but the result of the computation is usually compact; that is true across all the statistics computation that we do. We can pre-compute all frequent chains in the graph data via essentially one join over the full database, and we're trying to improve that by looking into sampling, so that it takes sub-linear rather than linear time, and then we discard the low-frequency chains. And again, when we have to deal with objects bound to constants on the right-hand side of the chain, for patterns like give me all articles written by Paul Erdos, now we have a constant on the right end, we do the same trick: we keep a count-min sketch for each of the frequent chains we discovered in the data.
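A rough sketch of that chain mining step: count how often one predicate's object is another predicate's subject, via a single self-join of the triple table, and keep only the frequent pairs. The data, names, and support threshold are made up for illustration.

```python
# Illustration of mining frequent predicate chains (e.g. creator -> name):
# self-join on object == subject, count (first_predicate, second_predicate)
# pairs, discard the infrequent ones.
from collections import Counter, defaultdict

def frequent_chains(triples, min_support=2):
    by_subject = defaultdict(list)
    for s, p, o in triples:
        by_subject[s].append(p)
    counts = Counter()
    for s, p, o in triples:
        for p2 in by_subject.get(o, ()):       # o is the subject of the next hop
            counts[(p, p2)] += 1
    return {chain: n for chain, n in counts.items() if n >= min_support}

triples = [
    ("paper1", "creator", "mary"), ("paper2", "creator", "mary"),
    ("inproc1", "creator", "john"),
    ("mary", "name", "Mary"), ("john", "name", "John"),
]
print(frequent_chains(triples))   # {('creator', 'name'): 3}
```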
So now, putting all this estimation together: how does it actually work algorithmically when we have a query pattern like this? On the left-hand side is your query as you would reason about it as a human, because that's how it resembles the data. On the right-hand side is what the query optimizer actually uses, the so-called join graph. In the join graph every node is a triple pattern, and two nodes are connected by an edge if there is a join condition between them; in SPARQL the join condition is a shared variable. The rest of the cardinality estimation logic is basically covering the join graph with the statistical summaries we pre-computed before.

For example, we look at this and see that the article is the subject in both patterns, which means it's a star-shaped pattern, so let's use the SSG statistics for that. Look, there is another star-shaped pattern; we also have statistics for that, so that part is covered. Then there is a join edge between those two, which is the join on the object variable. We don't have a star-shaped statistic there, but we know something about this predicate, namely the average in-degree, so we're going to use that. And finally there is only one node left, and it's a chain join, because the shared variable, creator, is in the object position on the left and the subject position on the right-hand side, so we're going to use the chain statistics for that. What you get in the end is essentially a minimal spanning tree: you cover your query, the join graph, with the available statistics for your data in some deterministic way until you get the minimal spanning tree, and then you compute the overall cardinality estimate for the pattern. Basically, you multiply the cardinality estimate for every node by the selectivity of each join edge, and you arrive at a single number for the pattern.

I think I'm done with most of the content, but I want to talk about stuff which doesn't work all that well, not just about stuff which works great, like count-min sketches; I love those, I could spend a whole hour just on them. But I also want to talk about things which don't work that nicely. The biggest problem in this whole space is that you get pretty accurate cardinality estimates at the lower levels of the query plan. When you join individual triple patterns, for example, here we would have a nearly perfect estimate of the number of articles and creators in the database, and the same for the proceedings papers. But we will have a less accurate estimate on the next level, when we join on creator, and then possibly even less accurate on the level after that, when we get the names. This one happens to be accurate because it's a functional predicate, but in general the higher you go up the query plan, the more estimation errors accumulate, and since the query plan can be both bushy and deep, what looks like a reasonable estimate at the lower levels might be off by a couple of orders of magnitude at the root of the plan.

That is the biggest issue here, and it's difficult for the query engine to know at which point it can no longer trust the cardinality estimates. So in addition to the basic cardinality estimation framework, which most systems seem to have these days, the query engine needs things like safety nets. For example, what happens when you thought your hash table was going to contain a thousand results, you start computing the hash table for the hash join, and it turns out it's a couple of million results and you don't have memory for that?
You cannot just crash the system. So what Stardog does is start spilling data to disk, which means the query gets slower but the server remains operational; you can terminate the query, you can hit the timeout, you can manage it operationally. Another thing which is useful is the feedback loop. What if you spilled to disk, there's nothing you can really do at that point? Yeah, but the open problem is how to detect that the cardinality estimate is off: how do you figure out the bounds for your mis-estimation, how wrong it can really be? Because depending on the place in the query plan, it may or may not matter too much. For example, if your estimate is 10 and it can be two orders of magnitude wrong, that bounds your result to somewhere between 0 and about 1000, and that's not going to matter: as I said, computing a hash table of 10 results versus a hash table of 1000 results makes no practical difference. If your base estimate is a million and you can still be two orders of magnitude off, that's a substantially riskier situation, and there's not much reasoning happening around that today. There are join algorithms you would simply not use if the risk of cardinality mis-estimation is particularly large. My favorite example is the bind join, which can be absolutely amazing in some cases and probably disastrous in others: that's the join algorithm I would not use, even if things look okay, if I know the cardinality estimate could be very wrong here. So that's good, thank you.

Okay, but then the problem is that it's difficult to account for things like the amount of memory in your cost model. You don't know which other queries are running in the system, you don't know if you're going to have enough memory for this hash table, you don't know if that hash join is going to spill or not. The feedback loop is a great thing in some cases: the query engine can detect that it made a mistake in the past, it starts computing the hash table thinking it was going to be 100 elements and in the end it was a million, and it can record that information for future queries. Unfortunately it's not going to help you save the current query, but maybe you won't make that mistake in the future. But what's really interesting, and where we don't do enough and should do more, is that sometimes the query engine really needs a plan B when it sees that the query execution went horribly wrong: you start computing a hash table and you see that you're not going to have enough memory and the estimate was bad. What some systems do is swap the order of the operands of the hash join. That creates other troubles, because it might impact the order in which the results come out of the hash join, and that in turn impacts the joins further up the tree. So I think a better approach, which we're looking at, is, instead of re-optimizing while executing, like swapping operands, to look into incrementally executing the query while you're still optimizing: if your cardinality estimate is risky at some point, maybe you can afford to spend a little more time actually executing that little part of the query and use that information to optimize the rest later. Or maybe just rely less on cardinality estimates. I know it sounds weird, because I just spent a whole hour telling you how important cardinality estimates are, but this question bothers me a lot.
Maybe we are relying too much on cardinality estimates; maybe at some level of the query plan you just need to switch to something else. The problem is that you don't know when your cardinality estimates start being too imprecise for the task. We've had previous talks from Databricks and Snowflake tackle this exact problem, but how much query planning can you really do without cardinality estimates? Oh, that's interesting. Yeah, I saw some papers on trying to figure out the bounds and what you do with them; we haven't been able to put any of that into production. Can I just point to one interesting direction which has sort of started being explored: robustness of the optimization, where rather than trying to find the best plan you try to find a plan which is not so sensitive to mistakes. Yes, absolutely, this is a great point, and that's one of the issues in Stardog: when it gets things right the queries are very fast, but if it gets things wrong you cannot really say how bad it's going to be; it can be terribly bad. So yes, robustness.

Just wrapping up: I was only talking about the very basic part of SPARQL, the basic graph patterns. There is a lot more that makes estimation difficult. For example, recursive patterns: you have things like transitive properties, how many people do you know if knowing is transitive, so if you know John and John knows Mary, you know Mary. That reachability kind of estimation is a huge issue. Then there are estimates for query patterns which go out to some other system, as in data virtualization; that's a whole next level of complexity, because that system now needs to tell you something about the data which resides there, and there is no common API or interface for asking a separate system for cardinality estimates; most systems do not expose that over any kind of standard interface. I haven't talked much about the computational side of statistics, like how often we recompute them; the standard setup is to do that when more than 10% of the data has changed. I haven't talked about sampling, which I fully believe we need to do more of to make cardinality estimation a bit faster. There is a bit of a problem in the graph space when it comes to sampling: sampling from the graph in such a way that the sample has the same structural properties as the original graph is not straightforward at all, for example so that you can read the same frequent chains from the sample as from the original graph. That's difficult. So we are looking into both sampling and summarization, that is, graph compression instead of sampling, and incremental computation. And there are a bunch of other things we do that I don't have time to talk about, like debugging query performance problems: when a query is slow you want to know which join is actually bad, which join computes more intermediate results than was predicted by the cardinality estimates. We can do that with a query profiler which collects that information, and that's usually useful. And then performance testing and cost model testing; there is a framework for that.

And I will just throw this in at the very end: maybe we don't even need to do all this fancy statistics collection stuff; all we need to do is train a machine learning model which will figure out the data properties and produce accurate cardinality estimates for us. There are a bunch of papers on that topic; every year I see at least several in this space.
I have a backlog of maybe 10 or 15 papers, and every year I make a new year's resolution to look into them, and maybe that will simplify everything and just win decisively, and it just never happens. Anyway, that concludes the presentation. Thanks a lot for your attention. We are hiring, and if you have any questions later you can always contact me; I'm happy to chat about any of these things, about the things which work and even more so about the things which don't. Thanks a lot.

I hope that will happen. Pavel, thank you so much for being here. We have time for one question from the audience, so feel free to fire away. Hamid said in chat he's got one. Oh, Hamid, go for it. Okay, so the question I have is... oh, we lost him, you muted yourself. Yeah, I think ask him to unmute. Got it. So the question I have is: what size of databases are we really dealing with? Is it like a terabyte, or 100 terabytes, or a petabyte? And in that context, how do you compare yourselves with Cambridge Semantics? Oh, I will probably not be the best person to ask about the comparison with Cambridge Semantics, the Anzo people; we have other people on the call, maybe someone knows something about how we compare to Anzo. Yeah, I mean, to answer the size question, we are mostly talking about terabytes of data, not petabytes; that was the first part. In comparison to Cambridge Semantics, there's a lot of overlap with respect to RDF and SPARQL, but one thing they have is an in-memory parallel query engine, which is quite different from what we are dealing with, RocksDB and disk-based indexes; we don't use as much memory as they do, so there are different challenges and so on. And if you look, as Pavel mentioned at the very beginning, at the problem space of data integration, there are lots of differences at that level as well. But the main difference for the graph database part, what Pavel talked about today, is mostly in-memory versus not in-memory, I think.

I don't know anything about other graph optimizers, you told me about this stuff, but what about Neo4j or TigerGraph? Neo4j has a cost-based optimizer. Is there anything they do that you don't do and would want to add, or are you approaching the same problem in similar ways? So, the query language which Neo has, and I'm not so familiar with TigerGraph's query language, but for Neo4j's query language, a lot of what they do is similar: Cypher has similar semantics to SPARQL, it's a bit different, graph isomorphism versus homomorphism, but it is declarative, and when I talk to their engineers I get the impression that they do a lot of similar stuff. I never really felt like we need to borrow something from Neo4j; maybe they just didn't disclose enough when talking with me. But I suspect they do more on parallel query execution for large OLAP-style queries than we do, because I know that's one of the places where we could do a better job than we currently do; but certainly not in the cardinality estimation area.