The Carnegie Mellon Quarantine Database Talks are made possible by the Stephen Moy Foundation for Keeping It Real and by contributions from viewers like you. Thank you. Welcome to another Quarantine Database Talk. It's a beautiful day in Pittsburgh. Becca is in Chicago. It's also a beautiful day there, but it's still a pandemic, so we're stuck inside. So today we're excited to have Dr. Rebecca Taft from CockroachDB, who is the, what is your official title, Lead Architect? No, just a member of technical staff, I think. An awesome person at CockroachDB. Becca did her PhD with Mike Stonebraker at MIT. She worked with me on some H-Store projects as well. So again, the way we want to do this is if you have any questions for Becca, please just unmute yourself, say who you are, and ask your question. And as always, we want to thank the Stephen Moy Foundation for Keeping It Real for sponsoring this event. Okay? All right, Becca, the floor is yours. Go for it. Cool. Awesome. Thanks so much, Andy, for hosting me, and thanks everyone for joining. So as Andy said, I'll be telling you about CockroachDB's query optimizer. So for those of you not familiar with CockroachDB, it's a database that's designed with the mission to make data easy. As the name cockroach implies, it's resilient to disasters. So it's meant to survive up to even a whole data center or regional outage. It's also scalable, so it'll scale out as an application grows, or scale up. Its geo-distributed capabilities enable data to reside close to the users that are actually accessing it most frequently. Support for SQL and serializable isolation makes development of OLTP applications simpler, and it's also open source. Just to give you guys a sense of how customers are using CockroachDB, this is actually a real CockroachDB deployment of a company that has markets in Europe, Australia, and the U.S. And as you can see, by locating the nodes in each of those regions, they enable their customers to have low-latency access to their data. Also, by storing everything in a single database, they make it simpler on their application developers because they don't have to worry about any manual sharding. The database just takes care of all of that and provides strong transactional guarantees. And finally, you can see it's extremely resilient, because they can even lose an entire region and the application will be able to continue running. So in order to support these types of workloads, CockroachDB has a shared-nothing architecture, and it consists of a distributed SQL layer on top of a distributed key-value store. So in this talk today, I'm basically going to be focusing entirely on the SQL layer, but the KV layer is also really interesting, so I encourage you to stick around at the end. If there's time, we can talk about that, or check out the info online. We have a SIGMOD paper we recently published that goes into lots of detail about the key-value layer. There's also blogs and videos online. But anyway, as you can see, for the rest of the talk, I'm going to be focusing mostly on query optimization. So before I move on, are there any questions? Andy, go for it. You're good. All right. The chat questions were me, sorry. Oh, okay. Yeah, I can't see the chat questions. So yeah, feel free to jump in. Okay, so query optimization in CockroachDB. Since we support SQL, we need some way to convert that declarative SQL statement into a query plan that the database knows how to execute to return the results to the user.
So this could be an optimizer or kind of a more basic planner. But you might ask, there are plenty of open source optimizers out there, like Postgres, for example. CockroachDB speaks the Postgres wire protocol and the Postgres dialect of SQL. So it's not a crazy idea to use the Postgres optimizer. But there are a few reasons we decided not to do that. First of all, CockroachDB is written in Go and Postgres is in C. So there's some amount of overhead when translating between those two that we wanted to avoid. Second of all, the execution plans are going to be very different. Postgres is a single-node system, and as you saw, a CockroachDB cluster can span potentially thousands of miles. So a plan that might perform very well on a single node may perform very poorly when distributed across a large distance. And then finally, as you probably know, the difference in performance between the best plan and the worst plan that an optimizer can choose can be orders of magnitude. So this is a crucial component in terms of performance for a database, and as a database vendor, we wanted to have control over this piece of the system. All right. So CockroachDB's first optimizer was not really an optimizer. It was more of a heuristic planner. It used rules to choose the execution plan. And these rules might look something like this: if an index is available on the attribute of a filter condition, always use it. But the problem is it's hard to write rules that are going to be universally good. And as CockroachDB was used for more and more different types of use cases, these rules got more and more complicated, and over time started to look more like this: always use the index, except when the table is very small, or we expect to scan more than 75% of the rows, or the index is located on a remote machine. So as you can see, this is getting kind of unwieldy, difficult to manage. It's very brittle. And it kind of works for OLTP queries, simple CRUD queries where you're just reading or updating a small number of rows. But the problem is that our customers actually will run everything. They'll run all different kinds of workloads, both OLTP and OLAP, when comparing us against our competitors, when deciding which database to use. And then even in production, they might actually run some OLAP-y style queries, say at the end of the day for kind of an end-of-day reporting, that kind of thing. So we wanted to make sure that our performance didn't suck when running more OLAP-style queries. So by the time I joined the company about three years ago, the decision had already been made that the heuristic planner had outlived its usefulness, and it was time to build a real cost-based optimizer. So an optimizer, basically, instead of applying rigid rules, it considers multiple alternatives. It assigns a cost to each alternative, where the cost is kind of a unitless value that basically allows you to compare the relative cost of each of these alternatives. And then we choose the lowest-cost option. And for those of you who are actually familiar with query optimization: we decided to implement a Cascades-style optimizer with unified search. So if you're not familiar with that, not to worry, I'm going to give a lot more details in a second. So how do we actually generate these alternatives? We start with a default plan that can be kind of mechanically constructed from the SQL query.
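To make "mechanically constructed" concrete, here is a toy Go sketch of that default, canonical plan for the sample query used later in the talk. The operator types and field names are invented stand-ins for illustration, not CockroachDB's actual expression structs:

```go
package main

import "fmt"

// Toy relational operators; CockroachDB's real expression types are far
// richer, so treat these as stand-ins for illustration only.
type PlanNode interface{}

type Scan struct{ Table string }
type Join struct {
	Left, Right PlanNode
	On          string
}
type Select struct {
	Input  PlanNode
	Filter string
}
type Project struct {
	Input   PlanNode
	Columns []string
}

// buildDefaultPlan mechanically constructs the canonical plan for
//   SELECT * FROM ab JOIN cd ON b = c WHERE b > 1
// with no optimization at all: scan both tables, join them, wrap the
// WHERE clause in a select, and project the output columns on top.
func buildDefaultPlan() PlanNode {
	return Project{
		Input: Select{
			Input: Join{
				Left:  Scan{Table: "ab"},
				Right: Scan{Table: "cd"},
				On:    "b = c",
			},
			Filter: "b > 1",
		},
		Columns: []string{"a", "b", "c", "d"},
	}
}

func main() {
	fmt.Printf("%+v\n", buildDefaultPlan())
}
```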
And we perform a series of transformations, which basically transform the query plan from one plan to a logically equivalent plan. And then we store all of those alternatives in a compact data structure called a memo. So if you've taken Andy's class or you're familiar with query optimization, you already know what a memo is. But if not, the only thing you need to take from the slide is that we have a way of generating these alternatives and storing them efficiently. And then, you know, how do we assign a cost to these alternative plans? Well, there are a few factors that affect the cost: the hardware configuration, like, you know, which type of disk you're using; where the nodes are located; how the data is distributed across those nodes; what type of operators are in the plan, like, you know, select versus join versus scan; and then the number of rows processed by each operator. So I'm going to talk about all of this in detail. Any questions before we move on? So for the cost model quickly: do you do anything like Postgres does, where they have an initial cost that's like a cheap approximation, and then a final cost that you compute if you decide you need the rest of it? Or is it a single cost model that produces a single value? Yeah, we just have a single value right now. There are, you know, plenty of improvements that we want to make at some point, but for now it's a single value. And then, I don't want to get too deep in the Cascades wonkiness, but can you do cost model estimation on logical nodes, or does it always have to be a physical node? So we kind of merge the idea of logical and physical. So for example, you know, we don't really have a concept of a logical join. A logical join is basically a hash join for us. We kind of default to hash join. Okay. So, you know, we assign a cost to everything. Super interesting. Keep going. This is awesome. Cool. All right. So how do we generate these alternatives? This is kind of the flow of how we generate the alternatives in the optimizer. So I'm going to talk about each of these in detail. And for this section of the talk, I'll be working with this sample query you see here. So we have two tables, ab and cd. Columns a and c are the primary keys, and then we have a secondary index on column b. And we'll be optimizing this query: SELECT * FROM ab JOIN cd ON b = c WHERE b > 1. All right. So, start off with parsing. And I'm not really going to talk too much about parsing, because this is pretty much the same across most databases. We have a yacc file that's very similar to the Postgres yacc file, where the grammar is stored. So, basically, the parser takes the SQL string as input and outputs an abstract syntax tree. Like you see here, here I'm actually showing the specific Go structs that are produced as part of the parsing phase. Next, we have the optbuilder. And the optbuilder takes this abstract syntax tree from the parser, and its job is to call a bunch of nested functions, like you see here. And these functions will produce the preliminary query plan. It also performs semantic analysis as part of this process. So this is where we kind of check that the query makes logical sense. So even if it passed the parser, so it's grammatically correct in a way, it may not be logically correct. Like these English sentences you see here are grammatically correct, but they really make no sense at all.
So in the context of SQL, in the context of our specific query, we'll be checking things like: are ab and cd real tables in the database that the current user has permission to read? Do columns b and c exist in tables ab and cd, and are they unambiguous? Do their types match, so that we can actually perform this equality comparison? And what columns are selected by star? All right, moving on to normalization. For semantic analysis, at what point do you try to bind, if it's a prepared statement, at what point do you try to bind a prepared statement value to a type? That happens, yeah, during the optbuilder phase. We do kind of all the type checking during this phase as well. Okay, awesome. Thanks. Yeah, so normalization is actually happening kind of in parallel with the previous phase, the optbuilder phase. And that's because these function calls that I showed you before, these nested function calls, are actually factory methods that have been code-generated from a number of normalization rules that we've defined. So basically what happens is the optbuilder thinks it's constructing kind of the canonical plan from the SQL query. So in our example query, we have the optbuilder thinking that it's constructing this inner join wrapped in a select with the filter b > 1, followed by a project. But what it's actually constructing is this plan on the right. And that's because each of these factory functions has a bunch of normalization rules that will fire in the process of being called. So what has happened here, you can see, is that we've basically used the fact that b = c in order to infer that if b is greater than one, then c is also greater than one. And then we've pushed those filters down below the join onto whichever side is appropriate. So maybe this looks a little bit like magic right now, but let's talk more about how it works. One other thing I wanted to mention is that these factory functions are actually what makes this efficient. We never even materialize that first plan the optbuilder thinks it's constructing. We only end up materializing the fully normalized plan. All right. So normalization rules are one of two types of transformation rules. We have normalization and exploration, which I'll talk about a little later. And all of these rules create a logically equivalent relational expression to the one that they started from. Normalization rules specifically are kind of always a good idea to apply. So we don't even bother keeping around the old expression, because it's generally going to be strictly worse than the new one. For that reason, a lot of databases will call this phase a rewrite phase. And we actually have hundreds of normalization rules, but I'm showing a few examples here. For example, we eliminate unnecessary operations. So obviously NOT NOT x is going to be the same as x. It's going to be more efficient to execute. We don't have to execute NOT NOT x for each row. Canonicalizing expressions: you know, x = 5 is not actually going to be more efficient than 5 = x, but it's helpful to perform these canonicalizations so that other rules don't have to check both ways. They can basically just match on having a variable on the left-hand side. Constant folding is another important one that you've probably heard of. It basically allows us to avoid executing the same function over and over again for each row during execution.
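As a minimal illustration of constant folding (a toy sketch, not CockroachDB's implementation), a constant sub-expression like 1 + 2 in a predicate can be collapsed to 3 once at plan time instead of being evaluated for every row:

```go
// Expr is a toy scalar expression: either a constant leaf (Op == "")
// or a binary operator over two sub-expressions.
type Expr struct {
	Const       int
	Op          string // "+", "*", or "" for a constant leaf
	Left, Right *Expr
}

// fold recursively replaces any operator whose children are both
// constants with a single precomputed constant, so the work happens
// once at plan time instead of once per row at execution time.
func fold(e *Expr) *Expr {
	if e.Op == "" {
		return e // already a constant leaf
	}
	l, r := fold(e.Left), fold(e.Right)
	if l.Op == "" && r.Op == "" {
		switch e.Op {
		case "+":
			return &Expr{Const: l.Const + r.Const}
		case "*":
			return &Expr{Const: l.Const * r.Const}
		}
	}
	return &Expr{Op: e.Op, Left: l, Right: r}
}
```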
And it also enables us to fire other optimization rules later on, like, for example, constraining an index. And then these last two items, predicate pushdown and decorrelation of subqueries, are not always a good idea. I have this little asterisk saying, you know, they don't completely fit the definition of normalization rules. But it's just easier to kind of treat them this way, because there are very few cases where they're not going to be a good idea. For those of you not familiar, predicate pushdown basically means we want to push the filters as far down the query plan tree as possible in order to limit the number of rows processed by parent operators. And decorrelation of subqueries is, if you have a subquery that refers to the outer query, we want to try to hoist that up into a join or some other type of operation that's going to be more efficient to execute. So what do these rules look like in your code? Are they patterns that match the same way as in Cascades as you go down, or is this sort of some kind of separate syntax? Yeah, great question. So that is my next slide. We have this domain-specific language called Optgen, which is how we represent all of the transformation rules, both normalization and exploration. And even though this rule I'm showing here is very simple, this is the simplification of two nested NOTs into just the input of the inner NOT, all of the rules basically have the same format. So we have a comment at the top saying what the rule does. Then we have a header with the rule name and any tags that apply to the rule; this basically signifies that this is a normalization rule. Then we have the match pattern, which here will match any expression that has two nested NOT operations. This is where the input variable is bound to the input of the inner NOT. And then finally, the replace expression below the arrow is where we actually return the input. And this gets compiled into Go, which you can see here. This is actually kind of a simplified version of this rule. It's a little bit more complex, but we have this factory function, ConstructNot. You might recognize this from the type of functions that the optbuilder was calling earlier. And inside this function we have all of the rules that apply to the NOT expression. So this EliminateNot rule basically says: if there's a nested NOT when we're trying to construct a NOT, we will just take the input and return it. If we actually get through this entire function and none of the rules match, so we don't return early, then at the end of the function we memoize the NOT, we put the NOT expression in the memo, and we return it. And again, I'm going to talk about the memo in a little bit, but remember that is the data structure that stores our query plans. This is another Optgen rule, MergeSelects, another normalization rule. And the reason I'm showing you this rule is because it shows one other feature of Optgen, which is the ability to call arbitrary Go code. So this rule is taking two nested select operations and concatenating their filters. This ConcatFilters function, if we actually look at the generated code, is just another function defined elsewhere. So this allows us to basically call arbitrary Go code from the Optgen rules. Obviously there's some very specific Optgen syntax that is going to be useful in general, but we can also call custom functions as well. All right. Any questions on normalization before I move on? I know that was kind of a lot.
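To summarize the shape just described, here is a heavily simplified sketch of the rule and the kind of factory function it compiles into. It is reconstructed from the description in the talk, with stand-in types, and is not the actual generated CockroachDB source:

```go
// The Optgen rule sketched above:
//
//	# EliminateNot discards a doubled Not operator.
//	[EliminateNot, Normalize]
//	(Not (Not $input:*)) => $input
//
// compiles into a factory function along these lines:
type Expr interface{}

type NotExpr struct{ Input Expr }

type Factory struct{ /* memo lives here */ }

func (f *Factory) memoizeNot(input Expr) Expr {
	return &NotExpr{Input: input} // stand-in for interning into the memo
}

func (f *Factory) ConstructNot(input Expr) Expr {
	// [EliminateNot] If the input is itself a NOT, return its input
	// instead of building NOT(NOT(x)).
	if not, ok := input.(*NotExpr); ok {
		return not.Input
	}
	// No normalization rule matched: memoize the NOT and return it.
	return f.memoizeNot(input)
}
```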
Yeah, I have so many questions. So how much is the DSL, like, how Go-specific is it? Like, are there idioms of Go that would prevent us from codegen-ing to C++, for example? Like, yes, there's the ability to call arbitrary Go code, but you could implement that in C++ and that would then compile. Yeah, I think it would work potentially with other languages. I mean, you'd have to obviously rewrite the compiler. But yeah, I think the syntax would work. I mean, we don't take advantage of things like the multiple return values of Go or anything like that. We haven't done that. Yeah, I'll think about that. I can't think of anything. Well, let's do it now. Let's see what happens. Yeah, sounds good. All right. So exploration. Exploration is the other type of rule. These are rules that may or may not produce a better plan. So unlike with normalization, where we just replace the previous plan, with exploration we're going to keep both alternatives around. So these are things like join reordering, and selecting a specific join algorithm: hash join, merge join, lookup join. Lookup join, by the way, is sometimes called index nested loop join in other databases. I know that might confuse some people. Index selection is another one that is an exploration rule. So these rules use the exact same Optgen syntax as normalization rules. They just have that Explore tag instead of Normalize. All right. So I can't really go any further without actually describing the memo at this point. Remember, this is the data structure that stores our forest of query plan trees. And the memo looks like you see here. It consists of a series of groups. Here, I'm just showing the relational expressions. Each group is a different relational expression: join, select, scan. In reality, we also store the scalar expressions in memo groups, but I'm omitting them here just for simplicity. And this is showing the memo as it looks after normalization is complete. So you can see it matches this plan on the right, with our hash join and the two filters pushed down. There's going to be one expression for each group at the end of normalization. As we explore, we're going to add more expressions to each group. And all of the expressions in a single group are logically equivalent. Another thing to note about the memo is that groups can refer to other groups. So for example, in group one, we're performing an inner join between the expressions in groups two and three. Any questions here? All right. So to perform exploration, we are basically going to be iterating through each of the groups and seeing if there are any exploration rules that match. So the first rule that matches is GenerateIndexScans. And this basically allows us to create an alternate scan in which we're scanning the secondary index as opposed to the primary index. So remember we have the secondary index on column b. So this rule causes us to create a scan on that index. The next rule that matches is GenerateConstrainedScans, which basically means that we can actually push the filter down into the scan. So instead of performing a full table scan, we're only performing a partial table scan starting from two and going forward, because we have b > 1 and b is an integer. So starting from two and up. The same rule matches on cd. We can scan the primary index, since it's keyed on c. Next, we have a ReorderJoins rule that matches. So we're basically just swapping the order. You can see that the groups two and three are swapped to be three and two in the second expression.
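Before the remaining exploration rules fire, here is a compact Go sketch of the memo structure being described. The types are invented for illustration and much simpler than CockroachDB's real memo:

```go
// A toy sketch of the memo: groups of logically equivalent expressions,
// where an expression's children point at groups, not at concrete
// expressions.
type GroupID int

type MemoExpr struct {
	Op       string    // e.g. "inner-join", "select", "scan ab@b_idx"
	Children []GroupID // operands refer to whole groups
}

type Group struct {
	Exprs []MemoExpr // all logically equivalent alternatives
}

type Memo struct {
	Groups []Group
}

// AddExpr registers another equivalent alternative in an existing
// group, which is what an exploration rule does when it fires.
func (m *Memo) AddExpr(g GroupID, e MemoExpr) {
	m.Groups[g].Exprs = append(m.Groups[g].Exprs, e)
}
```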
Next, GenerateMergeJoins matches. Merge join is basically when both inputs are sorted on the equality condition. We can just perform it almost like a merge sort, where we only need to scan each table once, and that's the complexity of the join. And finally, we have another rule that matches, GenerateLookupJoins, which, as I mentioned, is like index nested loop join, where we're iterating through one side and looking up into the index of the other side. So kind of like a hash join where the hash table is already built. And there are actually even more expressions that we can generate in the first group, but I think you get the idea by now. And as it turns out, after we cost all these different expressions, the best plan is going to be to perform a merge join with these two constrained scans here. So that's the plan that's going to get passed on to the next phase. And I'll show you how the costing works a little bit later. But that is the exploration phase. Any questions? So on your cost model: the Snowflake guys last week talked about their cost model, and they don't run ANALYZE. Do you support ANALYZE, like online statistics collection, or are you relying on, I mean, I guess they're more OLAP-y stuff, so they're getting bulk loads. You guys have to maintain these stats all the time. Yeah. I guess my question is, how good do you think your cost model actually is? And obviously it depends on how often you run ANALYZE. But is there any magic to that side? Like, are you using sketches instead of histograms? Are you using some of the more fancy data structures to do summarizations, or is it just sort of textbook histogram stuff? Yeah, I'm going to talk about all that in a second. Okay, awesome. Sorry. We do some of that. All right. So the last phase is DistSQL planning. And this is actually not really part of the optimizer. It kind of happens after the optimizer is done. And it basically takes the plan from the optimizer, which, as you saw, was basically designed for a single node to execute; this merge join and the scans are sort of just single-node operations. And it extends them to whatever the cluster topology is. So you can see that the DistSQL planner will figure out that we need to scan cd on nodes one and three, and we need to scan the secondary index on ab on node two. Now, this is obviously something that we want to move into the optimizer at some point. There are plenty of opportunities for optimization here. Right now, it's sort of like a heuristic planner just for DistSQL planning. But, you know, for example, something that we could do is, instead of just shuffling both tables here, like you can see with all these lines criss-crossing where we're basically shuffling both tables by hash, we could instead maybe broadcast one table if we knew it was smaller, or take advantage of how the tables are already laid out on disk. So that's something that we're hoping to do relatively soon. All right, choosing a plan. So this is going back to what you were asking about, Andy. You know, I mentioned that we have all these different factors that affect the cost: hardware, data distribution, type of operators, and number of rows processed. So we're currently not doing a lot with information about the hardware configuration or data distribution. And this kind of relates to us wanting to bring more of the DistSQL planning into the optimizer.
Because right now, we only use knowledge of data distribution in a few very select cases, which I'm going to show you in a sec. But in the next release, there's actually a big push to try to make the optimizer more aware of data distribution. The type of operators matters, obviously. It's going to be a lot more expensive to perform a scan than a select, because the scan may have disk I/O, whereas the select is just going to be a CPU operation. But the way we deal with this is basically to assume that the relative cost is not going to change a whole lot between different queries. So we performed a number of micro-benchmarks to try to understand the relative cost of different operators, and we have those parameters hard-coded into our cost model. Obviously, that's something that we also might want to improve in the future, but it's working okay for the time being. So this last thing, the number of rows processed by each operator, that is of course going to change query to query, and depending on the data in the database. And so that's what I'll talk about next. So to find the number of rows processed, of course, we use statistics. And we collect the following stats on each table. We collect row counts. For each column, we collect the distinct count and null count. And you were asking about sketches: we use HyperLogLog to estimate the distinct count. The null count is basically an exact count. And then for some columns, we also collect a histogram, where we basically just do sampling to collect the histogram. And we use these stats, and estimate how they change as the data flows through the execution plan, to estimate how many rows each individual operator is going to process. So you said you collect histograms for some columns; obviously for text columns you don't want them, histograms are useless there. What are the other factors that decide whether you want a histogram or not? So at the moment, we are just collecting them on index columns. Actually, you know, we're constantly changing things, so I realize that as of this coming release, we're collecting histograms on all columns. But for most columns, we only collect basically a two-bucket histogram, which basically corresponds to the maximum and minimum values. We collect bigger histograms, with like 200 buckets, for index columns, since that's sort of a hint from the user that those are going to be columns that are likely to be used in query predicates. Okay, awesome. Thanks. Yeah. And, you know, I'm going to talk about this in a little bit, but we currently only collect histograms on individual single columns; we're hoping to expand that to multiple columns. All right. So let's look at how we calculate stats for our sample query from the last section. So suppose we have this data about tables ab and cd. This is again collected offline, and I'll talk about how we do that in a second. But suppose we know that ab and cd each have 4,000 rows, and we have these histograms on columns b and c. It's kind of weird looking, because this is my attempt to show what an equi-depth histogram looks like. Equi-depth histograms are a little bit better suited to estimation than equi-width histograms. So the way we propagate these stats up the tree is we basically will realize that, you know, for group three, we're applying the filter c > 1 to our table cd. So we can basically just cut off everything below that in the histogram and use that to estimate the number of rows, 1,500.
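Here is that histogram cut-off as a small Go sketch. It is a toy illustration with invented types; a real implementation, as noted in the comments, would also interpolate inside the boundary bucket:

```go
// Bucket is one bucket of a toy equi-depth histogram.
type Bucket struct {
	UpperBound int // inclusive upper bound of the bucket's range
	NumRows    int // rows falling in this bucket
}

// estimateRowsGreaterThan estimates how many rows satisfy "col > bound"
// by keeping only the buckets above the boundary. A real implementation
// would also interpolate within the bucket that straddles the boundary
// rather than dropping or keeping it wholesale.
func estimateRowsGreaterThan(hist []Bucket, bound int) int {
	rows := 0
	for _, b := range hist {
		if b.UpperBound > bound {
			rows += b.NumRows
		}
	}
	return rows
}
```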
And we do the same thing on column b. We can estimate the number of rows as 500. And then finally, performing a join is sort of like an intersection of those histograms, more or less, not exactly, but with this we can determine that we have 500 rows. So hopefully this gives you a sense of how we calculate how many rows are processed by each operator. And then we apply our cost model, with all of our relative costs for different operators, in order to figure out an actual cost for each of these expressions. And then the last thing I'll say about stats is multi-column stats. In addition to collecting stats on individual columns, we also collect stats on multiple columns. So this is useful for cases where columns are correlated. For example, if we have a predicate where state equals Illinois and city equals Chicago, the way we would normally estimate the selectivity of this predicate would be to calculate the individual selectivities and then multiply them together, assuming independence, which obviously is a very bad assumption for this particular case. So we'd say state = 'Illinois' has selectivity one over 50, and city = 'Chicago', say our database has 1,000 cities, would be one over 1,000. Multiplying them together is going to give us a selectivity that's way too small; the worked arithmetic at the end of this section makes that concrete. So ideally multi-column stats would help us fix this problem. But the problem is, how do you know which sets of stats to collect? If you think about it, the power set of possible columns is exponential. So again, we've kind of taken advantage of the indexes as a hint for what types of predicates we're likely to see in different queries. So in addition to the individual columns a, b, and c, we also collect multi-column stats on index prefixes, because all indexes in CockroachDB are ordered, or clustered, indexes. So you can use the index as long as a prefix of the columns is constrained. So we collect multi-column stats on the set (a, b) as well as (a, b, c). And currently we're only collecting distinct and null counts, although hopefully multi-column histograms will be coming soon. Any questions there before I move on? Hi, this is Lin. Rebecca, so I have a question: when you collect the multi-column stats, so maybe I missed this in your slide, but what if you perform your join on two tables, right? You have a join predicate on columns from two tables. How do you calculate the stats there? Are you going to collect multi-column stats across tables? That's a great question. Yeah, right now we're only collecting multi-column stats on a per-table basis. So, you know, those kinds of join conditions are notoriously difficult to calculate, and I think right now we're probably way underestimating the selectivity of those types of join conditions. So, in that case, that would just be a multiplication? Exactly, yeah. Yeah, another small question is, how often do you update those stats? Do you run ANALYZE, or do you have some incremental online algorithm to update them? Yeah, that's what I'm going to talk about next. So, good lead-in. So, yeah, we have this command, CREATE STATISTICS, which can either be run automatically or run manually. So you can, as a user, just run CREATE STATISTICS s FROM t. We also support ANALYZE t, which is just syntactic sugar for CREATE STATISTICS. It's the same syntax as Postgres. And what that does is it basically triggers a full table scan of whatever table you're collecting stats on.
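Here is the Illinois/Chicago arithmetic from above as a small runnable sketch, using the toy numbers from the talk (50 states, 1,000 cities), not real statistics:

```go
package main

import "fmt"

func main() {
	// Toy numbers from the talk: 50 states, 1,000 cities.
	selState := 1.0 / 50.0  // state = 'Illinois'
	selCity := 1.0 / 1000.0 // city = 'Chicago'

	// Independence assumption: multiply the individual selectivities.
	fmt.Println(selState * selCity) // 0.00002, i.e. 1/50,000: far too low

	// Every Chicago row is also an Illinois row, so the true combined
	// selectivity is about selCity alone, 1/1,000. A multi-column
	// statistic on (state, city) captures exactly that correlation.
	fmt.Println(selCity) // 0.001
}
```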
And here I'm showing the DistSQL plan for the stats collection. So all these TableReaders are basically scans of the table, spread out across the five nodes that we have here. Then we perform sampling in the following step, where we collect a sample of 10,000 rows. And we also insert each row into our HyperLogLog sketches to calculate distinct counts for each column. Then these samples are aggregated, and finally returned and stored in a system table that stores the stats. So we also run this command automatically. We collect stats automatically whenever a table is created, or a new column or index is added, because we want to collect stats on that column and take advantage of the index to, you know, either create histograms or multi-column stats. And we also want to collect stats when approximately 20% of the data in a table has changed, so that the stats never get too stale. So there are some challenges with this. First of all, how do we even know when 20% of the data has changed? And maybe this seems kind of obvious, but it's actually not when you think about the fact that CockroachDB is distributed and you can have updates on multiple nodes happening at once, you know, lots of unrelated updates. And we want to avoid having a centralized counter that could be a source of contention. So this is one challenge. We also want to make sure we minimize performance impact, because this is running in the background when we have production workloads happening in the foreground. We want to make sure we don't impact those workloads. So how do we know when 20% of the data has changed? We actually use a statistical approach, where each node can kind of operate independently. So after a mutation on that node, we have kind of a thread that's constantly running in the background, and the mutation will tell that thread, you know, I updated some rows, and the thread will basically roll the dice and, based on this formula, potentially trigger a stats update. So the formula is: the probability of a refresh is the number of rows updated or deleted, divided by the number of rows in the table times 0.2. So that's how we get 20%: on average, the table is going to be refreshed after 20% of the rows have changed. There's a small sketch of this dice roll at the end of this section. Although, you know, it's obviously statistical, so there may be some outliers. So to avoid any problems, we'll always refresh if there are no stats yet or if it's been a while since the last refresh. And to minimize performance impact: as I said, there are going to be many updates per second, and each node is kind of operating independently. Each CREATE STATISTICS run is going to take on the order of minutes, because we throttle it, as I'll talk about in a second. So if each node was just kind of triggering stats updates whenever it wanted to, we could potentially have multiple stats runs happening at once. A full table scan can impact performance, but many table scans at once can actually bring down a cluster. The execution team is working on fixing that problem, but it's something that we want to avoid. So how do we avoid this? Well, what we do is we run this CREATE STATISTICS command as a job. And we have a jobs framework that's used for long-running types of operations like backup, restore, import, export, schema changes, and so on. So we take advantage of that infrastructure, because it allows us to guarantee that there's only one stats job running at a time across the entire cluster. And it's also resilient to node failure.
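Here is the dice-roll refresh heuristic from a moment ago as a small Go sketch. It is a toy illustration of the formula stated in the talk, not CockroachDB's actual code:

```go
import "math/rand"

// maybeRefreshStats sketches the dice roll described above: after a
// mutation touches rowsChanged rows of a table with tableRows total
// rows, refresh with probability rowsChanged / (0.20 * tableRows).
// On average, a refresh fires once about 20% of the table has changed.
func maybeRefreshStats(rowsChanged, tableRows int) bool {
	if tableRows == 0 {
		return true // no data or stats yet; always refresh
	}
	p := float64(rowsChanged) / (0.20 * float64(tableRows))
	return rand.Float64() < p
}
```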
So whichever node is managing the CREATE STATISTICS run, if that dies, another node can then adopt the job. And then, even with all of this assurance that we're only running at most one stats job at a time, that can still impact our production workload. So we throttle the stats collection to limit CPU utilization, to avoid overloading the cluster. And this is adaptive. So if we detect that the CPU is mostly idle, then we'll use more CPU for stats. If it's already being used by another workload, we'll tamp it down and, you know, insert additional sleeps basically into our scan of the table. All right. Any questions about stats? Go ahead. Keep going. We're good. Cool. All right. How are we doing on time? 20 minutes, plenty of time. Oh, okay. Great. So, you know, I've said a few times that we're not as aware of locality as we would like to be. That's something we want to improve, since geo-distribution is obviously a key differentiating factor of CockroachDB. But, you know, we do take it into account in a couple of places, which is important because network latencies and throughput matter in geo-distributed setups. And the way we take this into account is that customers can optionally duplicate read-mostly data in each locality. So if they have a table that is updated very infrequently and is likely to be referenced from multiple different regions, we suggest that they actually create a copy and store it in each region. And at the moment, this is kind of manual on the part of the user. We're working on making that more transparent. But at the moment, the way this works is, let me go to the next slide. So you have this table, postal codes. This is one of these kind of read-mostly tables that is going to be updated very infrequently and maybe accessed from anywhere. So what we do is we have the primary index, which has ID and code. And then we create two secondary indexes that are identical to the primary index. So, you know, they're keyed on the ID, and they store the column code. And then we can pin each of these three identical indexes to each of the three different localities. And then the optimizer will basically know to choose whichever index is closest to the gateway node where the query originated. So we have that in our cost model: all things being otherwise equal, when choosing between identical indexes, we'll pick whichever index is closer. So, you know, this gets complicated if you, say, add a column and forget to update your other indexes, or you add a region and you forget to move one of the indexes there. So this is something we're working on improving. And that's going to be kind of a new concept called global tables, which are going to have fast read access in basically every locality and slower write access. We're also working on supporting geo-partitioned unique indexes. So this is kind of another important feature to improve our multi-region offering. Basically, if you think about the way we'd support a unique constraint today in a single region, you would just have a single unique index on whatever attribute you wanted to have a unique constraint on. But that gets complicated when you try to partition by region, because then you can't ensure just with the index that the value is unique. So we're adding some extra functionality there. I also mentioned that we're working on moving some of the DistSQL planning into the optimizer, and just generally incorporating latency more into the cost model.
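As a tiny illustration of the "all else equal, prefer the closer index" behavior in the cost model, here is a toy sketch. The penalty constant and function shape are invented for illustration; they are not CockroachDB's real cost model:

```go
// indexScanCost is a toy version of "all else equal, prefer the closer
// index": among identical candidate indexes, penalize remote ones. The
// penalty constant is invented; it is not CockroachDB's real value.
func indexScanCost(baseCost float64, indexRegion, gatewayRegion string) float64 {
	if indexRegion == gatewayRegion {
		return baseCost // local read, no penalty
	}
	const remotePenalty = 10.0
	return baseCost * remotePenalty
}
```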
So lots of work to do here, but exciting, exciting stuff. All right. And then to finish off, I thought I would spend a little bit of time talking about theory versus practice. When I left grad school, I think I had a concept of what query optimization was, which, you know, wasn't totally wrong. But I think the emphasis turns out to be a little bit different, especially when you're working with an OLTP database. So when we first created version one of the optimizer, remember, we already had a heuristic planner before that, so we wanted to make sure that all of the workloads that were running really well with the heuristic planner didn't suddenly have a huge regression once we turned on the optimizer. Because even though the heuristic planner obviously has a lot of problems, it's really fast. So we spent a lot of time trying to minimize overhead for simple OLTP queries. You know, these are like the primary key lookups and updates where you really don't need a lot of optimization. And in order to do that, we took advantage of logical properties, which are actually essential for all parts of optimization. So these are things like cardinality, which is different from stats. Cardinality here means things that we can statically prove about a particular operator in terms of what its maximum and minimum numbers of rows are. So for example, if you have a limit, say LIMIT 1, the maximum number of rows output by that operator is one. Similarly, if you have a predicate that contains a contradiction, we can just kind of constant-fold that to false, and that will return zero rows. Last week, in the Snowflake talk, they were talking about how they could kind of take advantage of their stats, which are always perfectly accurate, to perform these types of optimizations. Our stats are not 100% accurate, but there are certain things that we can statically prove that we can take advantage of. Functional dependencies are important; things like the key of a table always functionally determines all of the remaining columns. Figuring out which columns are never null is important, and the list goes on from there. Another thing with an OLTP system in particular is that normalization rules are really important. So we have, as of today, I checked, 242 normalization rules and only 29 exploration rules. Part of the reason is that every exploration rule is going to potentially increase the size of the search space, which is just going to slow down planning time. Normalization rules, since we never keep around the old versions, are generally always good to have. And then lastly, I never realized this, but you can actually get a benefit by optimizing things like foreign key checks and cascades. So if you think about it, a foreign key check is something like: you insert a row into a child table, and you need to check that a particular column, you know, the column with the foreign key reference, actually has a value in the parent table. And we do that by creating a join between the two tables. And joins can be optimized. We can choose a particular join algorithm or a particular index. So we spent a lot of time actually pulling all of that code that was previously on the execution side into the optimizer. So it's like, I insert into the child table, and then you run a select query that joins against the child table and the parent table and the key you just inserted? Yeah, exactly.
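Here is a hypothetical example of the join such a foreign key check can be planned as. The table and column names are invented, and this is generic SQL to illustrate the idea, not CockroachDB's internal representation:

```go
// fkCheckQuery is a hypothetical example (invented table and column
// names) of the join a foreign key check can be planned as. Because the
// check is expressed as a join, the optimizer can pick a lookup join
// into the parent's primary index, a hash join, and so on. Any row the
// query returns is a constraint violation.
const fkCheckQuery = `
SELECT inserted.parent_id
FROM inserted_rows AS inserted
LEFT JOIN parent ON inserted.parent_id = parent.id
WHERE parent.id IS NULL
`
```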
So this is after the query is run, but before, like, the transaction is over. We have the concept of a post-query. And depending on what the type of check is, you can either say: if it returns any rows, that's going to cause an error, or if it doesn't return rows, that's also going to cause an error, depending on what we're checking. Oh, so you do it at commit time, not on the critical path of the query that did the insert? Right. Yeah. You know, if it's part of a multi-statement transaction, I think we still do it after the specific statement. We don't wait all the way until commit time, but that's something we could consider doing in the future. I'm thinking. Yeah, actually, yeah, it makes sense. Right, because if it's an update, it's not like an insert where you know what you just put in; if you just modified it, you don't know what happened. Okay, okay, makes sense. Cool. All right. All right. So join ordering is kind of the classic, you know, Holy Grail of query optimization. I think everybody rightly assumes it's one of the most important problems, and it receives the most attention. But believe it or not, we almost shipped v1 of our optimizer without any join reordering whatsoever. In the end, we did actually implement two exploration rules, CommuteJoin and AssociateJoin, which basically just implement the commutativity and associativity relationships between joins. So this is actually really inefficient, but it was a good way to kind of fast-and-dirty get some join reordering into our first version of the optimizer. And because it's so inefficient, we limited it to at most four tables by default. You could change that if you wanted more, but it would increase the planning time. So this summer, actually, we had an awesome intern who fixed our join reordering to use the, oh, what happened? I lost my screen. Sorry. Zoom refreshed my screen, too. It's annoying. Yeah. Yeah. Anyway, so he implemented DPSube from this SIGMOD 2013 paper you see here. And it basically, as it says, performs a correct and complete enumeration of the core search space. So it enumerates basically all of the plans you might want to consider, kind of excluding plans with a lot of cross joins, but it enumerates bushy plans and all that. And it's a lot more efficient than what we had before. So now we reorder up to eight tables by default. If you have a query with 12 tables, we'll basically just reorder a subtree of eight tables. So, I didn't point out that the first author of that paper, first of all, they're Germans, but that's Thomas Neumann's advisor. Cool. And there's a book he's writing on query optimization. Awesome. Is it, I think this might be the algorithm that HyPer uses, or, no, HyPer doesn't use, this is Cascades, right? Well, so it's a dynamic programming algorithm. We're still doing Cascades, but as part of this join reordering, we sort of, you know, almost pull the tree out and create a hypergraph. Okay. I think this is, this is 2016, or what year is it? 2011? 2013. But I think there's a newer one from HyPer that builds on top of this. Okay. Cool. Yeah. So yeah. So another important thing is the query cache, which I think doesn't get talked about a lot. We have an LRU cache keyed on the SQL string, and it stores the optimized memo.
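A minimal sketch of such a cache, with stand-in types rather than CockroachDB's implementation:

```go
import "container/list"

// Memo stands in for the optimized (or, for prepared statements,
// normalized) memo described earlier.
type Memo struct{ /* groups of equivalent expressions */ }

// PlanCache is a minimal LRU cache keyed on the SQL string.
type PlanCache struct {
	maxEntries int
	order      *list.List               // front = most recently used
	bySQL      map[string]*list.Element // SQL string -> list element
}

type cacheEntry struct {
	sql  string
	memo *Memo
}

func NewPlanCache(max int) *PlanCache {
	return &PlanCache{
		maxEntries: max,
		order:      list.New(),
		bySQL:      make(map[string]*list.Element),
	}
}

// Get returns the cached memo for sql, if any, and marks it as
// recently used.
func (c *PlanCache) Get(sql string) (*Memo, bool) {
	if el, ok := c.bySQL[sql]; ok {
		c.order.MoveToFront(el)
		return el.Value.(*cacheEntry).memo, true
	}
	return nil, false
}

// Add inserts or refreshes an entry, evicting the least recently used
// entry when the cache is full.
func (c *PlanCache) Add(sql string, m *Memo) {
	if el, ok := c.bySQL[sql]; ok {
		c.order.MoveToFront(el)
		el.Value.(*cacheEntry).memo = m
		return
	}
	c.bySQL[sql] = c.order.PushFront(&cacheEntry{sql: sql, memo: m})
	if c.order.Len() > c.maxEntries {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.bySQL, oldest.Value.(*cacheEntry).sql)
	}
}
```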
And for prepared statements with placeholders, it stores a normalized version of the memo. So then to actually execute the prepared statement, we basically replace the placeholders in the normalized memo and then perform additional normalization and exploration. So we can take advantage of some optimization up front, and then we perform the rest after replacing the placeholders. Another thing that's important is optimizer hints, because, you know, the optimizer is not always going to choose the correct plan. So you might want to be able to just hint to the optimizer, you know, use this specific index, for example. We have an @index_name syntax that allows you to do that. And you can also hint a particular join algorithm. So, you know, INNER HASH JOIN if you want to force a hash join, or similar for merge and lookup. And this also allows you to hint a particular join order, because as soon as you hint an algorithm, we're not going to bother changing the join order. So you guys try to support the Postgres dialect, but, like, these hints are awesome. I like this syntax, but I don't think Postgres supports this kind of stuff. Yeah, no, this is not from Postgres. This is specific to you guys. Right. Right. Exactly. This is good. I like this. Cool. Yeah, I think we had somebody who did a project where they were actually, you know, telling the optimizer which plan to create by giving it different hints and then, you know, doing some kind of learning optimization based on that. I don't know, something cool like that. But all right, another important thing is debugging tools, because even with, you know, all of these tools, our customers still occasionally say, you know, why is my query slow? And we have to figure out why. So the customer runs EXPLAIN ANALYZE (DEBUG), or, you know, there's also something they can click in the UI, and it generates this bundle that we get, which has basically everything we need in order to recreate the query. It has the JSON with all the stats, the schema, the environment variables, various levels of verbosity of the plan, as well as the original statement. And with this, we can basically recreate the plan and try to figure out what's going on. So that is all I had. Final note, we are hiring. So definitely come talk to me or check out our careers page. We're also open source and we're always accepting contributions. And with that, I'll be happy to take any questions. Awesome. So I'll applaud on behalf of everyone else. I've asked a lot of questions, so we can open the floor up to anybody else that wants to ask questions. Otherwise, I'll just keep going. May I ask? You don't have to ask, just go for it. Sure. Thanks a lot for the very interesting talk, Rebecca. I actually have two questions here. First is, I'm curious about the query cache. You said you only cache the memo, and then you will do the normalization and the exploration again, right? But is it true for every query? Even for a simple OLTP query, you'll still do all the normalization and exploration again? Yes, that's true. So I, you know, sorry, what were you going to say? I was just going to say, I mean, the OLTP query itself may just run in, I don't know, milliseconds, right, or a few milliseconds. So if you're going to do the normalization and exploration again for all the OLTP queries, that sounds a bit slow, right? Yeah.
So the problem is that there are certain types of optimizations that we can't really do until we have the specific value for the placeholder, like constraining the index, for example. And actually, that is something that we've talked about doing in the past: basically just kind of assuming a default value for the placeholder and constraining the index, so that we have a fully normalized plan, and a customer could just set a cluster setting so that it's always going to use that default plan and not bother optimizing for the prepared statements. Yeah, that's definitely something that we've considered, and, you know, if there's enough demand for it, we might actually do it. I see. Another question I have is, you said you want to bring the optimization for the distributed query into the optimizer. I'm wondering, at what stage do you mean? Do you think the data distribution information would affect normalization? Probably not, right? Would it affect exploration, or is it after normalization and exploration that you think would be the place to add the optimization for the distributed query? Yeah, great question. I think probably it would make the most sense at the exploration phase, because, you know, that's when we might explore different variations on join algorithms, like I was talking about. Instead of just always doing this kind of shuffling by hash value, we could do a broadcast join, or some sort of range partitioning, something along those lines. Yeah, and, you know, it's also going to come into play in the cost model, obviously, which doesn't get used until the exploration phase. So, you know, even once we bring the DistSQL planning into the optimizer, we're not going to actually perform full DistSQL planning at that stage. Things like very specific node identifiers might not be that useful in the optimizer. We might just care more about which regions we have, and the specific nodes we can leave until after optimization. So that's something that we still need to work out. Right, makes sense. Yeah, thank you. Yeah, thanks. All right, anybody else have any other questions for Becca? Okay, let me go. How do you guys test this thing? So, how do you, like, make sure that, one, you don't have any regressions, but also that your cost model is, you know, correctly guiding the search to optimal plans? Like, that the cost model is correctly identifying that one plan, you know, the cost model says this plan is better than this other plan, and then when you run it, that's actually true or not. So, like, the Orca paper, for example, talks about this thing, TAQO, where they're running these things all the time: they pick the best plan and the second-best plan and prove that the ordering is correct. Like, what kind of automation do you guys have set up for your infrastructure? Yeah, good question. So, we don't have a lot in the way of proving that the plan that's chosen is the best plan. I think that's something that we probably will want to add in the future. I mean, most of it's about proving correctness. So, we have this kind of data-driven testing framework, which is really nice. Basically, we, you know, have a whole bunch of queries.
I think it's similar to maybe what the SQLite testing does, where, you know, you have a whole bunch of queries and it outputs the expected results. We also do that with the query plans; you know, we can expect certain query plans. But that's, like, testing the correctness of the plan generated. Like, this is more, how do I test whether the cost model is guiding the search correctly to the optimal plan in some way? Right, right. Yeah, you know, we have some tests that we used to basically tune the cost model, you know, like I was talking about with the relative cost of the different operators. I'm trying to think if there's something that we kind of run on a more continual basis. I don't know if there are any Cockroach people on the line; feel free to jump in if you can think of anything. That's okay. And then for that little bundle thing, for the debug bundle, again, there is an Orca person here on the call, and that's why they were messing with me, so I apologize for calling them out to everyone. The Orca paper talks about how, because their database system is not running in the cloud, it's running on customer sites, if there's a problem with the query optimizer, you get back a debug bundle. And in your case, it looks like you're getting all the input parameters that were used for the query. And is that enough to debug everything you need, or would you like to have some other functionality that you guys currently aren't shipping? Right now, that's enough to debug everything we need. Once we start adding more to the cost model to, you know, make it more aware of the cluster topology and how the data is laid out, we're probably going to need to collect more information than that. But as of right now, we can exactly reproduce the plan. To answer your previous question, which I was just kind of thinking about in the background: you know, we do run standard benchmarks like TPC-H and TPC-C, and we can kind of track the fact that since we added this improved join reordering, for example, over the summer, our TPC-H results are much better than they were before. And we make sure that certain plans don't regress. Like, you know, we say we know this particular plan is good, and if we see that it's changing, we try to figure out why. Do you guys run a SQL fuzzer? Like something like SQLsmith? Yeah, yeah, yeah, we have SQLsmith. But, you know, that is more kind of just catching errors, like if it's going to cause an internal error or crash or something like that. But yeah, actually, Manuel Rigger, he's the guy that was doing more kind of logical correctness testing. Yeah, exactly. He did a bunch of experiments with CockroachDB, and he opened a bunch of GitHub issues for us, which was just pretty great. SQLancer is awesome. Yeah, we're trying to get it running in our system. In general, what's the complexity of the queries that you're seeing? And I understand that, like, that might be sort of self-selecting, because, like, CockroachDB is not primarily, like, a Snowflake or Vertica data warehouse. So, therefore, the kind of queries people might be running on it, like, the complexity might be limited, because they might push that work to something else. Like, would you say you're seeing things at the level of, like, TPC-H, is that very common? Or even, like, TPC-DS? Are you seeing things that are more complex? That's a good question. You know, we have some telemetry.
We have, so, you know, we now have these two products. We have an on-prem database product and a cloud product. With the on-prem clusters, you know, by default, they will send us telemetry information, but customers can turn that off. So, you know, we're never sure exactly if we're getting everything. I think with cloud, we just collect telemetry on everything. So, you know, as that business grows, we'll get more information. You know, for example, we collected telemetry on the maximum depth of the join tree, and the vast majority are, you know, one, two, three tables, and then it just kind of falls off exponentially from there. You know, we've definitely seen people run 14-way joins on CockroachDB, but it's very uncommon, I would say. So, I mean, this is a useless metric, the number of tables, but I felt like there was a while when I would talk to different vendors, like, oh, you know, how good is your query optimizer? And for whatever reason, they would always boast about the number of tables they can join. Like MemSQL, it's like 75, Splice Machine was 135, and the HANA guys were like 1,200 or whatever their number was. Like, again, what you're describing to me is what I think is more realistic: most people are running, like, three-table joins. Yeah, yeah. Right, exactly. Which is why, you know, we were kind of okay with having our suboptimal join reordering algorithm for a while, and, you know, our new algorithm is much better, but it still defaults to reordering at most eight tables. Okay. All right, so the last question I'll ask, and if anybody else has any questions, please interrupt. I'm going to ask you a variation of the question I normally ask other people, which is: how stupid are your users? But I'm specifically asking in terms of the normalization rules you run, like NOT NOT x, or 1 = 2. Like, you obviously want to have them because they show up, but how common are those sort of double negatives, or the clearly, you know, always-false predicates? Yeah, typically those types of rules are going to be more useful for, for example, ORMs, which will generate really weird queries. And also, our own normalization rules, when they fire, might produce some weird, you know, construct that we then want to further normalize away. Yeah, all right. Yeah, I wouldn't say our users are writing these bizarre queries; it's more just the kind of automation. Actually, the last question too is, like, are there any restrictions on those rewrite rules in terms of the complexity that they're allowed to have? Like, you mentioned, oh, I can invoke a function that's in Go. Like, how do you make sure that that function doesn't compute pi or some crazy shit like that? You know what I mean? Like, because I can imagine a scenario where you maybe want to go look in the catalog and get some additional metadata that could help you do the constant folding or some other rewriting, but that might be an expensive call in the environment. Yeah, that's a good question. I mean, I think we're sort of just relying on the fact that, you know, these are not user-defined functions. We don't allow users to just create this code. This is all, you know, either written by us or written by contributors, and, you know, we're going to review that code. Yes. But yeah, that's a good point. There's probably a risk of something like that happening. Okay. All right, the last question, I guess, it's an easy question.
CockroachDB is hiring. How hard is it if someone wants to join the query optimization team? Like, first of all, are you guys hiring in that space? Yeah, I mean, I don't think the optimizer is the team that's hurting the most right now for people. I think we could definitely use somebody. There are only three full-time engineers on the optimizer team right now. But there are only two full-time engineers on the SQL execution team right now, and, you know, other teams that are understaffed. So we need people everywhere. But no, I always say in the class, Becca, you're ruining this for me. I always say, like, if you do query optimization, you'll have a job tomorrow. Like, if someone comes in and is a student that's really good at query optimization, you would hire them, right? Yeah, probably. There we go. That's all I want. Okay. Awesome. All right, Becca. For sure, we're, you know, hiring interns as well. Okay. Awesome. All right, Becca, thank you so much for spending your afternoon with us. I'm glad you're safe and healthy in Chicago. And this was awesome. This talk was exactly what I was looking for. So, again, thank you, Becca, for being with us.