We have a lot to go through, so let's get started. So again, homework — not homework, project two — is due tonight at midnight. And as a reminder, we will do the more thorough speed test, on the machine with more cores, this weekend on another machine. We'll post updates on that, but it's basically the same thing as the speed test you have now; we're just going to add more cores and make the key space larger.

So today we're going to be talking about cost models. This is the part of the query optimizer that I've skipped past in the last two lectures, where I just assumed there was going to be something that tells us that one query plan is better than another. What we're going to talk about today is how you actually do that calculation, how you compute a cost estimate for a particular query. We'll start off talking about the different types of cost models you can have. Then we'll talk about how you do cost estimation using one of these models. And the paper you guys read from the Hyper guys was a study that shows just how wrong cost models can actually get, how bad it can actually be. Then we'll finish up talking about the extra credit assignment that'll come out today. And if we have time, I'll give you my tips, things that I've experienced or learned working on large code bases. Because you're going to need to be able to reason about and understand the full system for project number three, I just want to give some general advice on how to get started when you have a large amount of code that you didn't write. The goal is that this will hopefully serve you well, not just here in this class, but when you leave CMU and go out into the real world. You're almost always going to encounter really large code bases if you stay in programming or computer science. So again, what are some things you can do to get started on that more quickly, and independently too.

All right, so the emphasis of what we're talking about today is how we're going to determine whether one plan is better than another. Remember I talked before about the stratified search model for optimizers: you do a bunch of heuristics in the beginning, like predicate push-downs or other basic rules you know you always want to apply to a query. And now we're in the phase where you look at different join orderings and other enumerations of plans, and you want to compare to see whether one is better than another. As I said, the cost estimates we're going to generate are only meaningful for that particular instance of that particular database, at that point in time, running on that hardware. It's not something where you can take a cost model for Postgres and a cost model for MySQL, look at the values they're generating, and determine whether one is actually doing better than the other. It depends heavily on the hardware, and it depends heavily on what the database looks like. So these things are really only a relative measurement or estimate that allows us to compare two different plans. And everything we're going to talk about here is independent of all the search strategies that we talked about last class.
So whether you're doing the Starburst model, whether you're doing Cascades, whether you're doing the randomized algorithm in Postgres, all of these different approaches still need to use the same cost model. Regardless of how you implement the actual search algorithm in your optimizer, you can use any of the cost models we're talking about here today.

So a cost model can be comprised of three components. And I will say that these are not mutually exclusive: just because you use physical costs doesn't mean you can't also use logical costs. Typically, database systems will use an amalgamation of all of these, put together to figure out what a good cost is going to be. But here are the three basic categories you can have.

The first is the physical costs of executing a query plan. This would be things like: how many instructions or cycles am I going to execute on the CPU? How much IO am I going to have to do? Cache misses, memory usage, how much pre-fetching I'm going to get. Again, all the low-level things the machine will actually do when it executes your query. Obviously, this depends very heavily on the hardware you have, because different CPUs have different cache sizes and different clock speeds. And it's even more complicated on newer chips, because in the old days they just ratcheted up the clock speed of the CPU, and that meant you always knew it was faster. They don't really do that anymore because of the heat problems; instead they add extra cores, and they add additional instructions to do vectorization and SIMD. All this extra stuff you have to consider, and it makes a big difference in what query plan you may want to choose. So physical costs are heavily dependent on the hardware, and this is hard to do: when you think about Intel, they spend millions and millions of dollars building simulators for their CPUs to determine whether a new design is going to perform well or not. And now you basically need to recreate what Intel has done directly inside your database system. That's really hard.

The next type of costs you can have are logical costs. These estimate the number of tuples, the size of the data, that each operator is going to generate as its output. And this is done completely independently of the physical plan operator, that is, of what algorithm you're using for that operator. What I mean by that is, for this particular cost type, whether you're doing a hash join or a nested loop join, the cost is always going to be the same, because the join is always going to produce the same result, the same number of tuples. So you can do this directly based on the logical plan operator and say, I know I need to do a join and I expect this much output to come out of it.

The third choice is to actually take into consideration what the physical plan operator is going to do and look at what algorithm it will use for whatever it is that you want to do. This is where you consider your access path methods: if you do an index scan it will be n log n, if you do a sequential scan it will be n, or a hash join versus a nested loop join.

So as you can see, no one of these is better than another, and you kind of need all of them. And in a real cost model, this is essentially what they're going to do.
They're going to try to combine all of these things into a single equation and then spit out a number that says, here's the estimate for a particular plan.

So in a disk-based system, the main thing they're always going to consider is disk IO. If you took the intro class, we spent weeks discussing the cost of join algorithms based on the amount of IO I'm going to incur to read pages off disk and bring them into memory. And in that world, they assume that the CPU costs are negligible, because the cost of going and reading a page from disk is orders of magnitude slower than computing some hash join or something like that. So in a disk-based system, this is going to be the main cost. Now, they're going to try to include all the other stuff I just talked about, but this is always the main cost you have to consider when you make choices. So if you have a really good way to model the amount of disk access or disk IO you're going to have to do, that will serve you pretty well in a disk-based system. And you can get really good estimations, because in a disk-based system you have complete control over the buffer pool, the buffer manager. Remember I said you never want to use mmap? Part of the reason is that if you use mmap, you're letting the OS decide what gets swapped in and out, and that's going to be hard for you to model in your cost estimates. But if you're controlling everything, like the replacement strategy and how pages get pinned, and assuming you're the only thing accessing that disk, then you can have more accurate estimates or simulations of what you think the cost is going to be. Now, we're not going to talk about distributed or multi-node databases, but everything I said here, if you just replace disk IO with network IO, is essentially the same thing in a distributed database, because going over the network to get data from a remote node is always the bigger cost.

So if you look at what Postgres does, remember on the last slide I said they assume that the cost of reading something from disk is significantly more than doing some computation on a tuple. The Postgres cost model works as a combination of all these different components we just talked about, and then they have these sort of magic constant factors that say this thing costs more than that thing. So the example here is that reading something from memory is claimed to be 4x faster than reading something from disk, so they're always going to prefer algorithms or plans that try to read as much data from memory as possible. They also model that doing sequential IO when you do a scan is 4x faster than random IO. These are the default settings. You actually can, when you set up Postgres, go tweak these values to make them more representative of whatever environment you're working in. But obviously that's really hard to do, because most people don't know how to tweak these values. And furthermore, if you actually look in the Postgres documentation, they talk about how, in their world, all these different cost factors are relative to each other and relative to the cost of doing a sequential scan of a page.
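To make that concrete, here's a minimal sketch of what this kind of weighted, single-number cost formula looks like. This is not actual Postgres code: the constants mirror Postgres's documented defaults (seq_page_cost = 1.0, random_page_cost = 4.0, cpu_tuple_cost = 0.01), but the struct, function, and parameter names are made up for illustration.

```cpp
// Illustrative sketch of a Postgres-style weighted cost formula.
// The constants follow Postgres's documented defaults; everything else
// here (names, structure) is invented for this example.
struct CostFactors {
  double seq_page_cost    = 1.0;   // baseline: one sequential page fetch
  double random_page_cost = 4.0;   // random IO modeled as 4x a sequential fetch
  double cpu_tuple_cost   = 0.01;  // per-tuple CPU work, relative to a page fetch
};

double EstimateCost(const CostFactors &f,
                    double seq_pages, double random_pages, double tuples) {
  // Physical (IO) and logical (tuple count) components are folded into one
  // unitless number that is only meaningful relative to other candidate plans.
  return seq_pages * f.seq_page_cost +
         random_pages * f.random_page_cost +
         tuples * f.cpu_tuple_cost;
}
```

The point is just that the output is a single relative number: you can compare two plans costed with the same factors on the same machine, but the number means nothing across systems or hardware.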
And so the Postgres documentation has this little disclaimer that basically says: yes, we have these values and we set them to some reasonable defaults, but they're probably not going to be very accurate. And then they warn you that if you go ahead and change them, it might actually make things worse. So this is showing you how complex these things are, and a big part of how they do these cost model estimations is really based on what they think reasonable weighting factors should be for different aspects of a query plan or different operations you want to do. We'll see in a second one particular system that tries to run micro-benchmarks to figure out what these values should be, but Postgres doesn't do anything like this.

Then if we go look at what DB2 does, it's way, way, way more complicated. The way they compute the cost of a query is a combination of: the selectivity and cardinality distribution of the data; what the hardware looks like, what kind of CPU you have, how much memory you have, and what the actual storage devices look like, like how fast it's going to be to read a page from disk; if you have a distributed database, the bandwidth of network communication, because that's what's going to kill you, so they worry about that; then memory resources; and then, what I think is actually kind of cool, they try to estimate what else will be going on in the system, like what other queries will be running when your particular query is running. So if you know that your query is going to have to take a lot of locks, and there are a lot of other queries running at the same time that will also take a lot of locks, then they may choose a plan that tries to avoid taking locks, right? So they also try to model the amount of concurrency or resource contention you're going to have in your system while the query runs.

And again, the main takeaway of all of this is that you have all of this stuff to consider when you build a cost model, and the commercial guys are much more complex than the open source guys. The commercial guys usually don't tell you how they actually do things, because that's the secret sauce of what makes a commercial system better than an open source system, right? In a commercial system, they have people that spent years and years working on coming up with really good models to compute good costs. The MySQLs and Postgreses of the world simply aren't going to have that.

So now in an in-memory system, which is the kind of database system we care about, we really no longer have disk IO, right? Everything is assumed to be in memory. Before, when I showed the Postgres model, they would assume that processing a tuple in memory versus reading a tuple from disk would be 400x faster. Now, in our world, we don't have disks, so the cost of processing something with the CPU, doing some kind of computation, will be roughly on the same order as just reading a tuple, right? So again, in the disk-based world, as we talked about with compression, because reading from disk is so slow, they're willing to pay some extra cycles to compress and decompress data as it moves between the buffer pool and disk. But in our world, we don't want to be decompressing a tuple every single time we need to read it, because that would then be the slow part; we're probably better off just going and reading the tuple.
So the other tricky thing we're gonna have to care about is that unlike a disk-based system, where we have complete control over how data gets moved between memory and disk, in an in-memory system we don't have that anymore, right? We can't control how the CPU decides to move things in and out of our caches. We can try to do things — you can try to touch a cache line to keep it fresh in your L3 — but it's really up to the hardware to decide how everything moves in and out. So that means that, unlike before with our buffer pool, we don't know what the replacement strategy is gonna be, and we can't pin things in our CPU caches. We also have to deal with shared caches, where there could be other threads running at the same time. And then, we'll talk more about this when we come back from the break, but we also have to deal with accessing memory that may not be local to my socket. So if I'm on a multi-socket machine and I'm trying to access data that's managed by another socket in its DIMMs, I have to go over the interconnect or the bus to do a remote read of that memory and then copy things back over, right? That's another cost we now have to account for in our system.

So typically, as far as I can tell from reading the manuals of other in-memory databases, the way they all compute the cost of a query plan is just based on the number of tuples that you're gonna have to process per operator. And the reason — again, ignoring distributed databases for now; assume we're only running on a single node — the reason why you wanna look at just this one metric is that in some ways it encapsulates all the things that could be going on inside our system. If I choose a query plan that has an operator that generates a large number of tuples in its output, well, that's gonna ruin my cache, and I would see that reflected in a slower query. So by knowing how many tuples I'm generating in an operator, I capture that aspect of my query plan as well. For the purposes of today's lecture, and as far as I know from what real systems actually do, this is what everyone does. Because this is easy to do — well, not exactly easy, but you don't have to do more complicated modeling to get good estimations. And we'll see from the Hyper guys that what they argue is that doing a better job at estimating this fixes everything else; if you try to do more complicated things like DB2 does, you don't really get that much improvement.

So the one cost model that I do like to bring up, for historical reasons, that is different from this is the approach used in Smallbase. Smallbase, as I said before, was a research prototype built at HP Labs in the early nineties, which is when all these in-memory databases first started getting built. AT&T had one, this thing called Dali that became DataBlitz, and HP Labs had this thing called Smallbase. Then they realized they actually had a useful system, so they spun it out as a startup and renamed it TimesTen, which I'm sure everyone has heard of by now. TimesTen was then bought by Oracle in like 2006 or so. So in the early days, anytime I was doing research on in-memory databases and went to talk to people, they would always bring up TimesTen and say, well, wasn't this already done in TimesTen? So TimesTen is an older system from the 1990s.
Oracle doesn't really develop it as much as they used to; it's sort of in maintenance mode. In our own system, we can already outperform it, even without doing a lot of optimizations. But back before it was TimesTen, when it was Smallbase, they had this interesting two-step way of doing the cost model. In the first step, the actual developers of the system — not the people that deploy the database, but the people writing the database system code — would enumerate a list of low-level primitives, or operations, that the database system was going to have to do when it executed a query: things like the cost of evaluating a predicate, doing a sort, doing a join, doing an index probe. They would come up with a list of these micro steps. Then, when you downloaded their software and deployed it on your machine, when the system started up, it would run a little micro-benchmark that went through all those operations defined ahead of time and figured out what the actual cost of each would be on that system, in terms of how many CPU cycles they spend and how much memory they actually use. And then, when you do query processing, your optimizer would use those profile measurements generated at startup to do better estimations of what the cost would be for executing a physical operator.

So if you're familiar with BogoMIPS in Linux — if you ever look at /proc/cpuinfo, you'll see this little entry called BogoMIPS — Linux is doing the same thing: every time you boot up the operating system, it runs a little micro-benchmark to estimate the millions of instructions per second your hardware can run, and it uses that to do estimates for scheduling and interrupt handling and other things. It's a quick and dirty way to say, what's the actual performance of this hardware? So Smallbase is doing the same thing here: they're booting the system up and figuring out how fast it can actually go on some basic micro-benchmarks. And obviously this isn't everything: if the size of your database can vary a lot, then the cost of these different query plans can be very different. So you're not computing the query cost just based on this; you're using it in combination with the logical and algorithmic costs we talked about before. This would be an example of the physical costs, and then you augment them with the logical and the other cost components.

So I think this is a kind of interesting idea; the paper came out in 1996. As far as I can tell from looking at the TimesTen documentation, and from actually talking to the TimesTen developers at Oracle, this is all gone. They don't do this anymore, and I don't know why they got rid of it. TimesTen now basically counts the number of tuples generated by each operator.

So another thing we need to talk about, to ground our discussion: given that we just spent the last two classes talking about these search algorithms, and in particular we focused on the Cascades model, the top-down search, the question now is how do you actually use one of these cost model estimates in Cascades?
And the tricky thing about Cascades is that because it's a top-down search, you're looking at the goal, the final output you want for the query, and you're adding operators as you go down to get to the starting point, to how you actually do the scans. That means that when you're at the upper levels of the tree, you don't actually know how you're gonna do the scans at the bottom of the tree, because you haven't done the transformations yet to convert them from logical plans to physical plans. And you don't wanna have to do a complete traversal down to the bottom to figure out what the actual cost is gonna be, because that would defeat the whole purpose of branch and bound.

So the way you do this in Cascades is that when you have a plan with a logical operator at the bottom, you estimate the worst-case scenario for that operator. Like, if I'm doing a scan on a table, my choices are either an index scan or a sequential scan, and those two choices are physical plan choices, right? The logical plan just says scan table A. So you assume that for that logical plan the worst-case scenario, which is the complete sequential scan, is going to be the cost of that plan. What you actually do is maintain two costs per operator group. When you only know that you have a logical plan, you use the worst-case scenario as the upper bound. Then, as you go down and do your transformations and refine it from a logical plan into a physical plan, you can compute what the true cost actually is, using the cost models we're talking about here, and you use that as the lower bound, right? Because the way branch and bound search works, as you go down the tree and start instantiating the physical plan operators, you can never have the cost go up; that would violate the way the search tree works. So you always assume that the cost is really bad, and then, once you know more about what the plan is actually going to do, you can lower it. Right? Okay.

So, given that we now assume that the number of tuples generated in the output of an operator is going to be the main cost, we've got to figure out how we're actually going to derive that estimate. And this estimate is going to be based on three factors. The first is the access methods available on the table, right? So again, whether you're doing a sequential scan or an index scan, and how selective that index could possibly be. Then it depends on the distribution of the values for the attributes that you're doing a lookup on, right? If you only have a single value for the attribute, then the distribution looks one way, versus if it's a uniform distribution it looks another way. And the third part is the predicates you're using in your WHERE clause to do your lookup. And it's a combination of all these things: I know what my predicates are going to be, I know how I'm trying to do my evaluation, I know what the values are going to be for the attributes I'm looking up on, and I know what access method I'm using, and the combination of all of that will determine how much data I'm going to have to process and produce as my output. As we talked about before, this is easy to do for simple queries, right? These are the sargable queries that we talked about before.
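Backing up for a second to the two costs per group: here's a rough sketch of what that bookkeeping might look like. This is only a simplification of the idea described above, not code from any real Cascades implementation, and all the names here are made up.

```cpp
#include <algorithm>
#include <limits>

// Illustrative sketch of per-group cost bookkeeping in a Cascades-style
// top-down search. Names and structure are invented for this example.
struct Group {
  // Upper bound: pessimistic cost assumed while we only have the logical
  // expression (e.g., pretend the scan will be a full sequential scan).
  double upper_bound = std::numeric_limits<double>::infinity();

  // Best (lowest) cost found so far among the physical alternatives that
  // have actually been enumerated and costed for this group.
  double best_physical_cost = std::numeric_limits<double>::infinity();

  void SetLogicalWorstCase(double cost) { upper_bound = cost; }

  void RecordPhysicalCost(double cost) {
    best_physical_cost = std::min(best_physical_cost, cost);
  }

  // Branch-and-bound pruning: if the cost accumulated along the current
  // search path already exceeds the budget (the best complete plan found so
  // far), there is no point refining this alternative any further.
  static bool ShouldPrune(double accumulated_cost, double budget) {
    return accumulated_cost >= budget;
  }
};
```

The worst-case number is what lets the search compare and prune groups before it has bothered to expand them all the way down to physical operators.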
For those sargable queries, we know exactly what index we're going to do our lookup on; we know whether it's a secondary index, a non-unique index, or a primary key index. For really simple things, we know exactly where we need to go to get the data that we need. For more complex queries, this obviously becomes more problematic, because the estimations end up getting worse and worse as we go from the bottom to the top of the tree, which we'll see in a second.

So the measure we're going to use, the way we determine how much data we're going to have to access and produce as our output, is called the selectivity of the predicate. Again, this should all be a refresher from the intro class, but essentially we're going to model the selectivity as the percentage of the data that's going to be accessed for a particular predicate, and we can model this as a probability: the probability that the predicate will return true, or be satisfied, for a given tuple. And because we can model this as a probability, we can apply statistical methods and basic probability math to calculate what the selectivity will be for multiple predicates combined together in a conjunction. We'll ignore disjunctions, but you can usually rewrite those as separate conjunctions.

The way we're actually going to generate these selectivity estimates is using a combination of things. We could have domain constraints, where we know what the values could possibly be for a particular attribute. We can have pre-computed statistics per block. As we talked about before, Oracle calls these zone maps, IBM calls them synopsis tables. The basic idea is that for every block in your database, you pre-compute the min, the max, the average, the sum, and the count for every single attribute. That way you don't have to do a scan of the entire thing; you just look at the zone map and ask, could there possibly be a record in this block that would satisfy my predicate and that I therefore need to look at? (There's a minimal sketch of this check below.) Typically the zone maps or synopsis tables are not stored directly in the block; they're stored in a separate data structure or a separate table. So in a disk-based system, if you need to see what the zone map is for a block, you don't have to go fetch the block; you just look in memory and it's there.

The last three approaches, though, we'll spend a little more time on. Histograms, which we talked about in the intro class, basically maintain the frequencies of the different values in an attribute. We'll also talk about approximations, which are basically the same thing as a histogram without maintaining exact statistics: you have a data structure that computes an estimate of the probability you need for your selectivity. And then we can do sampling, which is the approach the Hyper guys touted as the way to get better, more accurate cost estimates in their optimizer.

So if we're gonna say that we need really accurate histograms or statistics about our data in order to compute good estimates, and therefore generate good plans, the big problem is obviously that these things get out of date. And this is the big idea behind the paper you guys read: it shows how, when your statistics are wrong, you end up choosing bad plans.
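Here's the minimal zone-map check sketch mentioned above. It's illustrative only, with made-up names, and only handles a single integer attribute and a greater-than predicate; the point is just that the per-block min/max lets you skip blocks without ever reading them.

```cpp
#include <cstdint>
#include <vector>

// One zone map entry: pre-computed summary statistics for one block
// (only min/max shown here; real systems also keep sum, count, etc.).
struct ZoneMapEntry {
  uint32_t block_id;
  int64_t min;
  int64_t max;
};

// Return the blocks that could possibly contain a tuple satisfying
// "attr > value"; every other block can be skipped entirely.
std::vector<uint32_t> BlocksToScanGreaterThan(
    const std::vector<ZoneMapEntry> &zone_map, int64_t value) {
  std::vector<uint32_t> candidates;
  for (const auto &entry : zone_map) {
    if (entry.max > value) {  // some tuple in this block might qualify
      candidates.push_back(entry.block_id);
    }
  }
  return candidates;
}
```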
So you maybe think, well, can't you just compute your statistics on the fly as you scan the table, or update your histograms as you go along? And certainly you can do this, and it's not a new idea, but it does have some problems. The one system that I know of that actually does something like this is IBM DB2. They had this component come out in the early 2000s called LEO, the learning optimizer. The basic idea of how it works is: say you have a query come in; you do the normal query planning and cost estimation as you would with any other query. But now, as the execution engine processes the query and scans the data, you check what the data looks like as you read it, and you check whether the estimates that the cost model generated for that particular query match what you're actually seeing in the real data. Like, if I assumed my selectivity was gonna be one thing, and then I come through and scan and see it's another thing, then what you wanna do is send delta records, or incrementally update the histograms or the cost model, to say: you told me before that the data was gonna look like this, but it actually looks like that. And the idea is that as you scan the data, you're killing two birds with one stone: you're scanning the data you need to as you process the query, and you're also updating your statistics. So you can get better, more accurate statistics as you go along, as you do more work.

In contrast, the way systems normally work is that there's usually a background job, a cron job or whatever, that's fired off every so often and does complete sequential scans to update the statistics. If you've ever called the ANALYZE command in SQL, that's essentially what it's doing: it's recomputing the statistics by doing sequential scans. So the idea here is that rather than calling ANALYZE all the time, you just let the engine update its statistics incrementally.

Now, I've never actually used DB2, because it's kind of a pain to set up. But everybody I've talked to that has used this thing, LEO, told me that it never worked correctly; it actually always made things worse. I've had at least two or three DBAs tell me that the first thing they do when they install DB2 is turn this thing off, right? That was four or five years ago, so I don't know whether it's gotten better. It's still in DB2, you can still turn it on. And in the paper you guys read, I don't think they're actually using this feature. It's a neat idea, but it would be interesting to see why it doesn't actually work in practice: is it just implementation issues, or is it something more fundamental?

So, instead of maintaining accurate histograms, you can use approximate data structures to compute some kind of estimate of what the selectivity is gonna be. These are called sketches. Think of these like a Bloom filter: a Bloom filter can tell you approximately whether some key is gonna be in a set. This is sort of the same idea, except instead of telling you whether something is in a set or not, you compute estimations of histograms. So you can count the number of distinct items, you can compute quantiles, frequent items, and other sketches. These are sometimes called streaming algorithms.
So as you get new data coming in and you update your database, you can propagate these updates to the sketches online. And they'll be almost as accurate as computing exact histograms with the ANALYZE command, but they have very low overhead to maintain; you just do this on the fly.

The only system that I know about that does this is Splice Machine. Splice Machine was actually founded by a CMU alum, and he came and gave a talk in the intro class last semester. He made the point that when they were trying to scale their system and support more complex queries, they were doing what everyone else does, computing exact histograms with the ANALYZE function. But it turns out they were getting terrible query plans because the histograms were always really, really off. So they ended up switching to sketches, and that allowed them to get more accurate query plans and scale up to supporting 75-table joins. Remember I said last class that it seems the way to measure the quality of an optimizer and a cost model is how many tables you can join in a query. The Orca guys told me they can do 35. The Splice Machine guy told me that when they switched over to these approximations, they could do 75. And the MemSQL guys told me they can do 130. So I don't know who I actually believe, but I do think this is a really good idea, to use these instead of exact histograms. This is something we should maybe explore in our own system. The link here is for the Yahoo sketching library. It's written in Java, so you can't use it in our system, but what I like about it is that the documentation is really thorough and explains what it is they're actually doing. There's probably a C++ library that implements some of the same sketch data structures, but that's the one I think does a good job of explaining things if you want to learn more.

All right, so the last way you can get estimations for selectivity is to do sampling. The idea here is that you don't maintain sketches, you don't maintain any histograms at all. When a query comes in, you extract the predicates, and then you run a mini query on the real data, on the actual tables, and use that to estimate what the selectivity is gonna be for your predicates. The idea is that rather than trying to approximate what it's gonna be, you just go look and see what it actually is. This obviously gets tricky when you start having more complex joins, because now you need to do sampling on samples of samples; these things become nested. But this is another idea that I think would be interesting to explore in our system.

Now, you would think this would be kind of slow, and it would be slow if you actually ran these as regular queries. But because you're only doing this for estimations in your optimizer, you can run the little sampling job at read uncommitted isolation. You don't have to worry about any transactions, you don't have to worry about seeing dirty reads from uncommitted transactions; you just let it go and read whatever it wants. It does become tricky to jump around and do proper sampling across your entire table. If you're using append-only storage, you may start scanning and see tuples that are not visible to you.
If you're doing the delta record approach, you know how to jump to the master records and see them right away. So the overhead of actually doing this approach depends on what versioning scheme you're using, if you're using MVCC. But in general, again, you don't have to worry about transaction visibility issues and other things like that. So I think this is another interesting idea, and we'll see it again in a second when we talk about the actual paper: this is what Hyper does, and it's what another commercial system does too.

All right, so now that we have our selectivities, what we have to do is generate what the expected output is gonna be for every single operator, right? And you do this by multiplying the selectivity of your predicates by the number of tuples you expect to come in as the operator's input, right? For a join, this is obviously taking the two tables you're trying to join together and feeding them in. For scans, it's based on the access method, on how you're actually trying to do your read.

And the reason why things are gonna go really wrong, why the estimates are gonna get really bad, is that all of these cost models compute the cardinality of an operator's result based on three assumptions. Now, what I'll say is that when you read any database textbook, they always list these as the three assumptions that every model uses. And I don't know whether this is still true, but I feel that at this point in time the commercial systems are probably not making all of these assumptions, right? There's been so much energy and so much money spent on making really good cost models that I can't imagine they would always do this. So I can't prove it, but I feel like what we describe to you in the intro course, and what's in the textbook, is not what's actually going on in practice. But for now we'll assume this is the case, and then we'll see how it fouls us up as we go along.

So the first assumption we're gonna make is that the distribution of values in each attribute is completely uniform. The probability that a given value appears in a tuple is the same as for all other possible values. Now, obviously we know this is not always the case, so what systems typically do is maintain a frequent item set, or some extra data structure, that says: here are the values that occur the most often. These are sometimes called the heavy hitters, and that way, when you do your estimations, you know this value occurs most often and therefore you shouldn't assume that the probability it appears is one over N like all the other values.

The next thing we're gonna assume is that we have independent predicates. That means the probability that a predicate satisfies a tuple is independent of all the other predicates, and therefore, based on simple probability math, you can multiply those probabilities together and that determines the likelihood that a tuple satisfies all of them. And again, obviously we know that's not true either, which we'll see on the next slide. And then the last one is the inclusion principle.
This basically says that if I'm doing a join on two tables, I'm gonna assume that whatever key I'm doing my lookup on in the inner table will exist in the outer table. Now, if you're doing a join on two tables over foreign keys, then you know this is true, because whatever keys are in the child table have to exist in the parent table; otherwise that would violate referential integrity. But you don't always join on foreign keys, so you can't always assume that this is valid either.

So what I wanna do now is show a quick and easy example, which I think I showed in the intro class, but it's relevant to look at again, of how, if you assume the first two principles here, you end up getting really bad estimates. The example that everyone always likes to use is this database of automobiles, of cars. This actually comes from an example in an article written by Guy Lohman, who is the same guy that did the DB2 LEO optimizer and the Starburst stuff we talked about a class or two ago. So say you have a database of all these cars, and there are two attributes for these tuples. The first is the make: Honda, Toyota, Tesla, BMW, whatever. And the other one is the model: Accord, Corolla, Taurus (which they don't make anymore), and so on.

So say you wanna compute the selectivity of this predicate here, where make equals Honda and model equals Accord. If you assume the independence and uniformity principles from the last slide, then the selectivity is one over 10 for the make, since we have 10 makes and we assume each occurs equally often, times one over 100 for the model, because there are a hundred models and we assume one of them has to be Accord. So in this case you'd estimate the selectivity to be 0.001. But we know that Honda is the only car company that makes an Accord, so these two predicates are not independent; they're correlated. The true selectivity of this predicate is actually one over 100, which is 0.01. So we're about an order of magnitude off between these two estimations. If I have 10 million tuples, then one estimate says 10,000 tuples and the other says 100,000 — we'll work through the arithmetic below — and that's a pretty huge difference in your estimations, for a really, really simple example. And as we'll see in a second, when you have really bad estimations, you make bad choices not only about what algorithm or what plan you want to use, but also about how to allocate things ahead of time, like the size of hash tables for hash joins, right? You grossly underestimate what the size of these outputs is going to be, and that ends up hurting your performance.

For this particular example, the way to get around this is to use a technique called column group statistics, or correlated statistics. The basic idea is that you can tell the database system that these two attributes, in this case make and model, are not independent, they're actually correlated, so it should compute its histograms or statistics about them by considering the combination of the two attributes together. And then, when you want to estimate the selectivity for that particular query, you would choose the joint statistic rather than the one where you assume they're independent. What I'll say is that, as far as I know, these are only supported in the commercial systems.
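To pin down the arithmetic from the cars example (these are the same numbers as above, just written out):

\[
\text{sel}_{\text{indep}}(\text{make}=\text{Honda} \wedge \text{model}=\text{Accord}) = \tfrac{1}{10}\times\tfrac{1}{100} = 0.001,
\qquad
\text{sel}_{\text{true}} = \tfrac{1}{100} = 0.01
\]

\[
10{,}000{,}000 \times 0.001 = 10{,}000 \quad\text{vs.}\quad 10{,}000{,}000 \times 0.01 = 100{,}000
\]

With column group statistics kept on (make, model) together, the estimate comes from the joint distribution, so you get the 0.01 figure directly.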
So I think column group statistics first came out in DB2, but I believe a bunch of other systems have them now. And when they first came out — I don't know whether this is still true — this was completely manual, meaning it's not something the database system would figure out for you; you as the DBA have to go in and say, make and model are correlated, compute them as a column group statistic. It's not something the system can figure out automatically, because if you think about how complicated that would be, you'd have to look at every combination of attributes in a table, right? And that would clearly be exponential. So this is not something the system is gonna compute for you; you as the human have to come in and tell the system that you want it to do this. I don't know whether newer systems can do this automatically, but it would be an expensive computation if you had the system do it for you.

All right, so to see why having bad estimates can really foul us up, I wanna do a really, really simple example. Say we have our query, and this is the relational plan; it doesn't matter whether it's a logical plan or a physical plan, right? For now we just look at it in this context. The way we wanna compute the cost: the first thing we're gonna do is compute the cardinality for all our base tables down here, right? In the case of A and C, because there's no predicate on them, the cardinality is just the size of the table. In the case of the lookup on B, it's gonna be the size of B times the selectivity of this predicate here, where B's id is greater than 100. Then, once we have that, we wanna go through and compute the cardinality of all these joins. So in the case of A join B, it's taking the product of the two cardinalities and dividing by the maximum of the number of distinct join-key values in the two tables — that's where the selectivity of the join and of this predicate here come in. And likewise this gets amplified going up: when you wanna compute the cost of joining A, B, and C, you take the cardinality you computed for the first join and put it into the formula to compute the cardinality of the entire three-way join.

So the main thing to point out here is that if I get the bottom part wrong, if my selectivity estimates are bad down here, that gets fed into this, and then I compute some statistic on top of that and that's gonna be wrong, and then it goes up even further and now this is gonna be even more wrong. So the more tables you have and the deeper the plan is, the bigger the scope of how incorrect things are gets. Even in a really simple case, a three-table join, it can be really bad if the bottom part is really wrong. I'm just showing you that these errors can amplify and reverberate up the query plan, because you have to compute your statistics for what's going on up here based on the statistics of the things you estimated down below.

All right, so now we can discuss just how bad these estimates actually get. The reason why I had you guys read the paper is not because it describes how to actually build a cost model, but because it shows you how these estimates can really affect what the cost model is gonna produce.
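For reference, the join-cardinality formula behind the example above is the standard one from the intro course (this is the textbook formula, not something specific to the paper):

\[
|A \bowtie B| \;=\; \frac{|A| \cdot |B|}{\max\big(V(A, a),\; V(B, a)\big)}
\]

where \(V(R, a)\) is the number of distinct values of the join attribute \(a\) in relation \(R\). Any additional predicate selectivities get multiplied on top of that, and the output of one join becomes the input cardinality for the next join up the tree, which is exactly why the errors compound.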
We can use the same formulas that we had from the intro class to say how many tuples are gonna be produced in my output, and I can derive them from whatever the algorithm is gonna be and the selectivity, like I showed here. But now we can show what happens when you get these things really, really wrong.

The paper is from the Hyper guys, and what they did is they developed a new workload for this evaluation called JOB, the Join Order Benchmark. The idea is that they load in a database whose data is skewed like a real-world database's distribution, and then they generate a bunch of different queries with different join orders, and they measure the difference between what the different systems' cost models estimate for a query plan versus what the real cardinalities and selectivities actually are. So what they do is prepare the queries, extract the costs for the query plans, look at the selectivities for all the different operators, and compare those with the true selectivities derived from the real database. They load the database, then run ANALYZE or whatever it is you need to do to have the system compute its stats, so you know it has all the information it could ever collect about the database available to it. You're not worried about updates, you're not worried about propagating changes to your statistics. You assume it did whatever it needed to do to build its histograms or whatever statistics it wants, and it uses those to do its estimations.

So there are two graphs I wanna show you here. For this one, they've taken five different database systems, and again, they're measuring how wrong they are: the difference between the estimated selectivities for the operators versus the selectivities in the real database. So you have five database systems, and along the x-axis you have the number of joins they're doing in the query, so from one to seven tables. On the y-axis is whether they're overestimating or underestimating the size. You can think of this red line here in the middle as the point where you're 100% accurate to what the real data looks like. If you're below that, you're underestimating, you're estimating that it's smaller than it actually is, and if you're above that, you're estimating it to be larger than it actually is.

So right off the bat you see the major trend for all five database systems: they all go down, the estimates get more and more incorrect as you increase the number of tables, and they're all underestimating the size of the outputs. This is because of those assumptions I talked about before. If you assume your predicates are independent, then when you multiply the selectivities you're going to assume they're more selective than they actually are, and that's why you're underestimating the size of the different outputs.

So there are a couple of trends we can look at. The first is that this second database system here actually does the best out of all of them, and it does pretty well up to two joins, so joins on three tables.
But then obviously, as you get more joins, it gets worse and worse, though the slope of its degradation is much less than the other guys'. Then these three guys here perform basically the same; the magnitude of their underestimation is just about the same. And for this particular system here, the spread of the variability in its estimations is much tighter than the other ones: this one here is pretty tight, whereas this one here is much larger. And then you have this guy in the middle, and basically after three joins it just quickly falls apart; this one actually does the worst. So again, the main takeaway here is that in a real system, all these cost models are underestimating the size of the outputs of the operators, because of the assumptions that I talked about before.

Now, in the paper they tell you that this first one here was Postgres and the last one here was Hyper. So does anyone want to take a guess what these other three are? Someone says the bad one is LEO. Anybody else? All right, let's start with the first one: take a guess what the bad one is. Someone says DB2. Anybody else? Someone says MySQL. All right, and this one's the best, what do we think this is? Everyone says SQL Server, anybody else? Okay: SQL Server, Oracle, DB2. All right. So again, I don't know whether they turned on the learning optimizer for this; I don't think so. And again, SQL Server is very, very good. IBM does the same thing: they had their research team spend a lot of time working on the system, and they publish a lot of papers about what they're doing. I will say DB2 actually publishes more papers on what their cost model estimates and optimizer do, whereas SQL Server has other derivative papers based around it. And the main takeaway here is that SQL Server is actually doing the sampling technique we talked about before, the one they're using in Hyper, in combination with the other histograms they may also be maintaining. But the key thing they're doing here is the sampling, as well as doing Cascades.

So again, the main takeaway is that all these different cost models are underestimating what they actually should be, and that can cause a big problem in terms of performance. In the next experiment, you can see the slowdown you get when you have incorrect estimates. What they did here is they took Postgres 9.4 and put hooks into the cost model so that, when it wants to estimate what the cost of a query plan would be, instead of using its histograms to do the estimation, they inject what the real selectivities are. So this shows what you get with perfectly accurate estimates: they run the queries with exactly accurate cardinality estimations when generating plans, and they compare that with what the query does when they run it with histograms and everything else.

The way to read this graph is: along the x-axis is the percentage slowdown compared to running the query with the true cardinalities. So along here, from this point to this point, this is saying that 60% of your queries are gonna be 1.1 to 100x slower than the query planned with the true cardinalities. And this is saying that when you have bad estimates, roughly around 5% of the queries are gonna run 100x or more slower than they actually should.
The reason why this is happening is that, because you're underestimating what the selectivity or the cardinality would be for these operators, the optimizer ends up choosing a nested loop join when it really should be doing a hash join. And the reason it chooses a nested loop join, even though we tell you in the undergrad class that that's usually a bad idea, is that in some cases, if the selectivity is quite small, meaning you're only gonna grab a small number of tuples, the nested loop join is actually gonna be faster than doing a hash join: you don't have to allocate a hash table, you don't have to set everything up, you don't have to hash any of the values when you do your comparison, you just stick things in the output buffer and move along. So what's happening here is that, when you have really bad estimates, it ends up choosing a nested loop join, which is actually the worst thing to do.

So then they went back, turned off the ability to use true cardinalities, and let the optimizer use its regular cost model based on those histograms, but they disabled the nested loop join. This is what you get when you tell it: use your regular estimates, but never use the nested loop join, so it always uses the hash join. Now you see that you barely have any queries running 100x slower anymore; you're moving closer to this side of the graph, which is where you wanna be, right? At this point here, about 50% of your queries are running at basically the same speed you would get with the true cardinalities.

So now the question is, for all these other ones here, I'm doing the hash join the way I should, I'm not doing the nested loop join, why am I still slower? Well, it turns out, as I said earlier, you're using these estimates not only to estimate what the cost of executing a query is gonna be, but also to determine how big a hash table you should pre-allocate for your hash join. So if you underestimate the size of the input you're building the hash table on, you're gonna pre-allocate a small hash table. Then, when the table gets too full because you underestimated, you get these long collision chains, and a lookup on a probe becomes really expensive, because you're scanning along a linked list to find the key you're looking for. So then what they did was: Postgres 9.4 didn't have the ability to do dynamic hash table resizing, but 9.5 did, so they backported the fix from 9.5 to 9.4. And now you see that when you only use hash joins and the hash table can resize itself, you get better performance here. Again, you no longer have anything being 100x slower; everything's much closer to where you want to be, which is what you would get if you had true cardinalities. So this is another example showing that it's not just about picking a bad plan; it's also about pre-allocating things incorrectly, which can be caused by having bad selectivity estimates.

So what are some of the lessons that the German guys talk about? Viktor actually sent me these in an email last year, and these are the main takeaways that they got from the work that they did.
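Before going through their takeaways, here's a minimal sketch of the hash-table sizing issue from the Postgres experiment above. This is illustrative only, not Postgres or Peloton code, and all the names are made up: the build side sizes its table from the optimizer's cardinality estimate, and letting the table grow once that estimate turns out to be a big underestimate is the defensive fix.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Illustrative hash-join build side. A fixed-size table sized from a bad
// cardinality estimate ends up with long collision chains on probes;
// rehashing into a larger table when we blow past the estimate avoids that.
class JoinHashTable {
 public:
  explicit JoinHashTable(size_t estimated_build_tuples)
      : buckets_(NextPowerOfTwo(estimated_build_tuples)) {}

  void Insert(uint64_t hash, size_t tuple_idx) {
    if (num_entries_ + 1 > buckets_.size()) {
      Rehash(buckets_.size() * 2);  // grow once the estimate proves too small
    }
    buckets_[hash & (buckets_.size() - 1)].emplace_back(hash, tuple_idx);
    ++num_entries_;
  }

  // Probing returns the chain for this hash; the caller scans it for matches.
  const std::vector<std::pair<uint64_t, size_t>> &Probe(uint64_t hash) const {
    return buckets_[hash & (buckets_.size() - 1)];
  }

 private:
  static size_t NextPowerOfTwo(size_t n) {
    size_t p = 1;
    while (p < n) p <<= 1;
    return p;
  }

  void Rehash(size_t new_size) {
    std::vector<std::vector<std::pair<uint64_t, size_t>>> fresh(new_size);
    for (const auto &bucket : buckets_)
      for (const auto &entry : bucket)
        fresh[entry.first & (new_size - 1)].push_back(entry);
    buckets_.swap(fresh);
  }

  std::vector<std::vector<std::pair<uint64_t, size_t>>> buckets_;
  size_t num_entries_ = 0;
};
```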
So one of the arguments that they make is that even though we've been spending all this class talking about how to build in-memory databases and in-memory indexes, and we'll talk about vectorization and compilation after we come back from spring break, all of that is for naught if you generate crappy query plans. You can have the fastest engine in the world, but if you're gonna generate nested-loop-join plans that are really, really slow, then all the work you did to make the thing run fast is just thrown out the window. So coming up with a good join ordering is really important.

The other thing they talk about is that you should just assume that your cardinality estimates are gonna be wrong. Ideally you wanna use something like sampling to help you figure things out a little better, but you also wanna make sure, when you implement your operators and your execution engine, that they don't rely on exact numbers from the cost estimates. So again, in the hash table example, the early version of Postgres had a fixed-size hash table that was based on the estimates, but the newer version could dynamically resize the hash table if it realized later on that the estimates were incorrect. You wanna do this as much as possible in all your implementations, to avoid having to stop the world and recalibrate all your internal data structures. Currently in Peloton we cannot do this, but in our new LLVM engine we can do some of these things.

The other thing he argues is that in some cases, if you have good query plans and a fast engine, just doing complete sequential scans on tables, when everything's in memory, may actually be faster than having complex indexes, right? So rather than worrying about what index to use for these lookups, you just say: I'm gonna do a sequential scan; as long as I get the join order correct, I can do sequential scans, and that ends up being faster than more nuanced, more complicated approaches.

Then lastly, they argue that trying to have really accurate or fine-grained cost models, like in the DB2 case where they had all those different components, ends up being a waste of time, right? You're better off trying to estimate what the cardinality is gonna be for your operators. Sort of what I said at the beginning: I'm gonna ignore all the complex physical costs and just ask, how many tuples am I gonna generate? If I can do a good job at estimating that, then that improves my cardinality estimates, and that makes all these other aspects fall into place more easily.

All right, so what are my parting thoughts on this? Again, the number of tuples you're gonna end up processing in your operator ends up being a reasonable cost model for main-memory databases. Things will change if you bring back disk, as we'll see later in the semester, or if you make yourself distributed, but for our purposes, when we only focus on a single node, this is gonna be good enough. But as we showed, in order to do this, you have to have good estimates for the selectivity and cardinality of your operators, and having those be accurate is difficult.
And so this ends the discussion from the last three lectures on query optimization for the course. The main takeaway, in my opinion, is that doing Cascades as your query optimization search strategy, plus doing tuple-count cost estimation, the number of tuples you have to look at, is the right way to build a query optimizer in a main-memory database system. I think that combination makes it easier to engineer and easier to get good query plans than the other approaches out there. And as far as I know, most main-memory databases do this; some of them do Cascades, some of them do the stratified search. All right, any questions about query optimization and cost models? Okay, so in the last 15 minutes we'll talk about extra credit and life lessons for working in a large code base. So I'm giving out an optional extra credit assignment where you pick any database system you want and basically write a Wikipedia-style article about it. We're building a new online encyclopedia here at CMU that's basically the database of databases. I have a list of 400 databases that have been released in the last 20 or 30 years. They come out all the time; actually, a new one, TimescaleDB, got released today. So we have this huge list, and there's no centralized resource that describes what these systems actually do. If you try to read the Wikipedia article about one of these databases, the problem is that it's very marketing heavy: some guy who works at the company went and wrote the article and says how great their database is. It's hard to do an exact comparison and understand what they're actually doing. So rather than this being free-form text like Wikipedia, we have a website we've been building where we provide a taxonomy of the different aspects of a database system. For example, when you want to discuss which concurrency control scheme a system uses, rather than just writing free-form text saying they use two-phase locking, you select "two-phase locking" and then describe exactly how they do things or what makes them different from other systems. So what you'll want to do, and I'll send the announcement today, is look at the website; there will be a bunch of these categories of features, you select which options the system provides, and then you write a little synopsis describing how it actually does certain things. The website is dbdb.io, and the way we're going to divide this up is that it's an individual project, so it's not done in the group you're using for projects two and three. To make sure no one writes an article about the same system (and we already have some systems that people wrote about last year), I will post a sign-up spreadsheet on Piazza, and you can select which system you actually want to work on; it's first come, first served, so whoever picks MySQL first gets MySQL. I will say, though, that the way I'm going to grade this is that if you pick a widely known, popular database system like MySQL, then I expect the article to be very comprehensive and very well written, because there are a ton of articles out there that describe what MySQL actually does.
If you choose a more obscure database system, like TimescaleDB, because it just came out today, or FaunaDB, another newer one, then you'll have to do the best you can to find the information you need, but I also encourage you to contact the actual company and ask, hey, what does your system actually do? I'm sure they can help you. The website looks like this: in the case of MySQL here, again you have the different features a system could have, you select whether they have them or not and which of the different options they use, and you write a little description of what they have. And then we have a search page where you can filter on all of these different options: show me all the systems that use two-phase locking, show me all the systems that do MVCC or command logging, and you can filter them and see which ones have which aspects. The idea, again, is that rather than having something like Wikipedia's list of database systems that support MVCC, which is completely ad hoc, this is a more structured way of finding these things. All right, any questions? Okay, so there are a ton of different database systems out there. I always like to show this little graphic to make the point that no one should have a problem finding an interesting system to write about. And in fact, the people who actually built these systems are usually super excited that someone is interested in them, so if you contact them and ask how their system actually does something, they'll be glad to help you. The last thing I'll say, and I've brought this up before in class: obviously this needs to be written by you. You're not allowed to copy text or images from blogs or from things other people have written. We will do some cursory searching to make sure you're not plagiarizing. If you want to copy an image, you should ask me first; ideally it'd be better if you just recreated it yourself and then provided attribution saying where you got the original design from. If there's something you think you absolutely cannot recreate yourself and you want to use it, please contact me and we can discuss whether that's the right thing to do, okay? So any questions about the extra credit? All right, cool. So now I'm going to finish up by talking about some basic advice I can give you on how to work on a large code base. I first want to say that I'm obviously not that old; I'm not a 30-year veteran who has spent years and years working on systems. But I have spent some time working on two large database systems, both here at CMU and in grad school. And before I started grad school, I actually worked on a large distributed system, a batch processing system, at the University of Wisconsin. So I have spent some time reading large code bases and working on them. I've also spent some time reading what I call enterprise code as part of doing consulting work, where, in an IP lawsuit, they give you access to the source code of some product and you have to read it to figure out what's actually going on and see whether it violates a patent or not. So I have spent a lot of time doing this, and I've looked at what people would call commercial-grade code. It's usually not as good as the open source stuff, right? All code's bad, but.
So, everything I'm going to talk about here, I'm not claiming it's gospel. I'm only going to discuss things I've done in my life that have helped me get acquainted with a large code base. The reason you want to get comfortable with being able to do this is that when you leave CMU, I can almost guarantee you that no matter where you go, you're never going to be working on a system from scratch. You're never going to say, all right, I'm literally going to start with main.c or main.h and write the code completely from scratch. You're probably going to get hired at a company and be asked to get involved in an existing product. And even if you do build a new database system, there are all these third-party libraries and other things you're going to want to use; no one's going to write their own stdio.h. You use what's out there. So it's really important to learn how to get comfortable with understanding code quickly and independently, because there isn't always going to be documentation, and there aren't always going to be comments in the code describing what's going on. So it's important for you to be able to do this. I'll also say that last year was the first time I taught this course, and before that I spent about a year planning all the things I wanted to talk about during the semester. So I would talk to my friends at these database companies, both the startups and the large ones, and I'd say, hey, I'm teaching this course on database internals for the first time. What do you want me to teach them? If you wanted to hire someone from CMU who had taken my database course, what's the one skill you'd want them to get out of it? Do you want them to know how to do B+trees? Do you want them to know how to do locking, or latch-free stuff? And without me telling them what the others had said, they all basically said the same thing. They said they're very interested in hiring students who can show up, get started on a project right away, start working independently on a large code base, and be comfortable with that. So I guarantee you that when you leave here, you will have to do this, and the better and more comfortable you are at it, the more successful you will be. And actually, there's one database company that I won't name that said the number one test they give when interviewing new hires, the one that's the best predictor of how successful the applicant will be, is this: they hand them the source code for Memcached, which the company doesn't sell and doesn't work on, and they ask the applicant to write a new feature in two or three hours. You're allowed to go read Stack Exchange, read whatever you want online, but you have to sit there and figure out how to add some new feature. And they said a person's ability to do that has been the best metric for determining how good they're going to be. So again, this is really important to learn how to do. And there isn't anything magic I can tell you to make it happen; there's no magic software you can buy from IntelliJ or anyone else that will make you super good at this.
It just comes with practice. All right, so the wrong thing to do, maybe to start off with, is to just read the code for the sake of reading code. I have this happen all the time: students come to me and say they want to work on my database system, they want to do research with me, and I say, here's Peloton, here's the source code, go get started on something with it. And what happens is the student comes back a week later and says, oh, I've just been reading the code, I haven't written anything yet. Those students always turn out to be not as good as the other ones, because if you just read the code for the sake of reading it, you're not going to get as much out of it; you're just reading C++, but it doesn't really mean anything to you. So I think it's really important to get started right away, actually modifying the code and trying to do something with it, even if it's really trivial, because that forces you to understand all the different moving parts in the infrastructure and lets you get more comfortable with what's going on. There are two ways you can do this. One is to start writing test cases. It's not exciting, it's not sexy, but it's an important thing to do. And by writing test cases, I mean unit tests at the C++ level, rather than regression tests where you just feed in a SQL query and see what comes out. Writing unit tests requires you to understand which functions you need to call and how the different parts work together, and you'll see as you go along that to call this function, you need to provide that input. That's way better than just reading for the sake of reading, because you're internalizing what's actually going on. And the nice thing is that if you add new test cases, as long as they work and are correct, people are going to love you. No one's going to complain, oh man, you spent your first day writing test cases; no one's going to fire you for that. So it's a win-win: you're in good graces with the people you're working with, you're actually making the system better, and you're making yourself understand the code base more. (I'll show a small sketch of what such a test might look like in a moment.) The alternative is to do refactoring. This is where you find some part of the code you think you're going to be working in a lot. Say, again, you're going to work on the indexes: maybe you look at the index API we have, and you go through and start cleaning up the code. Maybe you reorganize functions that have gotten too big, maybe you add comments, maybe you clean up some hacky code in there. The idea, again, is that just by cleaning up the code and making modifications, you're internalizing what's going on, and it makes you understand the code base a bit more. The danger is that because you're now modifying core functionality of the system, you need to be careful not to break things. There may be assumptions in the code that you don't fully understand; maybe somebody did something a certain way, even though it looks like a hack, because it needs to work with other parts of the system. That's unlike test cases, where you can write them and, if they crash, who cares; you haven't broken the rest of the system.
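Here's a minimal sketch of the "learn the code base by writing unit tests" idea, using GoogleTest. The index class, factory, and header below are hypothetical stand-ins, not Peloton's real API; the point is that just getting a test like this to compile forces you to discover how indexes are created, what key and value types they expect, and how lookups return results.

```cpp
#include <cstdint>
#include <vector>

#include <gtest/gtest.h>

// Hypothetical header; substitute whatever index API the code base exposes.
#include "index/index_factory.h"

// The assertions matter less than the exercise of wiring the pieces together.
TEST(IndexLearningTest, InsertThenScanKeyReturnsValue) {
  auto index = example::IndexFactory::CreateBPlusTreeIndex(/*key_size=*/8);

  index->Insert(/*key=*/42, /*value=*/1001);
  index->Insert(/*key=*/43, /*value=*/1002);

  std::vector<uint64_t> results;
  index->ScanKey(/*key=*/42, &results);

  ASSERT_EQ(results.size(), 1u);
  EXPECT_EQ(results[0], 1001u);
}
```

Even a throwaway test like this teaches you where the factory lives, how ownership of the index is handled, and what the scan interface looks like, which is exactly the kind of knowledge you'll need before making real changes.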
Coming back to refactoring: for that approach, you need to be a bit more careful about what you're doing. Another key thing to get acquainted with right away is how the organization actually builds and tests its code. Google is famous for having one giant source code repository for the entire company, along with its own build pipeline and review process for when you want to make changes. Getting acquainted with that right away shows you how everything is done, and if there's documentation available, great, go read it. If there isn't, then you can be the person who starts it, because in order to write down exactly how to build and test the code, you need to understand it. So by going through the process and documenting what happens, you end up understanding it more than you would by just copying a script and hitting a button. All right, so that's basically, at a high level, what I can say about working on a large code base. What I suggest for when you get started on project three is this: I can guide you to roughly where in the code you're going to need to make modifications, but it's up to you to get in there, start making changes, and start extending the code to do what you need it to do. And again, you can either write test cases or refactor some of the code we have to make it better, and that will help you understand what's going on. For example, one team has been working on the new catalogs: they've been refactoring the catalog API to prepare for the more complicated things they'll do for project three. Another team has been working on the optimizer, adding the missing parts needed to feed information from the catalog to the optimizer. So again, making changes right away is the best way to get started in a system, rather than passively reading source code. Okay? All right, today's Thursday, and next week is spring break. Enjoy whatever it is you're doing, and please come back alive. When we come back, on the first Tuesday, everyone will have to do the proposal for project three. So all of you who are in groups should get together, talk with your group members, and decide what it is you actually want to do. And if you're not sure whether something would be interesting, or whether it would be too hard, or you're just looking for feedback on your ideas, please contact me, because I will be around here all next week; I don't go on spring break. You can come meet with me and we can discuss what you want to do. Okay? The project proposals will just be five minutes. They're not meant to be an exhaustive explanation of everything you want to do. It should be really straightforward: this is what we're going to do, here's the code we're going to change, and here's how we're going to test it. The webpage for project three is up; I just haven't linked it to the schedule yet. There you'll see what I'm expecting in your proposal. Okay? Any questions? All right guys, enjoy your break.