Okay, let's get started. So real quick, if there are high-level questions about project number two, again, it's due on Thursday. Dana will be having office hours immediately after class today, so if you have technical questions or implementation questions, you can go talk to her about that. Okay? All right, so today is part two of our discussion on optimizers. We're going to discuss the Cascades optimizer implementation in a bit more detail, and then we'll go over at a high level what the Orca optimizer, in the paper you guys read, was actually doing, because Orca is a modern implementation of the Cascades framework. So it's not so interesting to read the paper for how exactly you do Cascades, because we sort of covered that last class and we'll cover it today, but I think it's an interesting paper for seeing the things you have to worry about when you actually put an optimizer in a real system. And then today is also the release of project number three. I'll update the website later on today, but I want to go through at a high level some of the different project topics you can consider with your group, because on the Tuesday after spring break, all of you with your group will come and do a five-minute presentation about what your topic is going to be. So I just want to go over, at a high level again, the kinds of things you can explore. All right, so just as a refresher for where we left off on Thursday last week: queries come in, and we convert them into some kind of logical plan. The goal of the optimizer is to figure out the best physical plan for that query that produces the correct result, meaning the plan generates the answer you were looking for, and that also has the lowest cost. And remember, we said that cost is highly dependent on what kind of database system you're building and what the operating environment is, and we'll discuss more about that on Thursday. For our purposes this semester, we're really focusing on single-node in-memory database systems, so counting the number of tuples you have to read or write ends up being the main thing we want to use. We also talked about how generating these query plans is the hardest thing you can possibly do inside of a database system, because the problem is known to be NP-complete. So no optimizer, even though it's called an optimizer, is ever really going to find the optimal plan, just because the search space is so large. Instead, we'll use a bunch of heuristics, like we saw in the System R case when they only looked at left-deep trees, to prune out candidate plans and reduce the number of things we have to look at. And again, the way we'll determine whether one plan is better than another is with some kind of cost model that estimates what we think the database system is going to have to do to execute that query. We can't just run the real query and see what the result would be, because if we're looking at possibly a million different combinations of query plans for a single query, that would be way too slow. So again, the overview of where we're at: the query comes in, we parse it, we bind it. And for our discussion from last class and this class, we're really focusing on this part over here.
This is where we actually do the search to find a better plan. And I said before that the tree rewriters, depending on what kind of optimizer framework you implement, may be inside the binder, or they may be inside the optimizer as the first stage in a stratified search model. Or, as in the case of Cascades, you may not have this at all, because it's implicit in the unified search model. So last class we went through five different choices or designs you could have for how you implement your optimizer, and this was roughly in chronological order. In the 1970s we said that Ingres and Oracle implemented these really primitive heuristic-based optimizers that had a bunch of hard-coded rules written in whatever language the database itself was written in. In the case of Ingres, because they were really early UNIX adopters at Berkeley at the time, everything was in C. So these things had rules written in a procedural language that let you do the transformations. This is essentially the tree-rewriting phase that I showed in the last slide. Then later on, the System R guys came out with something that did the heuristics the other guys did, plus a cost-based search model, or search algorithm, that evaluates the different join orderings you can have for your query and chooses the one with the lowest cost. Then we talked a little bit about the randomized search algorithms, so simulated annealing and then the genetic search algorithm in Postgres. And I said that other than the genetic algorithm in Postgres, nobody actually implements this, because it's hard to debug and it's difficult to get deterministic query plans. Then we talked about stratified search and unified search. And the main takeaway from all this discussion is that in a modern database system, these are really the two choices you have to consider, right? The stratified search is all about taking a logical query plan and using the same kind of heuristics that Ingres and Oracle did in the beginning and just doing those transformations directly. You have to do a little bit of actual work to make sure you're doing the correct transformation; if you care about the sort order or something, you have to bake that into the heuristics. But everything that happens during the rewriting or transformation phase doesn't consider the cost model at all. You just know there are some hard-and-fast rules you're always going to want to apply, like predicate pushdown, limit pushdown, things like that, and therefore you don't need to evaluate the cost to do this. And then once you've done that, you go into the optimizer phase where you actually do the cost-based search. And how sophisticated the optimizer is determines how large a scope, or how many different combinations or variations of the query plan, you can consider in that second phase. So in System R, they only look at join orderings. In Postgres, they pretty much only look at join orderings. And in DB2, they can do other things; they can deal with CTEs and ORDER BYs and GROUP BYs and things like that in the actual search. And then, okay, Postgres is actually a "modern", in quotes, implementation of the stratified search, in addition to DB2.
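To make that rewrite-phase idea concrete, here is a rough sketch of what a predicate-pushdown rule might look like in code. The plan node types and the helper function are made up for illustration; they are not Postgres's or Peloton's actual classes, and a real rewriter would work over a richer expression representation.

```cpp
// Hypothetical plan representation for illustration only.
#include <memory>
#include <set>
#include <string>
#include <utility>

struct PlanNode {
  virtual ~PlanNode() = default;
};

struct Scan : PlanNode {
  std::string table;
  explicit Scan(std::string t) : table(std::move(t)) {}
};

struct Filter : PlanNode {
  std::unique_ptr<PlanNode> child;
  std::string predicate;              // e.g. "artist.id = 123"
  std::set<std::string> tables_used;  // tables the predicate references
};

struct Join : PlanNode {
  std::unique_ptr<PlanNode> left, right;
  std::string condition;
};

// Heuristic rewrite: Filter(Join(L, R)) -> Join(Filter(L), R) when the
// predicate only touches the left input. No cost model is consulted; the
// rule assumes that filtering before joining is always at least as good.
std::unique_ptr<PlanNode> PushDownPredicate(std::unique_ptr<Filter> filter,
                                            const std::string &left_table) {
  auto *join = dynamic_cast<Join *>(filter->child.get());
  if (join == nullptr) return std::move(filter);  // rule does not apply
  if (filter->tables_used != std::set<std::string>{left_table})
    return std::move(filter);                     // predicate needs both sides
  auto pushed = std::make_unique<Filter>();
  pushed->predicate = filter->predicate;
  pushed->tables_used = filter->tables_used;
  pushed->child = std::move(join->left);          // filter now sits below the join
  join->left = std::move(pushed);
  return std::move(filter->child);                // the join becomes the new root
}
```

The point of the sketch is just that the rule fires purely on the shape of the plan; no cost estimate is ever consulted, which is exactly what makes the rewrite phase cheap but also inflexible.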
And I think it's really interesting to talk about this because, in my opinion, the unified search is better, but rather than me just coming and telling you guys, hey, it's better, I want to show you why the stratified search has some problems. And in particular the Postgres query optimizer: when I talk to the Postgres developers, they always sort of hem and haw and say, oh yeah, we know about its problems, we know about the issues. It's kind of a mess of code, and no one's really willing to spend the energy to clean it up and do it right. So in Postgres, in their optimizer, they have this really rigid workflow for how you do all the various transformations for the query plan. This is the stratified search; again, you're going from one stage to the next. In the first stage, you do all the initial rewriting that we talked about before, again from Oracle and Ingres, the predicate pushdowns. Then you have the cost-based search that only finds join orderings. And then everything else that you can possibly have in the query is tacked onto the query plan after it comes out of this. So again, common table expressions, HAVING clauses, GROUP BYs, and things like that: all of that is applied, with more heuristics, to the query plan after it comes out of the cost-based search. And this part here is where either the bottom-up dynamic programming approach that System R uses, or the genetic algorithm we talked about before, fits in. And then any time you have subqueries, you have to recursively descend into them and do all these stages again. So if you have a nested query, you do the planning for the outer query and then go inside and do the planning for the inner query, though actually you might go into the inner query first and then do the outer one second. And so again, when you talk to the Postgres developers, one of the main developers is actually here in Pittsburgh, he's a CMU alum, and he says, yeah, it's a large amount of code that few people really understand and few people are willing to make changes to, because if you foul up any of these stages, or you violate some assumptions that the later stages have about what the query plan should look like, then that breaks the whole thing. So kudos to the Postgres guys for getting as far as they have with the query optimizer they have, but from a software engineering standpoint, a stratified search model is more difficult to implement and maintain. So again, the unified search model that was in Volcano, and was in Cascades and Orca, doesn't have this idea of doing these first transformation steps and then going on and doing these other steps. It's sort of like you throw everything into this single search model, and it just knows how to traverse the solution space, apply the transformations, recurse down into the query plan, and do whatever else you need to do to optimize it. So you don't have this one-stage-then-the-next structure; everything's all put together. And in my opinion, this is better, because in some ways it's easier to implement: you just define these rules and say, here's how I want to do these transformations, and then you throw it all in, hit the blender button, and it spits out a reasonable query plan. Now obviously you have to implement the rules correctly. Obviously you have to have a good cost model to make this all work.
But again, in my opinion, this is the better way to go. Another way to think about this is in the context of machine learning, right? The big craze with deep neural networks is that these things are, not easy to use exactly, but very universal. You just throw your data at it, whether it's an RNN or a CNN, and it spits out a reasonable answer. And you don't really need to understand what's going on at all the different levels or layers in the neural network; it just seems to magically work. Contrast this with a more specialized model that you're trying to tailor to exactly whatever problem you're trying to solve: now you have to understand the model and the underlying algorithms, and maybe understand the math a bit more, and that's more complicated. And at the end of the day, you may end up with the same result as the deep neural network, but in the deep neural network case, it's like a pickup truck: you just throw everything in it and you get a good answer. So that's the way I like to think about this. The unified search is sort of like the deep neural network way: you just throw everything in and let the search algorithm find the right answer. Whereas the stratified search requires you to have very low-level expertise in relational query transformations and other things, and to make sure your pipeline is connected together correctly to make it all work. So again, this is why I think the unified search model is better. So we sort of rushed the Volcano optimizer at the end of class last time because the projector broke. I emailed them again; let's see whether it breaks again. Every single time it breaks I send them the link on YouTube: here it is, here's exactly my talk when it crashed, please go fix it. Let's see what they did for today. All right, so the Volcano optimizer came from this guy Goetz Graefe, and again, he's the same guy that wrote the index locking paper we had earlier. It's the same Volcano as in the iterator model for parallel execution. And the Volcano optimizer was this general-purpose optimizer-generator framework where you would write these rules in a DSL, and then Volcano would spit out the code you would need to actually integrate that optimizer into your database system. So the way to think about these optimizer generators: instead of you writing the actual search algorithm yourself, they provide the search algorithm and you just write the rules, and it knows how to schedule the transformations using your rules to spit out an optimal plan. And then think of the Orca stuff we'll talk about in a second as a layer above that, where they're already implementing the rules for you, they've already implemented the Cascades or Volcano search framework, and they're providing you with something you can just link into your database system without having to know how to do the transformations on SQL; they do all that for you. So at this point here with Volcano, we're a level below that. And again, Volcano, as I said, was the second-generation query optimizer framework that Goetz had invented. The previous one was a thing called Exodus, and it had a bunch of deficiencies that made it difficult to maintain. Volcano was the second one, and then Cascades was the third and final one, right?
It sort of took all the lessons he learned from those previous implementations and made a better version, but at a high level Volcano and Cascades are essentially doing the same thing. So the key difference is that we're going to be doing top-down search, contrasting this with the bottom-up approach that we saw with System R. And the physical properties of the intermediate data generated by each operator are actually encoded, or treated as first-class entities, inside our planning algorithm. So remember I said that in the case of System R, you either had to have heuristics that know how to enforce the sort order of data, or you had to bake the notion that data should be sorted into your cost model, so that you can identify that data is already coming pre-sorted out of a sort-merge join and that this has a lower cost than doing a hash join. With Volcano and Cascades, these physical properties of the data are actually embedded as enforcers inside the search tree, so that if you violate a physical property of the data, you know this is not a feasible plan and you don't need to compute the cost for it. We'll see that in a second when we go through the examples. So the top-down versus bottom-up approach, I think, is an important distinction. Roughly, the bottom-up approach goes with the stratified search; I guess you could also use stratified search with top-down, but the stratified search systems, Starburst and Postgres, are doing bottom-up, and Cascades, Volcano, and Orca are doing top-down. And the way to think about the difference is that in top-down, you start with the final result of what you want for your query plan, I want these tables joined together and I want them sorted, and then you search down into the tree to find what operators you need to add to your plan to get you to that point. In the bottom-up optimization approach, like in the dynamic programming method, you start without anything, with just your tables, and you figure out what operators you need to add along the path to get to your end goal. At a high level this seems like it's just semantics and there's no real difference, but when it comes to actually implementing it, we'll see when we talk about the memoization table in Cascades that in the top-down approach you may actually revisit different operators, because you're going up and down the tree multiple times, whereas in the bottom-up approach, once you've evaluated one stage of the plan and moved on to the next stage, you never go back and revisit things, because you've already made your decision about what the best operators were at each level. And again, at a high level, I think the research says there's no real difference: both of these will produce roughly the same answer. But in my opinion, top-down is the better way to implement this. It does have a storage or memory overhead when you actually run the algorithm, which is what they try to address in Cascades. So I want to go through the example that I was trying to show last time with Volcano; it got rushed at the end because the projector had problems. It's a really simple database with three tables: artist, album, and appears. And appears is the cross-reference table that has foreign key pointers to both of the other two tables.
And so the query we're going to generate a plan for is a three-way join between these three tables, where we want to find the names of all the artists that appeared on my mixtape, and we want to sort them by their artist ID. So the first step in Volcano is that we start with the logical plan, with the root of the search being the final result that we want. At a logical level, we want our artist, appears, and album tables joined together, and then we want the result sorted by the artist ID. So this is essentially the enforcer property I talked about before. We're saying that anything that feeds into this to produce the final answer has to be sorted on that attribute, and anything that would produce a result feeding into this that did not enforce that is considered invalid and we just throw it away. So now we're going to invoke a bunch of these transformation rules, which allow us to look at combinations or pairs of logical operators in our query plan and convert them into equivalent logical operators, so a join of A and B can be rewritten as a join of B and A, or to take a logical operator and convert it into a physical operator, like take the join of A and B and modify it to be the physical operator that performs a hash join on A and B. So for this, we have our logical operators just as placeholders so we know where we're at. And we'll start off by transforming this join here, with a scan on this table here, into a sort-merge join. And because we want to get data directly out of the album table, and we can't do that directly with a logical operator, we rewrite that to be a physical operator, say an index scan on the name attribute, and that can then be fed into the sort-merge join here. And then we just proceed: at every single step we compute what the cost would be, and we check whether the estimated cost for where we're at in the tree is greater than the best cost we've seen so far, and if so, we know we don't need to keep going down. In this case here, we can go further down: we can also apply the hash join transformation for artist and appears, and same thing, we need to feed data into it with physical operators, either a sequential scan on artist or a sequential scan on appears. And the same thing with the sort-merge here. So now let's say that as we go along we want to check again that the physical properties of our nodes are being satisfied. In this case here, we know the sort-merge join will provide data that's sorted on the attribute we want it sorted on, so that's considered okay. But if we have a hash join feeding into this, then since we're hashing the data, it's going to be completely random how the data is actually ordered or stored. So that's a violation of the physical property, and therefore we know we don't have to compute the cost of this, because it's not feasible. So instead, what you would have to add is a sort, which can be implemented with quicksort or whatever algorithm you want your system to support, and then we can feed our hash join into that, and that would connect down here with whatever join we're actually trying to do.
So in this case here, since we already know that there's some plan, probably the one through the sort-merge join, that has a lower cost than computing the hash join plus the sort (and the sort adds additional computation cost as well), we know that this will never be less than the best plan we've seen so far. So we can cut it off right here, and we don't need to go down and do additional transformations at lower points in the tree. And this is sort of a classic branch-and-bound search, but what's different is that as we go down from level to level we can dynamically apply those transformations. That's one of the big differences between Cascades and Volcano. In Volcano, you would pre-populate all the transformations ahead of time; in Cascades, it tries to be smart and say, well, since I know I don't need to go down here, I don't need to do whatever transformations I would normally do. Is this sort of clear? Okay. All right, so again, in the case of Volcano we have to enumerate all possible transformations, and we'd better be careful about not repeating ourselves. This is where the memoization table comes into play: when we recognize that we've already applied some transformation and we've already computed its cost, we can do a lookup in our table, figure out that this is the same thing we've seen before, and skip going down into the tree to evaluate it again. And we can do this because we're going from the top down. So the advantages of Volcano are that we have these nice declarative rules that make it easy to implement transformations, and make it easier for someone to come along maybe a year later and extend your optimizer with new functionality without worrying about breaking other parts of the system or other transformations. The memoization table is going to let us reduce the number of cost estimations we have to do, which again makes the search more efficient, and we'll see what the memoization table looks like in a sec with Cascades. The downside, again in the case of Volcano, is that you have to generate all possible logical operators before you actually do the search, and then expand them out with all the physical operators. And in the original implementation it was also not easy to implement rewriting on predicates. Things like WHERE A = 1 AND 1 = 1: obviously 1 = 1 is always true, so you don't need to evaluate that predicate. In the original Volcano implementation it would be hard to prune those things out, because it only knew how to deal with things at the logical operator level; it didn't know anything about the internals of the predicates. So then a few years later Cascades came out, and again, Cascades is essentially the same thing as Volcano, but it's object-oriented, because that was the rage in the '90s. And they simplified how you do expression rewriting, to let you have sub-rules that take the WHERE clauses of queries and rewrite them into a more compact form that's more efficient. So I think I alluded to this last time, but I consider this the state-of-the-art approach, and I consider it the better approach, and this is what SQL Server uses. This is what was in Orca. There's a Java-based implementation of the Cascades framework called Calcite that's part of the Apache project. I don't know whether Hive or Presto or those guys actually use it.
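Before moving on, here's a very rough sketch of what that top-down search with branch-and-bound pruning and a sort-order enforcer might look like. Everything here, the types, the candidate operators, and the cost numbers, is invented for illustration; a real Cascades-style optimizer would be driving this through the memo table and firing transformation rules to generate the alternatives.

```cpp
// Toy illustration of top-down, branch-and-bound plan selection with a
// required physical property (sorted output). Hypothetical types and costs.
#include <cstdio>
#include <limits>
#include <string>
#include <vector>

struct PhysicalPlan {
  std::string op;                 // e.g. "SortMergeJoin", "HashJoin"
  double cost;
  bool sorted_on_key;             // physical property this plan provides
  std::vector<PhysicalPlan> children;
};

// Stand-in for the implementation rules that would normally fire here.
std::vector<PhysicalPlan> EnumerateAlternatives() {
  return {{"SortMergeJoin", 500.0, true, {}},
          {"HashJoin", 350.0, false, {}}};
}

PhysicalPlan BestPlan(bool require_sorted, double cost_limit) {
  PhysicalPlan best{"<none>", cost_limit, false, {}};
  for (PhysicalPlan candidate : EnumerateAlternatives()) {
    // Enforcer: if the parent requires sorted output and this operator does
    // not provide it (e.g. a hash join), put an explicit sort on top.
    if (require_sorted && !candidate.sorted_on_key) {
      candidate = PhysicalPlan{"Quicksort", candidate.cost + 100.0 /*sort cost*/,
                               true, {candidate}};
    }
    // Branch and bound: anything at least as expensive as the best complete
    // plan found so far gets pruned without exploring it further.
    if (candidate.cost >= best.cost) continue;
    best = candidate;
  }
  return best;
}

int main() {
  PhysicalPlan plan = BestPlan(/*require_sorted=*/true,
                               std::numeric_limits<double>::infinity());
  std::printf("best: %s (cost %.0f)\n", plan.op.c_str(), plan.cost);
}
```

The cost_limit argument is the piece that matters: a parent can pass its current best cost down, which is what lets Volcano and Cascades cut off whole subtrees without ever costing them.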
So I don't know exactly which systems actually use Calcite. And then, as I said before, this is a work in progress for us; one team is actually actively working on building our Cascades framework right now, so hopefully we can make some progress on that pretty soon. I should also add that MemSQL uses this. What other systems? Postgres doesn't, MySQL doesn't, SQLite doesn't, DB2 doesn't. Oracle, I actually don't know what they do, because they're kind of cagey about it. So again, as we'll see in the paper you guys read for next class, SQL Server's optimizer will not only have the best cost estimates, it'll produce the best plans of all these commercial systems. Right, so the memo table is a really important part of this. And the way to think about it is essentially as a graph structure. It's a compact form that keeps track of all the different alternatives, or transformed physical operators, we've produced in our query plan while searching for a single query. And it lets us quickly identify, when we do a transformation, whether we've already seen the resulting operator before, and if so, what its cost was. So it's a single data structure, a single place where the optimizer can store all the different things it's doing, keeping track of its history to do memoization or deduplication, and maintaining the cost information for the estimates it's making. So let's go through a quick high-level example. Say we have some query, ignoring other parts of it, that wants to do a join on tables R and S, and then we have a filter on attribute A in R and a filter on attribute B in S. And so what you have here is considered a group in the memo table, because you have one logical expression and then all the alternative expressions that can implement it, and these can be a combination of logical operators and physical operators. And then what happens is we expand out these different alternatives, the ways to implement these different steps, into these other groups here. So in this case here, to do the join on the two tables, I need to do my select on R with its predicate and the same thing with the select on S with its predicate. But I can also rewrite the plan in a different way, maybe put the predicate on R above the join and then do the join directly on these two tables here. So then this expression would map over to different groups. So you can have these groups point to other groups, and this forms the tree as you go down. So now for these guys here, to actually do the scans with the filter, you would have a physical operator that says here's how you would actually do the scan, either a sequential scan or an index scan. And again, at each of these different phases you can compute the cost of that particular operator, and then you aggregate them together, and that tells you the complete cost of the entire query plan. You can see this when you use EXPLAIN in Postgres: it shows you the tree of all the physical operators it's going to have, and each one has its estimated cost range. So a group is essentially that same idea, but it's not a direct one-to-one map between physical operators and logical operators, because you can do different transformations and it can go from one to n or n to one.
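To make the group idea a little more concrete, here's roughly the shape of that structure in code. The names and the single cost field are made up for illustration; a real memo table also deduplicates expressions and tracks the best expression per required physical property, not just one global winner per group.

```cpp
// Rough shape of a memo table: groups of logically-equivalent expressions,
// where each expression's inputs are other groups rather than operators.
#include <limits>
#include <string>
#include <unordered_map>
#include <vector>

using GroupId = int;

struct GroupExpression {
  std::string op;               // "Join", "HashJoin", "SeqScan(R)", ...
  bool is_physical;
  std::vector<GroupId> inputs;  // children are groups, not concrete plans
  double cost;                  // only meaningful for physical expressions
};

struct Group {
  std::vector<GroupExpression> expressions;  // logical + physical alternatives
  double best_cost = std::numeric_limits<double>::infinity();
  int best_expression = -1;                  // index into `expressions`
};

struct Memo {
  std::unordered_map<GroupId, Group> groups;

  // Record an alternative for a group, remembering the cheapest physical one.
  void AddExpression(GroupId id, GroupExpression expr) {
    Group &group = groups[id];
    group.expressions.push_back(std::move(expr));
    const GroupExpression &added = group.expressions.back();
    if (added.is_physical && added.cost < group.best_cost) {
      group.best_cost = added.cost;
      group.best_expression = static_cast<int>(group.expressions.size()) - 1;
    }
  }
};

int main() {
  Memo memo;
  // Group 0 is "R join S"; groups 1 and 2 would be the scans of R and S.
  memo.AddExpression(0, {"Join", false, {1, 2}, 0.0});
  memo.AddExpression(0, {"HashJoin", true, {1, 2}, 350.0});
  memo.AddExpression(0, {"SortMergeJoin", true, {1, 2}, 500.0});
  // memo.groups[0].best_expression now points at the HashJoin alternative.
}
```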
So again, it's a giant cache table that keeps track of all these things. So I talked a little about predicate expressions in the case of Volcano. The way to think about how predicates are represented in the query plan is that they're essentially a subtree of the query plan tree. Your WHERE clause, all the predicates inside the WHERE clause, can be stored essentially as its own abstract syntax tree. Like A = 1 would be the equals at the root, with A and 1 as children, and then you evaluate that to produce whether something is true or not. In the case of Postgres, they end up flattening the entire tree into a single list, and then you can just scan across it and apply it that way. That's not necessarily faster; it's just maybe easier for them to reason about. So in the same way that we can do transformations to rewrite logical operators in a query plan into physical operators, or into optimized variants of them, we can do the same thing with our expression trees. You can recognize when things are redundant: 1 = 1 is always true, so I don't need to keep that in the syntax tree, I can just always evaluate it to true. So in Cascades, they have the ability to do these additional rewrites as you do your transformations, to come up with more optimized representations of the same predicate expressions. And again, that's all baked into the transformation rules; it's not done in a separate phase, as it is in other systems. What I'll say about predicate evaluation is that regardless of whether you're using an AST or a flattened list, and regardless of whether you rewrite the predicates to be more efficient, if you always treat it as this nested structure inside a query plan where you have to chase pointers, that's always going to be super slow. So when we talk about query compilation and code generation, we'll see how you can make these things run even faster, whether or not you do the rewriting part. Because instead of evaluating some tree by traversing it, you can literally emit the instructions that perform the predicate evaluation, and that's super, super fast. And this is probably the most common technique when people do query compilation in a real system. So you still want to do all this rewriting, but you can get an even bigger performance improvement when you use query compilation. All right, so then of course in Cascades, because it's this branch-and-bound search, it can essentially go on forever. And typically you don't want that, because again, if your query is only going to take a second to run, you don't want to spend 20 seconds finding a good query plan. So there's a bunch of termination criteria you can use in the search. The easiest thing to do is set a wall-clock timer and say, I'll let the optimizer run for 10 seconds, and whatever the best plan I've found during those 10 seconds is, that's the one I'll use for my query. That's usually something they expose to the DBA, to say how long you actually want this thing to run for. Another, more nuanced, criterion is terminating the search when the optimizer finds any plan that has a lower cost than some predefined threshold for that particular query. This is not typically something you set as an administrator, because how are you going to know what a good query plan cost would be, right?
It's an internal number, a measurement inside the system, that doesn't have any real bearing on the outside world. So typically this is used for things like: say I have a prepared statement, and the last time I ran the query I generated a query plan that had a cost of X, so maybe I'll let the search run for 10 seconds, or stop immediately when it finds a query plan with a better cost than what it had before. This is a way to do re-optimization of existing queries. And then the last one is obvious: when you can't do any more transformations for a particular group, or you've searched the entire search space for the given query, you just stop, right? Because there are no more transformations to do, and therefore you've exhausted everything; you've essentially found the optimal plan. You can do this on a per-query basis, or you can also do it on a per-group basis inside the memo table. And for certain things, like the surgical queries we talked about before, it's pretty easy to exhaust this. But in a system like Postgres, they'll never get to this point, because the rewriter will immediately pick out what index you want to use, so they never even get to the dynamic search part. But this would be useful for things like: if you're running surgical queries inside your unified search model, as soon as you know you have the single index you're looking for, you can stop right away, because you know there aren't going to be other options. So typically, you definitely need these two; the middle one is optional, based on whether you can do re-optimization of existing statements or existing plans. All right, so now we can talk about Orca. So again, there is a Cascades paper that I didn't have you guys read, and the reason I didn't have you read it is that it essentially says a lot of the same things that were in the Volcano paper. The difference is that it's more object-oriented, and they talk about some of the lower-level implementation details that I don't think are super important; they just describe how it's better than Volcano. But Orca is an actual implementation of Cascades. It was originally written as the optimizer for Greenplum. Actually, before you guys read this paper, who here had ever heard of Greenplum? Nobody, okay. Who here has heard of Vertica? All right, a little bit more, okay. So I would put Greenplum in the same class of data warehouses that came around the same time as Vertica, Aster Data, DATAllegro, ParAccel. ParAccel is what Redshift at Amazon is now. They all came around at roughly the same time as these massively parallel data warehouses that let you run OLAP queries, or analytic queries, much faster than what Oracle and MySQL could do at the time. So the original version of Greenplum used the Postgres optimizer, and they refer to this as the original planner in the paper. So Greenplum does what everybody else does: you take Postgres, you rewrite a lot of it to make it do whatever it is you want beyond just being a single-node system. So they rewrote it to make it shared-nothing and a column store and other things, but they started off using the Postgres optimizer from Postgres 8.
So then they started writing their own optimizer, this Orca thing, and then they realized, oh, well, we have other database products at our company, in particular this thing called HAWQ, and it's going to need an optimizer too. So rather than making their thing specific to Postgres, they made it this standalone, monolithic program you can run external to the database system, with an API to get query plans in and out of it. Which I think is actually kind of interesting, because as far as I know, no other optimizer can run by itself. Volcano, Cascades, and Exodus, those projects were optimizer generators: you put in your rules and they spit out the code to implement your optimizer, but then you still needed to take that optimizer and embed it inside your database system. This thing runs completely by itself, right? So that's much different from everything else, because say your query comes in to either the main node of your distributed system or the single node of your database, and then some other node is where you actually run your optimizer. And so you use this XML API they provide to send over all the information the optimizer is going to need to make decisions, like the catalog, statistics, and the logical plan for every single query. You send it over to Orca, and it comes back and spits out the physical plan. So the story of Greenplum was that, again, it was one of these startups that came around the same time as Vertica. EMC bought them around 2010. And then EMC is like, oh, we own a database company, we don't know what to do with it. And then VMware bought some database companies, and they were like, oh, we don't know what to do with those either. So then VMware and EMC took their database companies, broke them off, and formed a new database company called Pivotal. So they got GemFire, they got Greenplum, some other things. And now Pivotal is what's in charge of Greenplum, and Greenplum is actually open source now. So Orca, again, is the optimizer that they use to support HAWQ, which is SQL on Hadoop, and Greenplum. So we didn't really talk about how to do multi-threaded search. There are papers from the late '90s, actually more like the 2000s, that talk about how to do this. Basically, it's a multi-threaded version of Cascades: instead of having a single thread do the entire search, you can have multiple threads run in parallel, and they sync up through the memoization table so they can reuse the computations of other threads. So the one part of the paper that you may have missed, and that I find the most interesting, and this is why I had you guys read the paper, is the things they say you need in your optimizer when you actually want to run it in a real system that you give to people and have them run themselves. They talk about the additional components they added to their optimizer so that when it runs out in the real world and has problems, they can figure out what was going on without requiring the client or the customer to give them remote access to the database, because very often they're not going to allow that. So they talk about this thing, remote debugging and dumps, for when the optimizer produces some kind of error or ends up in an incorrect state.
They can basically do a core dump of the entire state of the optimizer, including not just all the variables and the registers, but also all the inputs and other factors and environment settings that were in play in the system when the error occurred. And then they can ship that dump back to them at their offices, feed it into the optimizer, and it'll replay the search and put it back into the exact same state it was in when the problem occurred at the customer's site. So I think this is really clever, because it lets you, when problems occur, do all the debugging in a very complex piece of software that is, in some cases, non-deterministic, and figure out what put it into the state that caused it to fail. Again, if you have a segfault and just do a core dump in your C program, you get just the backtrace and whatever's in your registers and heap. You don't get the history of how it got to the point where it hit that error. So I think this part is actually very, very interesting. This is probably something we should include in our own system, but we're not there yet. And it's actually kind of interesting, too: this is something you have to do when you have people running your database system on premise. If you're giving people software that they run on their own machines that you don't control, you need something like this. If you're a cloud-based database system, like Redshift or the RDS stuff or Snowflake, whatever cloud database software is out there, you control the entire environment. You control the software, you control the hardware. So if your optimizer gets into a weird failure state, you just SSH into the machine that failed and hook up GDB to it to figure out what was going on. And you can do that because you control everything. But if you're giving people software that they run themselves, then I think you need something like this, which, like I said, I thought was clever. And then the next thing they talk about is how they do testing to make sure that their optimizer and cost models are producing accurate estimates. So what they do is they randomly sample the different query plans the optimizer is generating during the search, and then check whether the relative ordering of the cost estimates for those plans inside the optimizer matches the ordering you get when you actually run the real plans. So if I get two query plans for the same query, plan A and plan B, and say the cost estimate says plan A costs 100 (in some abstract unit, it's an arbitrary number) and the other one costs 200: when I actually run them, I want to make sure that if the first one runs in one second, the second one should not run in less than one second, because my estimator said the first one was faster than the second one. Now obviously you can't do this for all the possible query plans you look at while running your optimizer, so they have a way to do sampling and then do this checking iteratively as you go along, and there are statistical methods to see whether a mismatch is actually a true aberration or not. They have separate papers about how to do this. I think this was another clever idea, and something we should consider implementing in our own system.
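Here's the flavor of that check, boiled way down. The real Orca testing framework samples plans during the search and applies proper statistical tests; this is just a toy illustration with made-up names, checking whether the cost model's ranking of plans agrees with the measured runtimes.

```cpp
// Toy version of the "does the cost model order plans the same way reality
// does?" check. Hypothetical names; not Orca's actual test framework.
#include <cstdio>
#include <vector>

struct SampledPlan {
  double estimated_cost;   // what the optimizer's cost model predicted
  double actual_runtime;   // measured by actually executing the plan
};

// Fraction of plan pairs where the cost model and the measured runtimes agree
// on which plan is faster (ties are ignored). 1.0 means perfectly consistent.
double OrderingAgreement(const std::vector<SampledPlan> &plans) {
  int agree = 0, total = 0;
  for (size_t i = 0; i < plans.size(); ++i) {
    for (size_t j = i + 1; j < plans.size(); ++j) {
      double est_diff = plans[i].estimated_cost - plans[j].estimated_cost;
      double run_diff = plans[i].actual_runtime - plans[j].actual_runtime;
      if (est_diff == 0 || run_diff == 0) continue;   // skip ties
      ++total;
      if ((est_diff < 0) == (run_diff < 0)) ++agree;  // same relative ordering
    }
  }
  return total == 0 ? 1.0 : static_cast<double>(agree) / total;
}

int main() {
  // Two sampled plans for the same query: the estimate says the first one is
  // cheaper, so we expect it not to run slower than the second one.
  std::vector<SampledPlan> plans = {{100.0, 1.0}, {200.0, 2.5}};
  std::printf("agreement = %.2f\n", OrderingAgreement(plans));
}
```

If that agreement score drops, it's a signal that the cost model is mis-ordering plans for that kind of query, which is exactly the bug you want to catch before a customer does.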
All right, so, what are my parting thoughts on optimizers? Again, the reason I spent two days on this, whereas for other topics this semester I spent maybe a single lecture, is that I think this is a very important problem. It's very hard, and it's something we're actively building right now in our own system. To give you an idea of how hard it is, I love this quote from David DeWitt, at least I think it's from him; he claims it's anonymous, but it's in his slides, so I assume it's him: "Query optimization is not rocket science. When you flunk out of query optimization, we make you go build rockets." The point is that it's so hard that if you fail at it, there are other things people consider really hard that are actually easier than this. And we haven't even gotten into how to do cost estimation yet, which we'll see on Thursday, and that makes this even more complicated. So from what I can tell, the literature seems to suggest that with bottom-up versus top-down search, there's no real difference in the quality of the plans they can generate. It's my conjecture that if you do your memoization correctly, and pay the extra memory overhead, you can actually get better plans more quickly with the top-down approach. I can't prove that; if you let these things run forever, they'll always produce the same answer. But I think the key difference between the unified search and the stratified search is that the implementation is much easier with unified search. It's much easier to change it later on and add new functionality as you go along, because you don't have to worry about the ordering of your different stages. And again, none of this matters at all if you don't have a good cost model. If you can't correctly predict that one query plan is better than another, then it doesn't matter whether you're using unified search or stratified search, you're just going to get crap anyway. And as we'll see on Thursday, the way you get good cost estimates is that you need good statistics, but we'll see in the paper you guys are going to read how everyone's really, really bad at this, and how sampling might be a better way to go about it. So, any questions about query optimization, at least about implementing the actual search algorithm? In the back, yes. How far away do you think existing query optimizers are from finding an actual optimal plan? So his question is how far off do I think query optimizers are from generating the optimal plan, in commercial implementations or just in general? In general. I mean, in general, again, it's NP-complete. So if you have a two-table join, yeah, you can probably find the optimal join ordering for that. But for much larger queries, it gets really bad. I actually don't know what the state of the art is in terms of how good optimizers are, like how many tables an optimizer can support in a join. The Orca guys told me they were all proud because Orca could do 35 tables, right? The Splice Machine guy, or the MemSQL guy, told me they can do 135. So I don't really know what to believe; it depends on how big those tables are and what you're actually trying to do with them.
There is no, well, in the paper you'll read next class, you'll see that they have a benchmark to evaluate how good optimizers are at picking join orderings. But in practice, again, it's sort of like deep neural networks: it produces an answer that seems good enough, and for most people that's probably fine. Tweaking the plans generated by the query optimizer is probably what database administrators spend the most time doing. The optimizer will simply get something completely wrong, and then you go back, dump out the XML of the query plan, tweak it in a tool, and load it back in. That's a very common task that administrators are doing now. So I would say, again, I think this all hinges on having good statistics. And once you start doing a bunch of joins, your statistics get way off, right? Because you're trying to derive statistics from derived statistics, and it gets really bad. So a small error in the lower levels of the query plan can get exacerbated, or blown up to a much higher degree, as you go up. So to go back to your original question: I don't know how far off we are. My intuition, from talking to people, is that we're pretty far away. And I think some people are looking at applying deep learning to parts of the query optimization search to speed it up. This is something we're interested in doing; we haven't started yet. It's not clear whether there's any magic bullet coming down the pipeline that's going to make this all go better. I don't think, for the hardware we have, there's going to be some major algorithmic breakthrough that makes this all go better. Cascades is twenty-something years old now and it's still considered state of the art. It's like quicksort: there's no magic to make quicksort super fast for any possible use case. There are tweaks for very narrow data sets, but in general, there's no magic way to make quicksort faster than it is now. So, okay, yes. With the cost threshold, as a stopping criterion, do prepared statements normally start over every time, or since you remember the entire state, can you just pick up where you left off? So the question is, in the case of the cost threshold, where I said that as far as I know you do this when you do re-optimization for existing plans, is this something you would do for a prepared statement? Absolutely, yes. So remember, with a prepared statement you have those placeholder variables, and since you don't know what they're going to be at runtime, you can just take maybe the average value for whatever it is you're looking at and use that to build your query plan. So now you have a cost. So now you can come back and say, once I know the real values, I know what the best cost was when I took the average, and I'll search a little bit to see whether I can find a query plan that's better than that. That's exactly what this is for. Yes. But can you save the state of the optimizer for later use, like partway through the search? So his question is, can you save the memoization table and the state of the search, so that when you come back, you don't apply all the transformations all over again? Yeah, I suppose you could. I don't know if anybody does that.
Again, I think this is only used in the commercial systems; I don't think Postgres would do this. You'd have to force it to do a complete replanning, and it might just throw away what it did before. Actually, I'll also say that the tricky thing about this is that the cost can change, right? Because the database can change. So if I generate a query plan for a prepared statement when the table is empty, it's going to have one cost. Then I load in a billion tuples, and now that cost is completely invalid, because the database is much bigger than it was before. So in that case, I don't want to use the cost from the prior plan, because I'd be searching for a lower cost that it's never going to reach; it's never going to match the cost from when the table was completely empty. So again, doing this one, I think, is really tricky. It's pretty much the first one and the last one that most databases actually implement. That's a good point, yes. Yeah, I guess I didn't really talk about this, but in the case of prepared statements, how do you decide when you replan? I know in commercial systems you can force it to replan things, take the prepared statement and say, go ahead and re-optimize it. I think in Postgres, you would just delete the prepared statement and then recreate it. That essentially does the same thing, but it's not automatic. That's a good point, though. Okay, so in the interest of time, any other questions? All right, so I want to jump to project three. So again, for the final project for this class, you guys are going to have to implement some new feature, new functionality, or new component in Peloton, our database system. And it has to be a substantial amount of work, right? Because there are three people per group and you're going to have two and a half months to do this. So you can't just say, we implemented nested loop join, right? It has to be something more significant than that. And I would say the best way to be successful in this is to not pick something just because I told you to implement it; pick something that you're actually really interested in. So if you're doing research in another class, or you're doing research on your own in graduate school, by all means pick something that's related to what you're already doing. Like if you're doing PL research, and for whatever reason you want to apply some of those techniques inside our database, that might be the right thing to do. So in the end, it'll be up to my discretion to decide whether your project has the appropriate scope. And the key aspect is that you have to pick something that's unique from everyone else. I don't want two groups both implementing UDFs, right? I'm going to make sure everyone picks different things. So what you guys are going to have to produce during the semester are five things. We'll talk about each of these as we go along, and they'll be due at different points in the course.
And that way it's sort of a checkpoint, a snapshot, for you guys to make sure that you're progressing on the project as the semester goes along. Because I know how it's going to be at the end of the semester: come spring, it's going to be warm outside, you're going to have a bunch of other stuff you need to do for all your classes, and I don't want you scrambling the last two days to try to finish your project, because that's not going to work; the amount of code you have to write is not something you can do in a short amount of time. So as we go along in the semester, we're going to have different checkpoints to make sure that you are making progress. So on the first Tuesday after we come back from spring break, so two weeks from now, every group is going to have to come up here and give a five-minute presentation about what they think their topic is going to be. This proposal isn't just going to be, hey, we think this is cool, this is what we're gonna do. I want you to spend time over spring break actually looking at the code, or talking with me or Dana or the other people working on the database system, to figure out what it's going to take to actually implement the thing you're thinking about doing. And the easiest way for me to know whether you've thought this through correctly is if you can identify which files you're going to have to change, right? You also want to think about what test cases you're going to need to write to check whether your implementation is correct. I'll talk about the different modes of testing we have in our system: you guys have looked at the unit tests for your skip list and its compare function, but we have other mechanisms to do SQL functionality testing at a higher level that you may want to use as well. And then you also want to describe what workloads you're going to use to check whether your project is working correctly. This is part of the reason why, when you guys did the reading reviews, the synopses, I had you list the benchmarks or workloads the papers were using: because when it comes time to pick what you're going to do for project three, you don't have to come to me and say, hey, I wanna do transaction stuff, what workloads should I use? Everyone here should know what database benchmark to use. It's TPC-what? What's that? YCSB? No, no, YCSB is the Yahoo one. There are all these different benchmarks, right? TPC-A, TPC-B, TPC-C, TPC-D. For transactions, the benchmark you want to use is TPC-C. If you're doing analytics, you want to use TPC-H, right? So just list which of these things you want to use, and we have code already written for you, so you don't have to implement this stuff yourself. So then, sometime in April, you'll do a status update. You basically come in and say, here's how far we've gotten in our system, here's what, if anything, has changed in our outline or plan, here's why it changed, and here's how we're accounting for it.
And then I think it's also useful to say if there were any surprises you found when you were actually in the bowels of the system, whether it's something really cool that you didn't think existed, or something really terrible that we wrote and you hate. I think it's good to come in and share these kinds of things, so that if other people end up looking at similar code for their project, they don't hit the same problems, or they can find the same things you found; that way we're sharing all the information we collect as we go along. Then, from now to the end of the semester, there will also be required code reviews. So last year we did these code reviews in the last week of the semester, right before everyone had to turn things in, and this turned out to be a disaster, because not only were you trying to write the code you needed for your final presentation, but you also had to go review somebody else's code and get feedback on your own code. Everyone really liked this and thought it was helpful, but they just wished they had more time to take the feedback and make changes to their code. So what we're going to do this time is two separate code review rounds during the semester. So right around when we do our update, each group will be assigned to another group, and basically you go read their code; GitHub has a nice reviewing mechanism where you can write comments and say, you know, this doesn't make sense, fix this, fix that. And again, everyone here knows how to program, but you'll make yourself a better programmer when you see other people's code and when you get feedback on your own code. This obviously means that in order to get a good grade on this, because it's graded on participation, you have to have code for somebody else to review. Right? Not only do you get graded on the review you do, but if the other group can't review your code because you haven't written it yet, then I'll deduct points from you. So it's very important to treat this requirement, this deadline, seriously: have code in a reasonable state where there's something to look at, and then provide feedback for somebody else. Because in the end, it'll make your project better. And it sounds like I'm just making you do busy work, but this was the main feedback I got last year when I talked to people about the course: everyone really liked the code reviews, they just wished there were more of them and that they had more time. So then for the final presentation, you'll have 10 minutes on whatever day we're assigned for the final exam. I don't know the exact day yet, but I think it'll be in some other room, not this one. We'll get pizza, we'll get wings or whatever other food people want, and you come and do a presentation about what you actually accomplished in your project. This is when you want to include any kind of performance measurements you've done, or benchmarking with different workloads, to see how well your implementation is actually working. Last year we had students actually do live demos of the functionality they added to the system, which I think people always really appreciate. If you just have graphs showing the numbers go up, that's not really that interesting. Having live demos is really cool.
All right, and then when you finish up the project at the end of the semester, you will not get a final grade for the project until you give me a code drop. And a code drop is not considered complete until I know that it can be merged into the master branch without any conflicts and without any test failures, any comments or feedback you got during the code review process are addressed and fixed, and you have test cases to ensure that your thing is working correctly. You have to pass both the old tests and the new tests. Now, if there's an old test that you need to break or throw away because you've rewritten some major part, for those kinds of things you have to come and ask me, right? I don't want you just commenting out code because you don't like it. So if there are tests that break because you changed things, I'm okay with you removing them, but you have to ask me first. And you need to provide documentation that says basically what your thing does and how someone can actually use it. Now, about merging: obviously not everyone's code will end up being merged. I think last year 50% of the projects made it into the system. The new optimizer that people built last year actually went into a side branch, and we eventually merged that back in later on. So obviously the order in which you do merges can matter a lot, because if you're last, you have to make sure that your thing doesn't conflict with everything else that came before it. So, to be fair to everyone, we'll do this in random order, and then maybe we'll make a decision about whether you think your project is finished and should be merged into the full code base, or whether it's something you maybe want to continue as a capstone project in the fall, in which case you could put it on the side and hold off on having to merge it until later, or put it into a special branch. So again, we'll play this one by ear and see what people do. Now, in the last 15 minutes, I want to talk through some simple projects that I think people could look into. I'm not going to talk about the query optimizer and the system catalogs, because these are active projects that people in the class are working on right now. Patrick and Chan and Bo are looking at the new query optimizer; Yingxin, Mengren and Leon are actually building the new catalogs for us. So I'm just going to talk about all these other ones here. But I will say that I'm trying to pick projects that do not rely on those guys finishing up in time for you. For all of these that I'm showing here, you can implement them without having the catalogs or the optimizer in place; you can use what we already have and still implement them. And all of these are designed to be independent from each other, because if you depend on code from somebody else and they flake out, you don't want that to affect your grade. I will say also that last year, one thing that worked really well was that people were fixing certain bugs that were affecting them, pushing patches to the master branch, and sharing them with everyone else. So even though everyone's working on their own project, there may be things that you fix as you go along that you want to share with everyone else. I'm totally okay with merging that into the master branch and propagating it to everyone else as we go along.
All right, so materialized views are something that I'm interested in doing. A materialized view is sort of like a view, but instead of recomputing the view every single time you run the query, you maintain an underlying internal table that is updated anytime the base table is updated. That way, when you run a query against a materialized view, you don't have to run the view's query all over again. The way you can implement this is you first have to implement triggers, and then you maybe implement some simple way to do delta updates to your materialized views. And you can do this without relying on having a query optimizer account for it: you can write test cases and the basic things that just treat it as a regular table, even though underneath the covers it's being materialized incrementally. Another aspect of this, which I guess I didn't talk about but could talk about later, is doing this in the context of the LLVM engine, which I'll get to in a few slides; that would actually be very interesting as well. How do you maintain materialized views in a query compilation architecture? Constraints are a very important feature in a database system, and we do not support them in our current implementation, other than, I think, primary key constraints. So the idea here is that you would implement the core functionality to make sure that people can't insert things that they shouldn't be inserting. Like, you can have check constraints to make sure people can't insert negative numbers, and make sure people insert the right types into tables. And then the ultimate goal is to implement foreign key constraints. This one's more tricky, because you have to provide secondary indexes so you can do efficient lookups to see whether the key actually exists when you insert things. So this would be a combination of updating the infrastructure in the table to account for constraints, as well as actually implementing the functionality to update the secondary index or foreign keys or whatever else you're using to maintain this constraint information as new data comes in. UDFs are a way to implement functionality in SQL that doesn't already exist. So think of this as, say there's a string function that I want but the database doesn't implement it; I can implement it through the UDF API, store it in the catalog, and then invoke it as if it were substring or any other built-in SQL function. We had students work on this last year, and what they ended up doing was this: in the previous version of Peloton we had basically the whole Postgres front end, so they were able to hook in the old UDF parser and API to make that work in our expression system. Of course, over the summer we ripped out all of Postgres, so none of that code exists anymore. So the idea here would be to bring back in just the pieces we need from Postgres and integrate them into the full system. You'd have to update some catalog to keep track of what UDFs you have defined, and you may have to modify the underlying implementation to use our tables instead of Postgres tables.
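To make the UDF idea a bit more concrete, here is a minimal C++ sketch of the general shape: you register a function by name in some catalog, and the expression evaluator invokes it like a built-in. The class and method names are invented for illustration; this is not the actual Peloton (or old Postgres) UDF API.

```cpp
// Hypothetical sketch of a UDF catalog: register a function by name,
// then invoke it as if it were a built-in SQL function.
#include <functional>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

using Value = std::string;  // pretend every SQL value is a string, for simplicity
using Udf = std::function<Value(const std::vector<Value> &)>;

class UdfCatalog {
 public:
  void Register(const std::string &name, Udf fn) { udfs_[name] = std::move(fn); }

  // The expression evaluator would call this when it sees a function
  // name that is not one of the built-ins.
  Value Invoke(const std::string &name, const std::vector<Value> &args) const {
    auto it = udfs_.find(name);
    if (it == udfs_.end()) throw std::runtime_error("unknown UDF: " + name);
    return it->second(args);
  }

 private:
  std::unordered_map<std::string, Udf> udfs_;
};

// Usage: add a "reverse" string function the engine doesn't have built in.
//   UdfCatalog catalog;
//   catalog.Register("reverse", [](const std::vector<Value> &args) {
//     return Value(args.at(0).rbegin(), args.at(0).rend());
//   });
//   catalog.Invoke("reverse", {"abc"});  // -> "cba"
```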
All right, so the one thing that I'm really interested in this semester, in addition to the catalogs and the optimizer, is this new LLVM engine that we've been building out. Actually, who here knows what LLVM is? Some people. So LLVM stands for the Low Level Virtual Machine. Think of this as like the JVM, but instead of using Java byte code, you generate this IR, the intermediate representation language, dynamically, directly inside of C++, and then you can fire it off into LLVM to execute it for you very, very efficiently. So think of it as a way to embed something like the JVM without bringing in all the overhead of Java and its garbage collection and all that other stuff, right? We'll talk about query compilation and code generation after spring break. The spoiler is that the performance difference you can get with the LLVM engine is quite significant; I think in our current experiments it's like 50 to 100x faster than the current system. So obviously we wanna get away from the current execution engine that we have in Peloton and switch everything over to this stuff. My students are now working on porting this code into the current system, but the initial version will only support selects. So what'll happen is you'll go through the query plan, or the optimizer, and you have a Peloton query plan; if it's a select, then it goes down into LLVM. If it's an insert, update, or delete, it has to go through the interpreted plan, because our LLVM engine doesn't know how to compile update queries or transactional queries, right? So for this project, for someone who's very interested in LLVM and these low-level optimizations, you would add support to generate query plans that can be compiled for inserts, updates, and deletes, right? We already have code that shows you how to do this for select statements, so doing deletes, for example, would just be an extension of how you do selects. And again, the performance difference you can get is quite significant. There's a bunch of other machinery you need to make this all work, which I'll talk about later when we cover code generation and query compilation, but to avoid having to regenerate the query plan over and over again, systems like MemSQL will extract out all of the constants from your query plan and cache the compiled LLVM plan. So think of this as automatically generating prepared statements without you ever defining a prepared statement. We're working on a way to support that. And then, because now you're doing transactions and you're updating things, you need to account for tuple visibility under your MVCC protocol in the compiled code. So I would say that this project is something you can make a good start on this semester, and if someone's interested in coming back and doing a capstone or independent study starting in the fall, because I think this is a bit more complicated, this would be an excellent project for that. And there's enough stuff to do in this that we can find something for everyone.
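As a rough illustration of what generating IR directly inside of C++ looks like, here is a small, self-contained sketch that uses LLVM's IRBuilder to emit a trivial add1 function and print the resulting IR. It deliberately leaves out the JIT machinery (MCJIT or ORC) you would need to actually execute the code, and it is not the Peloton codegen layer, just the underlying LLVM C++ API.

```cpp
// Minimal LLVM IR generation example: build "i32 add1(i32)" and dump the IR.
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/Verifier.h"
#include "llvm/Support/raw_ostream.h"

int main() {
  llvm::LLVMContext ctx;
  llvm::Module module("example", ctx);
  llvm::IRBuilder<> builder(ctx);

  // Declare the function signature: i32 add1(i32)
  llvm::Type *i32 = builder.getInt32Ty();
  llvm::FunctionType *fn_type =
      llvm::FunctionType::get(i32, {i32}, /*isVarArg=*/false);
  llvm::Function *fn = llvm::Function::Create(
      fn_type, llvm::Function::ExternalLinkage, "add1", &module);

  // Emit the body: return arg + 1
  llvm::BasicBlock *entry = llvm::BasicBlock::Create(ctx, "entry", fn);
  builder.SetInsertPoint(entry);
  llvm::Value *arg = &*fn->arg_begin();
  builder.CreateRet(builder.CreateAdd(arg, builder.getInt32(1)));

  llvm::verifyFunction(*fn);            // sanity-check the generated IR
  module.print(llvm::outs(), nullptr);  // print the textual IR to stdout
  return 0;
}
```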
Okay, the next thing we're interested in doing is adding support for multi-threaded queries. Right now, because we were originally based on Postgres, we have this sort of legacy architecture, if you will, where a single query comes in and is handled from beginning to end by a single thread. Instead, you could take a query, break it up into different chunks, and have multiple threads run them at the same time. So we're interested in breaking this model and allowing a single query to be split up across different threads. We have one student that's finishing up code this month on disconnecting the networking threads from the worker threads, so now there's a worker thread pool that can take tasks and execute them for queries, but we're missing the ability to take the query plan, break it up into subtasks, have those run in parallel, and then coalesce the results and put everything back together. We'll talk about multi-threaded query execution and scheduling later in the semester, but the thing that makes this more tricky, related to the LLVM stuff that I talked about before, is that we don't wanna do this for the interpreted engine; we wanna do this for the LLVM engine, and that makes it a little more difficult. You can do this for the interpreted engine, we had students do it last year, but doing it for LLVM is more complicated. We'll also talk more about how you handle memory stored on different sockets and make sure that you assign threads to the local memory, again, after spring break. We're also very interested in database compression, so we're talking about dictionary encoding and delta encoding. The idea here is that you wanna add support for this inside of Peloton, and the nice thing about it is that, if we do this correctly, the compression scheme can be independent from all the upper levels of the system: you don't have to worry about modifying the executors or the query plan or the optimizer, you just implement all the compression directly inside the data tables, and I think everything should just work. I would say dictionary encoding is probably the better one to do instead of delta encoding, because it's more general-purpose. And we wanna do this in an environment where not only are we running OLAP queries, but we're also running transactions that update the database, so we may need to modify the dictionary as we go along.
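To give a feel for the dictionary-encoding idea, here is a minimal hypothetical C++ sketch of a dictionary-encoded string column. The class is invented for illustration and ignores Peloton's actual table and tile layout; the point is just that each distinct value is stored once and the column itself holds small integer codes, so equality predicates can compare codes instead of strings.

```cpp
// Hypothetical dictionary-encoded string column (not Peloton code).
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

class DictionaryColumn {
 public:
  // Encode: store a small integer code instead of the full string.
  void Append(const std::string &value) {
    auto it = dict_.find(value);
    uint32_t code;
    if (it == dict_.end()) {
      code = static_cast<uint32_t>(values_.size());
      dict_.emplace(value, code);   // value -> code
      values_.push_back(value);     // code  -> value, for decoding
    } else {
      code = it->second;
    }
    codes_.push_back(code);
  }

  // Decode a row back to its original string value.
  const std::string &Get(size_t row) const { return values_[codes_[row]]; }

  // Equality predicates can be evaluated on the codes alone.
  size_t CountEquals(const std::string &value) const {
    auto it = dict_.find(value);
    if (it == dict_.end()) return 0;
    size_t n = 0;
    for (uint32_t c : codes_) n += (c == it->second);
    return n;
  }

 private:
  std::unordered_map<std::string, uint32_t> dict_;
  std::vector<std::string> values_;
  std::vector<uint32_t> codes_;   // the compressed column data
};
```

Note that updates can introduce values the dictionary has never seen, which is exactly the "we may need to modify the dictionary as we go along" issue: Append simply grows the dictionary here, but a real implementation would also have to worry about concurrent writers and about what happens when the code width overflows.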
The next one that we're interested in is adding support for alternative networking protocols. Currently we only support ODBC and JDBC over the Postgres wire protocol, but it's very common now to have different interfaces or different methods for people to read and write data in your database system. In particular, we see a lot of data systems adding support for Kafka or Memcached, and the idea here is that instead of writing a SQL statement to do a get or a set, you could implement the Memcached protocol and go directly to the system: don't worry about parsing SQL, just do a direct call. We had a student actually work on this last year, so we have some of the Memcached code that you can reuse, but we don't have any of the Kafka code. So if someone's interested in doing Kafka, with ingestors and, I guess, whatever the Kafka parlance is for sending data out, emitters, if you wanna ingest and send out data to Kafka, you can implement that. The tricky thing about this particular project, and this might be another one that rolls over into a capstone project, is that we're probably gonna have to rewrite, or not rewrite but refactor, the networking code that we already have, because it's doing a lot of different things all at once. The networking code that opens up a connection and handles the TCP packets back and forth also contains the code to do query plan caching for prepared statements, and it also contains the code to figure out the correct state of transactions. So in the current implementation it would not be very easy to implement these two additional protocols; you'd wanna refactor the code and make it a bit more generic. Again, I can talk more about this later on. Next, statistics: currently our database system doesn't maintain any statistics about what's actually inside the database, so we can't tell you the cardinality of anything, and we don't have any histograms of what the values actually are. For this project, you would implement all the mechanisms you need to actually collect this data. There are two basic approaches, which you'll see in the paper for next class: you can do it lazily, sort of in the background, where you collect statistics and store them in your catalog, or, for every query that comes in, you can do a little bit of sampling of the data and figure out what the estimates should be on the fly. Whichever one you choose is up to you, but once we have the ability to collect the statistics needed for a particular query and some predicate, it'd be nice if we could wrap all this together into a new cost model, which could then interface with the optimizer team's new Cascades implementation. That would sort of be the stretch goal for this one. And then the last one is alternative storage. Right now in Peloton we can only store data in our in-memory storage engine. Things like Postgres have foreign data wrappers; this is sort of like, I guess, the reverse of the Kafka and Memcached project, where instead of taking new stuff in and out at the query level, you're at the bottom level of the system and you want to read and write data from some remote source. In Postgres you can do this with foreign data wrappers, and we had this in H-Store with anti-caching, where you can dump data out to disk when you run out of space in memory. So the idea is that you basically implement a new storage API with a clean abstraction layer, so that you can still maintain the in-memory data table format that we have now but also read and write to some other external sources. To start off, I would recommend using something like LevelDB or RocksDB, something really simple. And of course you have to update the catalog so you actually know what's being stored where.
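Here is a rough sketch of the kind of narrow storage abstraction that project would introduce. The interface and names are hypothetical, not Peloton's actual API; the idea is simply that the existing in-memory tables and, say, a LevelDB- or RocksDB-backed engine would both sit behind the same small set of methods, so the layers above don't change.

```cpp
// Hypothetical storage abstraction layer (not Peloton's real API).
#include <map>
#include <string>

// Narrow interface; a LevelDB/RocksDB-backed engine would implement the
// same three methods without the upper levels needing to know.
class StorageBackend {
 public:
  virtual ~StorageBackend() = default;
  virtual void Put(const std::string &key, const std::string &value) = 0;
  virtual bool Get(const std::string &key, std::string *value) const = 0;
  virtual void Delete(const std::string &key) = 0;
};

// The current in-memory table format would sit behind the same interface.
class InMemoryBackend : public StorageBackend {
 public:
  void Put(const std::string &key, const std::string &value) override {
    data_[key] = value;
  }
  bool Get(const std::string &key, std::string *value) const override {
    auto it = data_.find(key);
    if (it == data_.end()) return false;
    *value = it->second;
    return true;
  }
  void Delete(const std::string &key) override { data_.erase(key); }

 private:
  std::map<std::string, std::string> data_;
};
```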
All right, so just to finish up really quickly: I went through these different projects really fast, and I'll be around over spring break if you want to talk, you know, in a little more detail about what you're actually thinking about doing. For testing, we have a SQL-based regression test suite that we're expanding right now and can provide for you guys to make sure your thing is actually working correctly, but you're also going to be required to implement your own tests, whether that's done at the SQL level or the unit test level; you probably want to do both, but to what extent you test one thing more than another depends on what you're actually doing. And then, for computing resources, you have all the MemSQL machines in that cluster that you had access to for projects one and two, and you can use them for project three. If you're doing a project where you think you need special hardware, then please let me know and I can get it for you. Last year we had students implement the Silo logging and checkpointing scheme, and they wanted parallel disks to be able to test their implementation, so we went and got them some SSDs to put into the machines. So if you need something, let me know and we can make it happen. Again, none of you should ever be implementing TPC-C, YCSB, or TPC-H in your life. I have an open-source benchmarking framework that goes through JDBC and is ready to go for use with Peloton for all your projects. I know for a fact we're not going to support all the workloads it has in it: we know TPC-C works, TATP should work, and YCSB should work. For the more complicated ones, talk to me and we can figure out ways to get around it. This is somewhat limited right now because our query optimizer is sort of hard-coded for TPC-C and YCSB, but as we expand it we should be able to support other stuff for you. But again, the bottom line is that no one here needs to implement TPC-C; we already have it for you, ready to go. All right, so to finish up: on Tuesday, March 21st, which is the first Tuesday after you come back from spring break, everyone will stand up here and do a five-minute presentation about what their project topic is going to be, and you also have to send me a little paragraph just so I know what you're actually going to work on. I'll send an email out with a link to a Google spreadsheet where you can list who's actually going to be in your project group and what your topic is going to be, and that way I can make sure that everyone's doing something different. I'll be around all during spring break, so if you have questions or want more details about any of the things that I talked about here, I'd be happy to meet with you one-on-one and discuss it further. So, any questions about project three? All right. Next class we'll spend more time talking about cost models. I'll also spend a little bit of time on some general hints about, you know, things to do when you start working on a large code base. In the case of the skip list and the compare function, you guys were only looking at a small subset of the system, right? You were only looking at code that we pointed you at and said, here are the files you need to modify. But in project number three you could be looking at anything, so I want to provide you with some general advice on how to, you know, deal with working in a large code base with a lot of code that you didn't write. And then I'll also announce what the extra credit assignment will be. Okay, any questions? All right, see you guys on Thursday.