Today, we're going to talk about networking protocols. This is how we're actually going to be able to communicate with the database system. Before we get into that, I want to quickly go over what's on the agenda. So again, project one is due tonight at midnight. Who here is done? It's quite a number of you. Yes, good. Excellent. All right. Then project two will be released today, so I'll talk about the various projects at the end of this lecture. If you haven't signed up yet on the sign-up sheet, please go do that. I know at least three or four of you are looking for a group; we'll put you in a group. Then I'll also post the webpage tonight with the exact details and due dates for all the various parts of this project that we'll talk about today. Then next week we'll give out the extra credit so you can sign up for that. The midterm exam, again, I'm passing around. This is the midterm exam I gave last year. So there are two stacks of papers: one is the actual exam, one is the exam with the solutions. You want the sheets from each one, stapled together. Then again, we'll announce this on Piazza, but after spring break, all of you are going to come back in class on March 18th and give a five-minute presentation on what you're going to do for the second project. Okay. All right. So, midterm exam. Next Wednesday in this room, 3 p.m., right where you're sitting now. It's going to be similar to what I'm passing out now from last year: a mix of multiple choice and short answer. I will say that last year, everyone complained that it was too hard. They weren't wrong. Part of the reason why maybe you said it's too hard is that this is graduate school. It's not like, hey, what does this paper say? I don't care about that shit. I don't care that you can remember or wrote down in your notes exactly what this paper did. It's more about the bigger ideas: can you synthesize the things that we've talked about and start to conceptually put them together into a system, and understand the trade-offs and the various design decisions you have to make when you do these things. So you'll see this in the questions I've passed around, right? It's not, how does this work or what does this paper say? It's more, what's the big picture? Because that's what's really important. You're not going to remember five years from now, if you're off at a job or doing research somewhere, what any of these papers said. It's more about the big picture ideas. That's what I care about. So for this reason, it's closed notes, because, again, I don't care about what each individual paper says. The final exam will be take home, so for that one you can read whatever you want. But for this, I want to keep it focused on the big picture stuff. I've already posted on Piazza what topics the exam will cover. But in terms of the lectures, if you go look at the schedule, it'll be lecture one to lecture twelve inclusive. This is lecture 13, so lecture twelve would have been, what did we do last class? Checkpoints and recovery, right? We're not going to do networking. We're not going to do thread scheduling. This lecture and the next lecture next week won't be on the midterm. So it's everything from last Monday back to the beginning of the semester, okay? Any questions? Okay, and if you just walked in, we're passing around two packets of paper.
One will have the solutions and one will have the exam questions from the midterm last year. You want two sheets from each packet, stapled together, okay? I think you guys haven't gotten it yet, so we'll make sure it makes its way over here when you guys are done, okay? All right. So now to provide an overview of where we're going in the course, what we've done so far, and what's ahead of us. This is a high-level overview of the database system we're trying to build in our minds as we discuss these topics, and what it actually looks like. So we have our application server, and it's going to send us a SQL query that it wants to execute in our system, and the first thing it's going to land in is what we're calling the networking layer, right? Which is what we'll talk about today: how do we actually communicate with the database server and send it a query. But then this is going to go down the stack and actually go through the planner, the compiler, execution engine, and storage manager. So the planner would be actually parsing the SQL query, looking in the catalog and figuring out how to map the name of the table you're accessing to some internal identifier; that's called the binder. Then there'd be a rewriter phase, which is doing static changes to the query plan or the query itself to make it more efficient, and then there'll be the optimizer with the cost model. Then it goes to this compiler, and this will be sort of unique to the types of database systems we're talking about in this class; this is what modern systems actually do. This is not something we covered at all in the intro class, and many traditional database systems don't do this, but the performance benefit is quite significant. Again, this will make more sense as we go through the rest of the semester. Just know there's a step in here. Then we have the execution engine, where we're doing thread scheduling, thread placement or task placement, concurrency control, operator execution, indexing, and then we land down in our storage manager, where we have different storage models, different layouts, and where we do logging and checkpoints. So far in the semester we've covered these topics here. So we know how to build sort of the bottom part. We'll get to thread scheduling on Monday, and then we'll talk about operator execution after the midterm. That will be how you do low-level parallel vectorized execution of joins and things like that. But for the rest of the semester, at a high level we're going to go in this order. Today we're going to start with the networking layer. Then we're going to jump back down and work our way up: the execution engine, the compiler, and then the last classes will be about query optimization. In other years I've done it in order from the top down. For this year I've sort of front-loaded it; I've added some new material that didn't quite fit into this hierarchy, and so that's why it's slightly different. So this is where we're going. Today we're talking about how to actually take the queries in. Then starting Monday next week, and after the midterm and spring break, we'll talk about how we actually run the queries. And then we'll get into the compilation parts, which again is a more efficient way of running stuff. Okay? So this is where we're going. All right, so today's agenda is the following.
We're going to spend most of our time talking about networking protocols, or wire protocols: what they actually look like and why we need them. Then we'll talk about how the operating system is our enemy and how we try to avoid it by using kernel bypass methods. And then we'll finish up talking about the various topics you can do for project two. Okay? All right. So essentially the problem we're trying to solve today is how programs can access a database. This is called the access API. All the demos that I've given so far in this class, when I open up the terminal, all of that was through a command-line interface where I would go on my keyboard, type some SQL queries, hit enter. The terminal would then send a message to the server, here's the SQL query I want to execute, and then it would execute and come back with the result. So that works for, again, simple demos, but this is not how real programs are actually going to interact with the database. (Can you pass this along? Because these guys need it too, sorry. Yeah.) That would be super slow. If you're sitting there scraping the terminal and then parsing the text, no one would write an application like that. So instead we're going to look at how we're actually going to do this in real programs using different access APIs. And there are basically three broad categories of how you can do this. The first is direct access, which is just a low-level API to access the system directly. This will be vendor specific or database system specific. And then we'll talk about two categories of libraries that are sort of the standard you would use for different applications, like ODBC and JDBC. So the way to think about this first one here is, how do I say this, if you've ever used SQLite in a C program, they have their own low-level libraries you link in that allow you to interact directly with SQLite. Or Postgres has another one, I forget what it's called though. But again, these are libraries that are very specific to the database system and very specific to the language you're implementing in. The problem is, as I said, they're database system specific. So if I write my application using the SQLite library access methods, and then I want to switch over from SQLite to DB2 or Postgres, it's not going to work, because I'm making SQLite-specific calls and DB2 doesn't know anything about those. So that's the problem the other two categories are trying to solve: they're trying to provide a standardized interface that allows me to write programs that interact with databases and that are portable across different database systems. So probably the most famous one is called ODBC. This is, again, a standard API that we can use in our program to access the database, and it's designed to be independent of what the database system actually is and what operating system you're running on. So you could have ODBC drivers for Windows, Linux, and whatever. Typically they're written in C or C++, but then you would have language bindings to make it work in Python or whatever other environment you want to use. So although this wasn't the first attempt to actually try to make one of these database connectivity libraries, for whatever reason, this one took off.
In the late 1980s, some people out of Sybase started working with other vendors and tried to actually make a standardized library. I think it was called DB-Library, where they took what Sybase did, ripped out all the Sybase-specific calls, and tried to make it generic. For whatever reason, that didn't take off. And then Microsoft got hooked up with this other company called Simba Technologies, and they invented the standard called ODBC. For whatever reason, this just became what everyone follows today. So pretty much if you're going to be a database system that you want people to actually use, you have to support ODBC. And it's actually not just specific to relational database systems. A lot of the NoSQL guys that don't support SQL will have an ODBC driver as well. In the paper you guys read, they were comparing against MongoDB, and MongoDB doesn't support SQL, but they have an ODBC library. Because, again, the idea is that it's hiding away, I mean, it's not hiding away the query language, but it's hiding away the actual wire protocol implementation to actually communicate with the database. So the basic idea looks like this. ODBC uses what's called a device driver model. What that means is that you're gonna have this driver, which is just a library you link into your application, that provides a standard API to your application. But then underneath the covers, it knows how to take whatever requests the application makes against that API and convert them into requests that go over the network to communicate with your database server. So in the parlance that we're gonna care about, we're gonna call this part here, the actual construction of the packets and the state machine we have to follow to process the requests in these packets, the wire protocol. And this is the part that's gonna be specific or proprietary to pretty much every single database management system. Oracle has a wire protocol, DB2 has a wire protocol, MySQL and Postgres, they all have wire protocols that don't really look anything alike. At a high level they do, but the actual contents of the packets are completely different. So this is the part that is vendor specific, and then they will each implement their own ODBC driver. Like, you would download the Oracle ODBC driver and the Postgres ODBC driver, and they'll have the same interface, but underneath the covers each is communicating through its proprietary wire protocol with the targeted database system. Now, the standard specifies what the data should look like coming out of the driver into the application, but what the database system sends you may not actually match up with that, right? Like, you could ask for an integer and assume in the application it's gonna be a 32-bit value, but for whatever reason the system stores everything as 64-bit values. So it's the driver's responsibility, when it gets that 64-bit value, to convert it into the correct format that the application expects, and this again is defined in the standard. The driver can do a bunch of other things too, right? There are the sort of standard calls like connect to the database, disconnect, send queries, get results, things like that, but the driver can also emulate features that the database system doesn't actually have but that are required by the standard.
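To make this concrete, here's a minimal sketch of what a C++ program using the ODBC C API looks like; the DSN, credentials, and query are hypothetical, and error handling is omitted. The fetch loop at the bottom is the row-at-a-time iterator pattern that comes up next.

```cpp
// Minimal ODBC sketch (hypothetical DSN "mydb" and query; no error handling).
#include <sql.h>
#include <sqlext.h>
#include <cstdio>

int main() {
  SQLHENV env; SQLHDBC dbc; SQLHSTMT stmt;
  SQLAllocHandle(SQL_HANDLE_ENV, SQL_NULL_HANDLE, &env);
  SQLSetEnvAttr(env, SQL_ATTR_ODBC_VERSION, (SQLPOINTER)SQL_OV_ODBC3, 0);
  SQLAllocHandle(SQL_HANDLE_DBC, env, &dbc);
  // The driver named by the DSN is what speaks the vendor's wire protocol.
  SQLDriverConnect(dbc, NULL, (SQLCHAR *)"DSN=mydb;UID=user;PWD=pass;",
                   SQL_NTS, NULL, 0, NULL, SQL_DRIVER_COMPLETE);
  SQLAllocHandle(SQL_HANDLE_STMT, dbc, &stmt);
  SQLExecDirect(stmt, (SQLCHAR *)"SELECT id FROM foo;", SQL_NTS);

  SQLINTEGER id; SQLLEN ind;
  // Row-at-a-time iteration: the cursor model discussed below.
  while (SQL_SUCCEEDED(SQLFetch(stmt))) {
    SQLGetData(stmt, 1, SQL_C_SLONG, &id, 0, &ind);
    printf("%d\n", (int)id);
  }
  SQLFreeHandle(SQL_HANDLE_STMT, stmt);
  SQLDisconnect(dbc);
  SQLFreeHandle(SQL_HANDLE_DBC, dbc);
  SQLFreeHandle(SQL_HANDLE_ENV, env);
  return 0;
}
```

The application only ever sees these standard calls; whatever vendor-specific packets go over the network are entirely the driver's business.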
So let's say your database system doesn't support cursors. Think of a cursor as just an iterator: I can tell the database system to open up a cursor on a query, and what I get back is a handle to that cursor, and then I can call get next, get next on that handle to get each row one by one. By default, if I send the SQL query without a cursor, I get back all the results at once, which may not be what I want. So not every database system will support cursors, but you can emulate that inside the driver, right? Because again, everyone has to expose the same API. So again, this is what Microsoft came up with; this is what you would use in the C/C++ world. If you're in Java, the equivalent is called JDBC, the Java Database Connectivity library. This was developed by Sun in the late 1990s, because they recognized that if they wanted Java to be running in enterprise settings, enterprise applications would want to communicate with databases, so you need a standard database API to do this, right? Again, at a high level it works exactly the same as ODBC: a standard interface that all your Java programs can access the database through, but then underneath the covers there's whatever vendor-specific wire protocol handling embedded in the driver. So for JDBC there are actually four different ways to implement this. Again, at a high level it's the same thing as ODBC, but how it's actually implemented can be slightly different. The first approach is that you don't have a way to communicate with your database directly from Java, so you just provide a little bridge that routes whatever JDBC calls you have in Java to the ODBC driver running on the same machine, right? The next approach is to have the JDBC driver communicate directly with the native C/C++ API of the database system itself. Think of an embedded database: if I want to communicate with SQLite running directly in my process, I can make calls to its native API and not worry about going over the network. The next approach is to have a middleware system that takes all the JDBC calls that come out of the JDBC driver and converts them into the vendor-specific wire protocol for the target database system. The way to think about this is that I have my Java application in one process and I run this middleware in another process, and the driver communicates with that middleware through some JDBC-specific protocol, and then the middleware communicates with the database system. It's sort of an extra hop along the way, because the driver can't communicate directly with the database system. The last one is probably the most common one, and in my opinion the one you actually want, because I think it will get you the best performance: you have an implementation of the JDBC library and the wire protocol for your database system written entirely in Java. You plop down a single jar file that has everything written in Java for you. The reason why I think this is the better one is that it reduces the amount of copying you're doing going between these different levels. The native-API approach requires you to be running in the same process. The middleware approach has an extra copy, an extra hop, to get through the middleware, whereas the pure Java driver can go directly to the database. So that first one, the JDBC-ODBC bridge, is actually not supported anymore; they got rid of it in JDK 1.8. It was a pain in the ass to set up too.
I remember trying to set this up for TimesTen from Oracle, and that was a nightmare. So again, the pure Java driver is what pretty much everyone does. So as I said, the wire protocol itself is going to be specific to the actual database system. They're not universal. JDBC provides a standard interface, but what packets are actually sent is specific to each database system. To the best of my knowledge, every single database system communicates its wire protocol over TCP. I'm not aware of any database system that uses UDP. And if you think about it, it would be kind of weird, especially if you're doing transactions, because you have no guarantee whether your UDP packet showed up, and so you try to commit a transaction and the message got lost. That would be bad. So for that reason, everyone uses TCP. So the standard setup is that the client starts up and connects to the database system. It goes through the authentication process: username, password, all that kind of stuff. You may be doing this encrypted over SSL; I think that's the default in most systems now. Then you send some query. The database executes that query. Then we serialize the results and put them into packets that we can send back to the client, who then knows how to convert them, if necessary, into the form that the access API expects. If ODBC expects the data to look a certain way, we have to convert and transform the data we got back from the database system to conform to the ODBC standard way of representing data. So what we care about today, what the paper I had you guys read is about, is this part here: serializing the result. Because this is actually gonna be the main bottleneck, the main thing that causes performance issues for the type of workload the paper was discussing: trying to get data out of the system. Because when you think about it, the connection setup is unavoidable, and who cares? It's one back and forth to say I've authenticated, and we can encrypt that, that's fine. Even if we wanna do something fancy with Kerberos, it's not on the critical path for executing queries, because we're gonna open up a connection and it's gonna stay open for a long time. So who cares? Then the client sends us a query, and that's gonna be what? A SQL string in the worst case; in the best case, maybe a stored procedure or a prepared statement handle. So that's small too, right? Most SQL queries are a couple kilobytes. I know some major companies that have queries that are like 10 megabytes, right? Why? Because it's some dashboard where you can select various options, like here's all the people that live in this state or in this zip code. So you have these WHERE clauses with giant IN predicates, like state IN and then you list every state, so you can have a lot of this and then it's like 10 megs. But even then, there's not really any magic we can do to make that work efficiently other than compress it with Snappy or GZIP, because it's just text, it's just a string. So there's no magic there. It's really this part here, right? That's what we wanna focus on. That's what the paper you guys read was trying to make better. The other thing I'll say is that all the major database systems implement their own proprietary wire protocol.
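To make "serializing results into packets" concrete, here's a rough sketch of Postgres-style message framing, where each message is a one-byte type code followed by a four-byte big-endian length (which counts itself plus the payload) and then the payload. This is a simplified illustration, not a complete implementation; note the htonl() call, which handles the endianness question we'll come back to later.

```cpp
#include <arpa/inet.h>  // htonl
#include <cstdint>
#include <string>
#include <vector>

// Frame one Postgres-style message: type byte + int32 length + payload.
// As in the real protocol, the length field counts itself plus the payload.
std::vector<uint8_t> FrameMessage(char type, const std::string &payload) {
  std::vector<uint8_t> buf;
  buf.push_back(static_cast<uint8_t>(type));
  uint32_t len = htonl(static_cast<uint32_t>(payload.size() + 4));
  const uint8_t *p = reinterpret_cast<const uint8_t *>(&len);
  buf.insert(buf.end(), p, p + 4);
  buf.insert(buf.end(), payload.begin(), payload.end());
  return buf;
}

// e.g., a simple query message ('Q' in the Postgres protocol) carrying a
// null-terminated SQL string:
//   auto msg = FrameMessage('Q', std::string("SELECT 1;") + '\0');
```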
That's certainly true for every single big database vendor that you know about: Oracle, DB2, Microsoft, Postgres, MySQL, all of these guys have their own proprietary wire protocol. But in recent times, in the last 10 years or so, we've seen a lot of new database systems or database startups where, rather than reinventing the wheel and coming up with their own wire protocol from scratch, they actually just reuse existing ones, in particular MySQL's and Postgres'. So the basic idea is that rather than making your own wire protocol yourself for your database system, you just go figure out what Postgres and MySQL do, and you speak their wire protocol, right? I'm not saying those wire protocols are great, and the paper you read showed that the Postgres one was not good, but it works, and you get all of the client-side ecosystem of those database systems for free if you support their wire protocol. So think about this: I'm a new database startup, I'm worried about surviving, I wanna make money. Do I wanna spend my time writing a wire protocol? No, I'll just take what Postgres has, then I get all of their client drivers, the JDBC and ODBC drivers, for free. I don't have to implement those myself. And then it's also a nice selling point for customers: you can say, look, you have Postgres now, get rid of that and plop my thing in, and you don't have to change any of your application code. Now, obviously that's not 100% true, because the wire protocol may be the same, meaning the packets themselves look the same and we know how to interpret those packets correctly, but what's actually inside the packets could be different. If I don't support Postgres' SQL dialect, then my system will look and smell like Postgres, but it's not actually gonna work, because I'm gonna get queries that Postgres could handle but I can't. The other thing that'll happen too is that, beyond just the SQL dialect, there are a lot of applications, especially administrative applications, where the very first thing they do when you connect them to the database is go look in the catalog and see what tables there are. Like, if you use some of the visualization tools like MicroStrategy or Tableau, in order for them to show you the list of tables that you can do visualizations on, they go look in the catalog, like pg_tables in Postgres, and ask what tables there are. So even though you support the wire protocol, if they go fishing around in the catalog and don't see things that should be there, then they're gonna crap out, because it's not gonna be true Postgres. So it's not enough to just say you implemented the wire protocol. If you wanna be able to say you're a drop-in replacement for and truly compatible with these other open source systems, you gotta support all this other crap. And this also only gives you the logical or client-side stuff. If you have tools that look at the physical files of the database system itself, then those will be completely different in your new system. So as I said, the two most common wire protocols that people copy are MySQL's and Postgres', which makes sense, because these are probably the two most widely deployed systems. They're not making the most money, Oracle makes the most money, but there are probably more MySQL and Postgres installations than Oracle. I can't prove that, but it's probably true. No, do not restart Windows.
All right, sorry. So for MySQL, some notable ones are MemSQL, Clustrix, ActorDB, TiDB, which is out of China, Bedrock, I forget who makes this one, and Amazon Aurora. Aurora is special; let me talk about that in a second. But in the case of MemSQL, when they first announced the startup in like 2009, or maybe 2010, they came out of the gate supporting the MySQL protocol. Now, the way they actually did it was they replaced InnoDB with their new engine. So they still had the MySQL front end; that part was still MySQL. Then when they got rid of all that and wrote everything from scratch, they made sure they kept the MySQL protocol, because again, that's a good selling point. They can say to their customers: you're running MySQL now and you're hitting bottlenecks, give us money, we'll lift out MySQL, plop in MemSQL, and everything still just works. So I think from a business standpoint, that was a good idea. For Postgres, there's a bunch of systems like Redshift, Greenplum, and Vertica. These are actually derived from Postgres, they're forks of Postgres, so of course they kept the front-end stuff. In the case of Hyper, Cockroach, our old system, Yugabyte, and Crate, as far as I know, these are fresh re-implementations of the wire protocol. We did that in our code and then Cockroach did the same thing. So Aurora is a slightly different beast, because they were sort of like what I said about MemSQL or Vertica and Redshift, in that they were forks of the existing systems. I don't know how much of the original MySQL and Postgres code in Aurora still exists today, because they're doing some interesting things. In the case of MySQL, I know they're pulling some of the wire protocol logic out of the actual database system itself and putting it up into the load balancer layer above the database system. That's pretty wild. But again, from the outside it'll look and smell like real MySQL and real Postgres, which again is an amazing selling point. Spark SQL, as they talked about in the paper, speaks the Hive wire protocol, and as far as I know nobody else does; I couldn't find anything else. Splice Machine supports Apache Derby, as in it supports Derby's JDBC drivers. A bunch of different systems do different things, and it's usually based on whatever they derive from: unless they implement the wire protocol from scratch, they reuse that system's JDBC libraries. So now we want to talk about how we actually want to design our protocol. The paper I had you guys read was out of the MonetDB group at CWI in the Netherlands. MonetDB was one of the major academic column-store database systems that actually went beyond the walls of the university and is used in production in a bunch of places. And it's an older one too; it's from around the same time as Vertica, like the mid 2000s. And they're still working on it today. So for this, we want to talk about how we're actually going to serialize the data, the result of a query, put it into packets, and send that over our wire protocol. As I said, we don't care about the queries coming in; it's really how we're getting the data out. And we're going to focus on bulk data export, or large result sets. If it's OLTP, each query is only going to return a small number of tuples, a small number of results, so we care, but not that much.
It's when we really want to start exporting large data sets that these things become an issue, and we want to look at alternatives. So we'll talk about row versus column layout, compression, serialization, and then how to handle strings. And then I'll show some benchmarks that they provide. Again, one thing to keep in mind is that whatever optimization we talk about here today, we have to implement in our database server, but we also have to be able to implement it in our client driver. So whatever crazy compression scheme we come up with in the server, the client side has to be able to support it as well, because it has to decompress whatever you send it. A lot of the time the client drivers are quite conservative, because you don't know whether you're running on a big machine or on a cell phone. So traditionally JDBC and ODBC drivers try to be quite lightweight, whereas in some of these schemes, if you do more heavyweight compression, then you have to decompress it on the other side, and that may not be good for your environment. But again, they're focusing on large data exports. You're not gonna download one terabyte of data onto your cell phone through SQL; that'd be stupid. All right, so the first one is row versus column. ODBC and JDBC, by the very nature of how these APIs are defined, are row oriented. You open up a connection to the database, you run a query, and then you get a cursor back, an iterator, and you have a while loop that calls fetch next row, and that gets you one row at a time. This is partly because they were designed in the 1990s, when we weren't doing machine learning and other things; we were doing business applications or pretty simple data analysis, where we run a transaction and go get one tuple at a time, or a small number of tuples at a time. So this API totally made sense for that class of workloads, which is an important and very common class of workloads, but for data export it's not gonna be ideal. So all of the storage model trade-offs we talked about last time for rows versus columns still apply here. For analytical queries, we wanna try to process vectors of columns as much as possible. For one, that's what these modern data science applications actually want: TensorFlow, Spark, PyTorch, these machine learning frameworks want things to be in vectors and matrices and columns, and if we had to transform the data into rows, that would be expensive. So if we target what we think the application is gonna do with the data, that tells us we probably wanna send it in columns. But also, when we do compression and other methods of storing things more efficiently in packets, having all the values within some stride of memory in the packet be part of the same domain means I can get better compression, because I can do RLE or delta encoding, all the stuff that we talked about before. So many techniques for compression, which we'll see on the next slide, can still be applied here if we organize things in columns, because again, that's gonna be better for what the application wants.
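Here's a small sketch of the difference, using a hypothetical two-column batch. Serializing row-at-a-time interleaves the columns, which is what the ODBC/JDBC cursor model implies; serializing column-at-a-time keeps each attribute contiguous, which is what encodings like RLE and delta want.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical batch of a two-column table: (int32 id, int64 amount).
struct Batch {
  std::vector<int32_t> ids;
  std::vector<int64_t> amounts;  // same length as ids
};

// Row-oriented: interleave the columns tuple by tuple.
void SerializeRows(const Batch &b, std::vector<uint8_t> *out) {
  for (size_t i = 0; i < b.ids.size(); i++) {
    const uint8_t *id = reinterpret_cast<const uint8_t *>(&b.ids[i]);
    const uint8_t *amt = reinterpret_cast<const uint8_t *>(&b.amounts[i]);
    out->insert(out->end(), id, id + 4);
    out->insert(out->end(), amt, amt + 8);
  }
}

// Column-oriented: each column is one contiguous run, so values from the
// same domain sit next to each other and compress better.
void SerializeColumns(const Batch &b, std::vector<uint8_t> *out) {
  const uint8_t *ids = reinterpret_cast<const uint8_t *>(b.ids.data());
  const uint8_t *amts = reinterpret_cast<const uint8_t *>(b.amounts.data());
  out->insert(out->end(), ids, ids + b.ids.size() * 4);
  out->insert(out->end(), amts, amts + b.amounts.size() * 8);
}
```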
So the way they recommend you actually do this is not to send, within one packet, data for only a single column, which is what you would have in a true column store. Instead, within one batch or one packet, you send back a batch of rows, but you organize them in a column-oriented fashion. That's essentially how we implement storage in the new system we're building now. This is called the PAX model. Every block has all the data for a group of tuples, but the data within the block is laid out in a column-oriented fashion. So we're not putting all the attributes of one tuple in contiguous memory; instead, a single attribute across all the tuples in the block is stored contiguously. That's what they recommend doing here to get better performance for these data analysis platforms. So once we do that, now we can also do compression. And again, this is the same thing we talked about for storage. We can either do naive compression, like GZIP or Snappy or Zstandard, whatever your favorite compression algorithm is, or we can do columnar-specific encodings, where we take advantage of the fact that we're storing things as vectors: the values are all gonna be in the same domain, and then we can do RLE, delta encoding, or dictionary encoding. They end up arguing in the paper that naive compression is actually the better approach for network packets, which is different from what we argued for storage inside the database. And that's because it's agnostic to the layout of the actual data. In the networking layer, as you're constructing these packets to send back to the client, you don't have to look at the schema and figure out, all right, my data is this type and this length and I have this many values, and then decide on the fly what the best compression scheme is. You just serialize all your data into the packet, run GZIP or Snappy on it, and then you're done. It's super easy. So they argue that from a software engineering standpoint, it's just easier to do naive compression, because you don't have to implement all this extra stuff. Furthermore, as I said before, if we use columnar compression, we have to implement that on the client side too, right? And that means we're basically implementing this logic in two locations, actually three locations if you want to do ODBC plus JDBC. So you just do naive compression; on the client side there'll be libraries you can use to decompress things, and then you're done. I think I agree with that. The other thing to point out, and this is sort of obvious, is that the slower the network is between the server and the client, the more heavyweight a compression scheme you want to use. Because if you're bottlenecked on bandwidth, then you're willing to pay the CPU cost of a more aggressive compression scheme that gives you a better reduction in the amount of data you're sending, right? So again, that's a no-brainer; that's the standard trade-off we talked about before.
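The appeal of the naive approach is easy to see in code: you compress the serialized packet body wholesale, with no knowledge of the schema or layout. A minimal sketch, assuming the Snappy C++ library:

```cpp
#include <snappy.h>
#include <string>

// Naive (schema-oblivious) compression of an already-serialized packet body.
// No need to inspect types, lengths, or layout: just compress the bytes.
std::string CompressPacket(const std::string &serialized) {
  std::string compressed;
  snappy::Compress(serialized.data(), serialized.size(), &compressed);
  return compressed;
}

// The client driver does the mirror image before parsing the packet.
bool DecompressPacket(const std::string &compressed, std::string *out) {
  return snappy::Uncompress(compressed.data(), compressed.size(), out);
}
```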
All right, the next issue is how we're actually going to represent the data we want to send back in our packets. The two approaches are binary or text encoding. Binary encoding means laying out the data the same way we would organize it inside our database: the primitives are stored the way C++ stores them, and for reals we can use fixed-point or floating-point numbers, right? All the same things we said about storage apply here; we represent the data in a binary form. Now, the thing we have to be careful about is that we're sending data outside of our system to some other machine or client, so we need to be mindful of the endianness of our data and make sure that the client and server are in sync. Because it would be bad if the server sends over packets and the server's big-endian, and they land on the client and the client's little-endian and can't handle them correctly. You could handle this in either the server or the client, but it's better to do it in the client, because then the server doesn't have to worry about what the client actually is. The wire protocol specifies what the endianness will be, and the client needs to know, I'm on this machine and my endianness is different from what I got, so I make sure I flip things around, right? In practice, this is not really a big deal, because most machines are x86. Yes, your phone will be ARM, or there might be some POWER machines, and in those cases you have to handle it, but it's probably not that common. The other thing I'll say is that the closer the internal binary format the database uses is to what your wire protocol expects, the less work you have to do to take the result of a query and put it into packets. We had this problem in the old Peloton system, because how Peloton represented data in the actual database itself was different from what the Postgres wire protocol expected, and we implemented the Postgres wire protocol. So we had to copy the data at least twice, probably three times, which was a mistake, to transform it into the format that Postgres expected, because Postgres represents data in the real Postgres system completely differently from how we were doing it. So if you can match what the wire protocol uses with what you use internally in the database, then generating the packets is much quicker; there's less overhead. Now again, if you're copying the MySQL or Postgres wire protocol, that mismatch is unavoidable; otherwise, you can make your wire protocol's binary format match whatever is specific to your system. Another alternative is to rely on existing serialization formats built for general-purpose distributed systems, and just reuse those inside the database system to store the data in packets. So if you've ever heard of Google Protocol Buffers or Thrift, these are serialization libraries. You define a schema for what a message looks like, here's a bunch of primitives, here's some message structure, and then they have a compiler that generates C++ code or Java code, whatever language you're writing in. They generate code that allows you to construct these messages and serialize them into a compact binary form. So in the case of Hive, in the paper you guys read, the Hive protocol uses Thrift.
In practice, I don't know of any other major database system that actually uses one of these two, because there's a bunch of extra stuff they do and store in the messages, beyond the actual data, that we don't want in our database system; we already know what the schema is gonna be, because in a database we have that. And there's other overhead, like the extra copies they do to get the data into the byte buffers they expect, that we don't wanna pay. In the case of Protocol Buffers, in addition to the actual data we would send over, they also include schema version information to keep track of what version of the message structure this is, because they're designed for disparate distributed systems, not database systems. The other approach, the alternative to binary encoding, is text encoding. And this is the simplest thing to do: you take whatever data you wanna send over and you convert it to a string. Then on the client side, they just know how to do the reverse: they take the string that you sent them and put it back into its correct primitive form. In C, converting between strings and integers is just atoi() and friends. So it looks sort of like this: say I wanna send over a 32-bit integer, four bytes, with the value 123456. I convert that literally to the string with the characters '123456' and send it over the network, right? Then on the client side, it knows how to convert that into the binary form it expects to hand off to the JDBC or ODBC layer. What's nice about this is that you don't have to worry about any of the endianness stuff, because the client just takes care of that for you, right? The downside is you end up sending more data than you would in the binary form. Again, this is a 32-bit integer, 123456, so we can represent it as exactly four bytes. If I send the string over, there are six characters in the string, so that's at least six bytes right there. But I also need to store either the length of the string or a null terminator character to say where the string stops, and that's gonna be some extra bytes. So this would be at least six bytes, probably more. What I could have represented as four bytes, I now represent as six or seven bytes. That may not be a good trade-off. So, is this clear? All right, the last thing to talk about is how you handle strings. There are three approaches to do this, and this is the same stuff we talked about with variable-length storage in databases in general. For a variable-length string, you can denote where it ends with a null terminator character. So on the client side, as I'm parsing my packet and I have a string field, to figure out where it ends I keep scanning until I find the null terminator. The alternative is to prefix the string with its length, which is what we do in our internal system for varlen values. That allows you to figure out where a field stops without scanning. And the last approach is fixed-width strings, where you take the max size of the string, which you know because we have a schema and we know what the attribute looks like, and you pad out the trailing characters with spaces so that every value is exactly the max size, right?
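Here's a quick sketch of what the three encodings look like on the wire for the string "abc" with a hypothetical schema max length of 8; the helper names are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Three ways to put the string "abc" (schema max length 8) in a packet.

// 1. Null-terminated: "abc\0" -> 4 bytes; the client scans for the terminator.
void AppendNullTerminated(const std::string &s, std::vector<uint8_t> *out) {
  out->insert(out->end(), s.begin(), s.end());
  out->push_back('\0');
}

// 2. Length-prefixed: 4-byte length then "abc" -> 7 bytes; the client jumps
//    straight past the field without scanning.
void AppendLengthPrefixed(const std::string &s, std::vector<uint8_t> *out) {
  uint32_t len = static_cast<uint32_t>(s.size());
  const uint8_t *p = reinterpret_cast<const uint8_t *>(&len);
  out->insert(out->end(), p, p + 4);
  out->insert(out->end(), s.begin(), s.end());
}

// 3. Fixed-width: "abc     " -> always 8 bytes (assumes s fits); trivial to
//    parse, and the run of pad spaces is easy prey for a naive compressor.
void AppendFixedWidth(const std::string &s, size_t max_len,
                      std::vector<uint8_t> *out) {
  out->insert(out->end(), s.begin(), s.end());
  out->insert(out->end(), max_len - s.size(), ' ');
}
```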
So the paper shows that in different workloads and different scenarios, sometimes one of these is faster, sometimes another. No single approach is better than the rest, and different databases do different things, but nobody implements more than one. No one's gonna spend the time to alternate between them based on the few bytes or bits they can save per packet; they just pick one and live with it, right? The fixed-width one was fastest in their experiments when the max size of the attribute was really small. Like, if it's one character, then it beats anything else, right? But that's not always that common. All right, the last thing I'll point out is that these four design decisions we have to consider are not independent. In particular, the easiest one to reason about is compression. Depending on how I represent my strings, that may determine whether one compression scheme is better than another, and it depends on what the actual data you wanna store looks like. If I'm using fixed-width strings and my max string size is 128 characters, but most of my strings are two characters, then I'm gonna have all of these pad spaces after them. That's gonna be amazing for standard Snappy compression, because it can collapse all those space characters. So that might make fixed-width work really, really well, but in other scenarios it might perform poorly and another approach might be better. So the main takeaway is that these decisions aren't independent, there are different trade-offs for each of them, and I don't think it's possible to say that one implementation is better than all the others. All right, so real quickly, let's look at some benchmarks. These were provided to us by the author Hannes at CWI. I'm gonna show two results: one where we only transfer one tuple, and another where we transfer a million tuples. For this, they use the lineitem table from the TPC-H benchmark, and they use ODBC for all these systems except Hive, where they use JDBC. So it's quite an awesome mix of systems that they have: they have MonetDB, which is what they work on, then they have MySQL with compression and without compression, Postgres, and then the commercial guys, Oracle, MongoDB, and DB2. And then Hive, again, is representing the sort of JDBC thing. So across the board, what you see is that MySQL actually performs the best here. To go grab one tuple, it does quite well; even with compression, it does better than everyone else. And I forget exactly why they claimed it performed better. I think it wasn't so much what the actual packets look like, but the networking stack itself, how they get data in and out of the system. The other surprising thing: for Hive here, I don't know whether it's slow because it's Java, or because there are some issues with HDFS, or the protocol itself is just super slow. But what actually surprised me is how bad DB2 was, because this is a commercial system; they have money, they could fix these things.
To narrow the scope of the measurements to just getting data out and serializing it, the way they set it up was to run the query multiple times on the server, so that the system would cache the query plan and, in some cases, the actual result. So we're not measuring things like going through the SQL parsing layer and then running the optimizer to generate a query plan; all of that is cached ahead of time. It's really about how fast we can get data in and out. All right, the other thing to point out is that MonetDB here is using text encoding, but all these other guys are using binary encoding. You would think text encoding would be terrible, because it seems like a really dumb thing to do, but in practice, for them, it actually works quite well. These other guys are doing binary encoding, and they're the worst. So now let's look at the same experiment setup, but instead of grabbing one tuple, we're going to grab a million tuples, and we're going to scale the network latency along the x-axis here. They purposely make the network slower and see how that affects the performance of the system, how long it takes to get data out. So I want to focus on MySQL with compression. There's no way to get PowerPoint to show, I guess I've got to reorder them up there, but there was no way to show MySQL first without showing this other line, so ignore the black line; it follows the same trend as everyone else. The thing I want to point out in the case of MySQL is that with compression, compression ends up being the main bottleneck, because they're using GZIP, which is a slow compression scheme. So the CPU overhead is essentially the main issue here, and that's why the line is essentially flat no matter how slow the network gets. I mean, it gets a little bit slower, but not by much, because the size of the data you're compressing and sending over is the same; it's the same CPU cost at every latency, and that's the penalty you're paying. Everyone else gets slower as the network gets slower, because they're not doing compression. The interesting thing to point out is that DB2 ends up being the worst, and Oracle is actually the second best when the network's really fast, but ends up being the second worst when the network's really slow. I forget why they said this was the case; I don't remember. Yeah, so that's one finding I don't have an answer to. So what can we take away from this? There's no one protocol that's better than another. Compression you probably want to apply selectively, based on what data the query is going to return. Doing this would be pretty simple: in the packet you say whether something's compressed or not, and when you're shipping over a lot of data, you can decide whether you want to compress it. But you probably don't want to use GZIP; something more lightweight like Snappy or Zstandard would be the better way to go. So, any questions about wire protocol design? Again, this is focused on transferring a lot of data. If it's a single packet or a single tuple, we should go find out whatever MySQL does and do the same thing, right? Okay, so for all of these examples here, as I said, they're communicating with the client over TCP, and the way you do that is by going through the operating system.
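For reference, here's what the conventional path looks like: a minimal sketch with a hypothetical address (and Postgres' default port, just for flavor) and no error handling. The key point is that every send() is a syscall, and the kernel copies your bytes into its own socket buffer before the NIC ever sees them.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstddef>

// The conventional TCP path. Each send() traps into the kernel, which copies
// the bytes into its own socket buffer before handing them to the NIC.
void SendPacket(const char *data, size_t len) {
  int fd = socket(AF_INET, SOCK_STREAM, 0);  // syscall
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_port = htons(5432);
  inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
  connect(fd, reinterpret_cast<sockaddr *>(&addr), sizeof(addr));  // syscall
  send(fd, data, len, 0);  // syscall + copy into the kernel's buffer
  close(fd);
}
```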
So they're all going to pay the same penalty no matter how they form their packets, whether they're using compression or not. All of that happens up in user space, but at the end of the day, we need to put packets into buffers and send them over the network through TCP, which means we have to go through the OS, right? The problem is that the OS, as I said, is slow. I should be careful about this: the TCP stack is slow, right? Because the way it works, messages get in and out through context switches, which are expensive. You get an interrupt to say, hey, your packet's here, or your packet's being sent, and that means we drop down into the kernel. The kernel takes its own latches to protect its data structures, and again, we're doing context switches, and that's not cheap. The other issue is that in order to send a message, the OS wants to maintain its own memory. When I hand it a packet to send over the network and then go back up into my program, it doesn't want me to mess around with that memory, maybe deallocate it, before it actually gets onto the hardware device and out over the network. So the OS maintains its own buffers for the Ethernet cards, the NICs, and that means it has to copy whatever data you send it into those buffers before it can send it. That's a bunch of copying that we don't want to do, that we want to try to avoid. And of course, as I said already, the OS is taking its own latches on its own data structures, and we can get bottlenecked on that if we have other contending threads coming in. So since the OS is our enemy, we want to try to avoid it, right? It's like my parents: whenever they were around, I tried to avoid them. All right, so to avoid the operating system, we can use a technique called kernel bypass. The basic idea here is that our database system's process is allowed to interact with the hardware directly, to send and receive messages over the network, without having to tell the operating system about it and without doing any extra copying. We're literally going to have memory buffers that we share with the hardware device itself; we can read data out of them, and we can put data into them to send over the network. The advantages should be obvious, right? We're doing less copying, we're not going through the OS for any of this, so we're going to get much better performance. The two ways to do this are to use DPDK, the Data Plane Development Kit, or to do remote direct memory access. I didn't really know how to categorize these, but I'll say DPDK is an actual thing, a library you can download called DPDK. Remote direct memory access is more of a technique or a method, and there are libraries that provide RDMA for you. At a high level they're doing different things; just be mindful that one is a thing you can download, and the other is a thing where you download something else that implements it or provides it. All right, so DPDK was originally a library written and provided by Intel that allows your program, your database system running in user space, to access the NIC hardware directly. The basic idea is that it exposes, through this library, buffers that are on the NIC; you fill them in with packets you want to send and pass them along to the hardware itself, and there's a mechanism to get notified when things show up that belong to your process that you want to read.
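Here's a rough sketch of what the receive side looks like with DPDK. It assumes all the setup that's elided here (rte_eal_init(), creating an mbuf pool, configuring and starting port 0) has already happened, and it's illustrative rather than complete; it also obviously needs DPDK-capable hardware to actually run.

```cpp
#include <rte_ethdev.h>
#include <rte_mbuf.h>

// Hypothetical receive loop for port 0, queue 0. Setup (EAL init, mempool,
// rte_eth_dev_configure()/rte_eth_dev_start()) is assumed to have happened.
void PollPackets() {
  struct rte_mbuf *bufs[32];
  for (;;) {
    // Poll the NIC's RX queue directly from user space: no syscall,
    // no interrupt, no copy into kernel buffers.
    uint16_t n = rte_eth_rx_burst(/*port_id=*/0, /*queue_id=*/0, bufs, 32);
    for (uint16_t i = 0; i < n; i++) {
      // ... parse the wire-protocol bytes via rte_pktmbuf_mtod(bufs[i], ...)
      rte_pktmbuf_free(bufs[i]);
    }
  }
}
```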
So this is not a magic library that you just download and link into your application; you actually have to go rewrite your application to use the DPDK calls to send and receive messages, instead of making syscalls to Berkeley sockets in the operating system. You have to refactor your code in order to take advantage of this, right? Think of it as a bare-bones programming library for accessing the bare-metal hardware on the NIC. There's a bunch of stuff the OS would normally do for you that you now have to do yourself, right? There aren't many systems that actually use this; the only one I'm aware of is called ScyllaDB. ScyllaDB is a re-implementation of Cassandra in C++. Cassandra is in Java; they basically rewrote everything, the wire protocol but also the actual storage system itself and the execution engine, all of that's been rewritten in C++. And one of the ways they get better performance is that they have this library based on DPDK called Seastar that supports this kernel bypass method. They built that library, and then on top of it they built ScyllaDB. They also have a version of memcached that runs on Seastar, but ScyllaDB is their main product. So this sounds amazing, but as I said, it's not trivial to write unless you're using an already-written framework like Seastar. I also don't know how portable it actually is. Originally it was only for Intel NICs, right? Because, again, they were trying to sell their hardware. Intel then handed off development to the Linux Foundation, so now they own it. I don't know how many other vendors actually support it; we could just Google this. I don't know whether you can run this on EC2, right? You do have to buy NICs that actually have this capability. So it's not universal; your laptop's not gonna be able to do this. To give you an idea of how difficult this can actually be, I saw this great tweet a few years ago from somebody talking about SPDK, the Storage Performance Development Kit. It's like DPDK, but for communicating with disk drives, same idea of doing kernel bypass. And he basically says the kernel bypass methods are like peeing your pants to keep yourself warm: it seems like a good idea at the beginning, but you regret it later on, right? I think it's a really apt metaphor for this. The other kernel bypass method is called remote direct memory access, or RDMA. The way to think about this is that it's a library you can get that allows you to reference memory locations stored on other machines, almost as if you were reading memory that's local to you. You do an access or a load on that memory, and underneath the covers the hardware intercepts it, recognizes that the data you're trying to access is on another machine, goes down through the NIC, out over the network, gets the data you want, and does your reads and writes. For this one, I don't know how much of your application you have to change.
I think there's some stuff you have to be aware of, because you have to know what memory address you want to read on the other machine, and that server has to allow remote programs to access it. So again, it's not like you just link it in and it magically happens. The other tricky thing about this is that the server you're reading from via RDMA doesn't know you're doing anything. It doesn't know you're reading and writing. There's no callback mechanism to say, I just got read by somebody else. So there's extra work you have to do to know when someone has read something that you wanted them to read. There's more to it than just linking in a library or extending your address space with remote memory. The most famous system that uses RDMA, and again, you have to buy specialized, expensive hardware that supports RDMA, like InfiniBand, have you ever heard of that? Mellanox sells it; it's very expensive. I think you can also do RDMA over Ethernet; I don't know how widely supported that is. But Oracle RAC is probably the most famous one, in Exadata. They rely on expensive InfiniBand hardware to do RDMA to other machines and storage nodes. Microsoft has this interesting research system, I don't know if it's actually in production, called FaRM. And this is pretty crazy: in order to get RDMA to work in a transactional environment, they have to do a four-phase commit, because again, you don't know when somebody reads and writes data. There's a bunch of extra steps you have to do to figure out when you're allowed to commit your transaction. So again, the main takeaway from these is that the OS is gonna be in the way. No matter how optimized we make our wire protocol, if we're still going through the OS, it's not gonna be as performant or efficient as we would want, okay? All right, so I rushed through this very quickly because I wanna get to the second project, but again, it's everything I already said. The networking protocol is an important aspect of a database system. We're implementing the Postgres protocol here, and for OLTP it's fine; for OLAP queries it may be problematic. Kernel bypass can improve efficiency, but you probably wouldn't see kernel bypass methods used for the wire protocol stuff; you'd primarily see them for the internal messaging between nodes in a distributed database. But there's no reason you couldn't do it for a wire protocol, okay? All right, project two. So as I said at the beginning of the semester, this is a group project, and it's a large portion of your final grade for the course. There's a reason why this counts as a systems elective. The expectation is that everyone's gonna implement a large component or feature in the database system that we're building here at Carnegie Mellon. The projects you can target could be the things I'm gonna present today, things we're gonna focus on in the course, or if there's some alternative thing that you really wanna build because it's related to your own research or whatever you're doing in another class, then I'm game for that, but you obviously have to get approval from me first, right? The other important thing is that every group has to do something unique.
So again, the main takeaway from these is that the OS is gonna be in the way: no matter how optimized we make our wire protocol, if we're still going through the OS, it's not gonna be as performant or efficient as we would want, okay? All right, so I rushed through this very quickly because I wanna get to the second project, but it's everything I already said. The networking protocol is an important aspect of a database system. We're implementing the Postgres wire protocol here; for OLTP it's fine, but for OLAP queries it may be problematic. And kernel bypass can improve efficiency, but you probably wouldn't see kernel-bypass methods used for the wire protocol itself. You'd primarily see them for the internal messaging inside a distributed database, though there's no reason you couldn't do it for a wire protocol, okay?

All right, project two. As I said at the beginning of the semester, this is a group project, and it's a large portion of your final grade for the course. There's a reason why this counts as a systems elective. The expectation is that everyone's gonna implement a large component or feature in the database system that we're building here at Carnegie Mellon. The projects you can target could be the things I'm gonna present today, things we're gonna focus on in the course, or, if there's some alternative thing that you really wanna build because it's related to your own research or something you're doing in another class, I'm game for that, but you obviously have to get approval from me first, right? The other important thing is that every group has to do something unique. You can't have two groups doing the same project, because that'd be stupid. Everyone has to pick something different from everyone else, and it'll end up being first come, first served. So if you're dying to do something, make sure your group agrees to it and sign up for it right away, okay?

All right, so what do you have to do for this project? What are the deliverables? I'm gonna go through each of these one by one, but basically: first you're giving a proposal to the class to say what you're gonna do. Then you'll have a status update to say, well, we actually started implementing this, here's what we found out, here's some issues we had. Then a design doc that describes what your overall implementation's gonna be. Then we're gonna do two rounds of code reviews: you'll do peer review of the code from other groups and they'll review yours, so that you're not doing stupid things and you get to see what other people's code looks like. Then we'll do the final presentation, and then the code drop to get your final grade. Okay, so I'll go through each of these. On the first Monday after spring break, everyone's gonna come up here with your group and give a five-minute presentation to describe what you plan on doing. We won't record this, so don't feel like you're gonna be embarrassed, right? And for your proposal, you're not just gonna say, hey, here's the high-level thing we wanna do. You actually wanna describe: here are the files we think we're gonna have to modify, or here's what we're gonna have to build because it doesn't exist now, in order to implement our project. This forces you to actually think through: is this a good idea or not? Am I going down a rabbit hole that's just not gonna work out? Like, one group in a previous year said they wanted to implement views, and I asked, what about this? What about this? What about this? And clearly they had not thought about it at all, right? Had they chosen that project, they would have never finished. So part of this is forcing you to figure out what you're actually gonna need to do to implement your project, and then we can scope out whether that's feasible between now and the end of the semester. This next part is slightly less important because we don't support SQL in our system right now, which I'll talk about in a second, but part of the reason I had you write down in the reading reviews what workloads the papers were using is so that you at least think about, for what you're building, what workload you're targeting, and therefore what kind of benchmark you should be writing and what test cases you should be running. Then we do a status update, around the first week of April or so. That's where you come back up here in front of the class and give another presentation about what you've implemented so far. For the proposal, you definitely should look at the code and maybe start writing some of it, but at this point you should be far enough along that you can talk reasonably about how things are going. It's also okay if it turns out that what you were doing was too complicated and you had to switch to something else, but you need to explain to the class why you made that change.
Also, part of the purpose of you coming up in front of everyone else and talking about it, instead of just coming and talking to me, is that there'll be certain features that one group needs and builds, and then it turns out another group needs the same thing, and you can share code. We've certainly had groups do this before, where they were sending pull requests to each other and swapping code and tests and things like that, which is actually pretty cool. You should also mention anything surprising that came up during the process.

Now, what we're doing new this year is requiring you to write a design document that lays out, not in code but in English, what it is you're actually building, why you're building it that way, and what the ramifications or implications of your design are. We'll have a template available for you: a markdown file with all these different sections that describe what's expected for each of them. Basically: how you're gonna build it, why you're building it that way, how you're gonna test it, what issues you can foresee with your implementation, and, if someone were to come behind you and continue your project, what they could work on. Okay? And that's due the same day as the status update, because by then you'll have worked on the project long enough that you can talk intelligently about it.

Then we're doing the code reviews. This will happen twice in the semester: once around the status update, and once the week before the final due date. We'll pair groups up and just use GitHub: you'll submit a pull request onto the main repository, and the other group will review it, comment on it, and provide feedback about the changes you made. To make the other group's job easier, you don't just throw the code up and say, hey, look at it. You provide a write-up: here are the files we modified, here are the things we want you to look at, and here's how it actually works, right? The grading for this part will be based on participation, because I want everyone to participate in all the code reviews. I don't want one person doing the first review and a different person doing the second; everyone has to participate equally in both rounds, right? Because by seeing the same code multiple times across the two rounds of reviews, you'll have a better understanding of what they're doing, and therefore you'll make better comments and give better feedback, okay? I can't stress this enough: this is super important when you go out into the real world. In classes, you write a bunch of code and then the semester's over and you walk away, right? At your job, you're not gonna write a bunch of code and throw it over the fence and assume someone else is gonna take it. You're gonna have to do code reviews; you're gonna have people review your code and give you feedback. And you end up learning a lot too, because you learn other tricks about writing C++ and other techniques you may not have encountered, things that we're not gonna teach you in a class, right? I learn things all the time when I do code reviews. So that's fun. And I'll do a mini lecture in one class on how to actually do a good code review and what's expected. All right, final presentation. It'll be whenever our final exam date is. Do we have a date yet or no?
Sounds like nobody's sure of the exact date and time yet. Last year they gave us an 8:30 a.m. slot, which was ridiculous; if that happens again, we'll try to move it to 11:30 in the morning. And instead of pizza we'll get bagels or doughnuts, whatever, right? We'll do it in this room too, because one year they put us in a big room and I told them not to. All right. So you do a final presentation: basically, here's everything you did. Giving demos is awesome; showing benchmarks is awesome. Not just presenting it, but actually showing that the thing runs, will be really cool. In previous years we had people giving demos of SQL extensions and things like that. So I highly encourage you, for both the final presentation and the status update: give demos. The presentation is just you saying what you've done, though. You don't actually get a final grade until you submit a pull request on GitHub that can cleanly merge into the master branch, passes all the tests and verifications, and provides the documentation, like an updated design document that describes what you actually did. The goal is to merge your code into the master branch. Again, you want this code to live beyond this class. Whether you're gonna hang out with us in the summer doing research, or come back in the fall and do a capstone project as a master's student, we're gonna continue working on this thing, so you want your code to be as useful as possible to others. Now, I realize that as we start merging, that's gonna cause collisions with your pull requests. What we've done in the past is pick a random order, or decide which things we want to merge now and which to merge later. The goal should be that your code can merge cleanly, but if we end up with collisions after you submit your PR, I'll be mindful of that. Also, we'll give you another 50 bucks of Amazon AWS credits. When you submit your PR to our repository on GitHub, that automatically triggers builds on Travis and on the Jenkins build cluster that we have here at CMU; those machines are much nicer than the Travis machines. If you think you need special hardware, which this year I don't think anybody should, but one year somebody wanted some fancy SSDs so we got some for them, let me know. Again, don't run out of money on Amazon.

So now I want to talk about the projects, and I want to give a huge disclaimer. In previous years, this was all based on Peloton, and as I said, we threw away that code because it was a train wreck. So we're not at the state we were in previous years, where we had a fully functioning database system that you could open up a terminal and start sending queries to. We're a work in progress, and we're actively working on it: we have a full-time engineer helping us build this thing, and a bunch of you in the class are helping us build this thing. So these projects are gonna be very narrow and focused, and they're gonna be at the lower levels of the system. If you look at what people did in previous years, it's things like adding SQL functions or sequences or temp tables; all of that is in the upper levels of the system. We just don't have that yet.
So that's why for the first project I had you guys do low-level perf analysis, because that's what you're gonna have to deal with to implement one of these projects. Okay? And again, don't think of this as an adversarial thing. You're not competing against each other; we're all working together. If there's some feature that somebody needs and it doesn't exist, but another team is gonna build it because they need it, we should be sharing that code. Everyone should be talking to each other as we go along. Or if there's something you need that doesn't exist and you don't know how to build it, I can potentially get one of the team members to build it. Okay?

All right, project topics. There are eight groups and I'm proposing here five or ten projects, but I'm open to suggestions if you have other things you're really interested in working on. We're short on time, so I'll go through these one by one quickly. First: query optimization. We have a query optimizer written from scratch. An awesome master's student two years ago built a brand-new query optimizer from scratch that follows the Cascades model; we'll teach you Cascades after spring break. It is, in my opinion, the state-of-the-art implementation: it's what SQL Server uses, and SQL Server has the best query optimizer out there. So we have one already, but we wanna expand it and go beyond what we can already do. In particular, we don't have a good cost model, so we don't have a good way to say that one query plan is better than another. If you're scared of modifying the query optimizer itself, you could just focus on collecting stats and building a better cost model. You could also do query or expression rewriting: take a predicate like WHERE A = 1 AND A = 2; that can never be true, so you can rewrite it to false (see the sketch below). You do that before you get to the optimizer, but it'll use some of the same techniques that we have in our implementation. And then we're also interested in adding support for outer joins.
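As a tiny illustration of that kind of rewrite, here's what folding a contradictory predicate might look like. The expression representation below is made up for the example; it is not our system's actual plan node API:

```cpp
#include <memory>
#include <string>

// Hypothetical expression tree: AND nodes, column-equals-constant leaves,
// and a FALSE literal that the rewriter can substitute in.
struct Expr {
    enum Kind { AND, EQ_CONST, FALSE_LIT } kind;
    std::string column;                  // EQ_CONST: column name
    int value = 0;                       // EQ_CONST: the constant
    std::unique_ptr<Expr> left, right;   // AND: children
};

std::unique_ptr<Expr> Rewrite(std::unique_ptr<Expr> e) {
    if (e->kind == Expr::AND) {
        e->left  = Rewrite(std::move(e->left));
        e->right = Rewrite(std::move(e->right));
        // "A = 1 AND A = 2" can never be true: fold the subtree to FALSE.
        if (e->left->kind == Expr::EQ_CONST && e->right->kind == Expr::EQ_CONST &&
            e->left->column == e->right->column &&
            e->left->value != e->right->value) {
            auto f = std::make_unique<Expr>();
            f->kind = Expr::FALSE_LIT;
            return f;
        }
    }
    return e;
}
```

Once a subtree folds to FALSE, everything downstream gets easier: the optimizer can prune an entire scan or join out of the plan.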
Now, I'm showing this one first because if you choose to do it, in addition to signing up to work on the query optimizer, you also have to send me your CV, because this is what every single database company I ever talk to wants to hire. They wanna hire good engineers, that's fine, but they really want people who know query optimization. I get emails all the time that look something like this: hey, if you know anybody who's a query optimizer person in the Bay Area, we wanna hire them. We don't care about Java and all this other crap; we want query optimization people. There's another guy from another startup who's a bit more profane about his request, which is fine. So think about this: query optimizers were really hot in the '90s, so all the people who worked on them are old now, right? They're not gonna leave and go to startups. But all these new database startups realize they need a query optimizer, so they try to hire somebody. So if you do this, and you do it well, you will get hired, okay? I guarantee that.

All right, the next thing: we wanna add support for creating and dropping indexes. One of my PhD students has been working on adding the Bw-Tree back into our database system so that we can have indexes, so now we wanna add the ability to build an index. So what does that mean? It means doing a scan on a table and populating the index. Dropping an index is pretty easy; it's really the adding that's the hard part, because you wanna be able to do this transactionally without blocking everything. So the goal of this project would be to add support for creating an index: first in a blocking way, where you pause all the threads and then build the index; then in a non-blocking way, where you have one thread build the index while you keep track of what the other threads are doing, so that when you go to commit, if they modified the table in a way you missed when you built the index, you can go back and add those changes in, right? There's a sketch of the non-blocking idea below. Or you could do it the Postgres way: Postgres actually does it in two passes, and there's a flag that says whether you want it concurrent or not. Another stretch goal would be to support building the index in parallel, with multiple threads building it at the same time. But just having the ability to build an index in a non-blocking fashion would be huge, and it'll again touch a lot of different parts of the system. Yes? Does that not depend on the existence of the catalog? The question is, and it gets back to my point before, does this not depend on the existence of the catalog? Yes and no. Drop index, yes, because you wanna drop it from the catalog. But you can build the index without worrying about a catalog, because you just wanna populate it, right? We don't have query plans yet either. We don't have an optimizer. So no one's gonna be using the index unless it's hard-coded to do that. So you can implement this without the catalog, without a parser, without a query optimizer.
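Here's a toy sketch of that non-blocking build: scan a snapshot of the table while concurrent writers also record their changes in a side delta log, then replay the deltas to catch up. All the names and types here are hypothetical, and a real version needs proper latching, visibility checks, and a loop that drains the log before atomically flipping the index to visible in the catalog:

```cpp
#include <cstdint>
#include <map>
#include <mutex>
#include <utility>
#include <vector>

using Key  = int64_t;
using Slot = int64_t;

struct DeltaLog {
    std::mutex mu;
    std::vector<std::pair<Key, Slot>> entries;  // writes made during the build
};

struct Table {
    std::vector<std::pair<Key, Slot>> rows;
    DeltaLog *capture = nullptr;                // non-null while a build runs
    void Insert(Key k, Slot s) {
        rows.push_back({k, s});
        if (capture) {                          // writer records its change
            std::lock_guard<std::mutex> g(capture->mu);
            capture->entries.push_back({k, s});
        }
    }
};

void BuildIndexNonBlocking(Table &t, std::map<Key, Slot> &index) {
    DeltaLog log;
    t.capture = &log;                           // phase 0: start capturing writes
    auto snapshot = t.rows;                     // phase 1: bulk load a snapshot
    for (auto &[k, s] : snapshot) index[k] = s;
    std::lock_guard<std::mutex> g(log.mu);      // phase 2: replay missed writes
    for (auto &[k, s] : log.entries) index[k] = s;
    t.capture = nullptr;                        // real code: flip catalog flag
}
```

The blocking version is just phase 1 with the world paused; everything hard about this project lives in making phases 0 and 2 correct under concurrent transactions.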
All right, so I haven't really talked much about the major goal of the system we're building here at CMU, but the short version is that we're trying to build it to be autonomous, meaning we wanna use machine learning on the inside to figure out how to tune itself automatically. Oracle has their own, what they call a self-driving database; we're calling ours a self-driving database too. You can buy theirs; ours is better, but it doesn't exist yet. We'll get to that later. So what we need, which we had in the old system but don't have in the new one, is the ability to collect metrics. You wanna be able to record the low-level things the database is doing as it executes queries: locks held, latch wait time, pages read, pages written, all those low-level stats. We wanna collect them as we execute transactions and then store them in our own database. We had a framework in the old system, which was actually a class project last year, that could do this in an efficient, non-blocking manner. We basically wanna revive that and integrate it into our system. To be clear, it's not just taking the old code and copying it back over; it's gonna be quite a bit of work to get this working correctly. And the stretch goal would be, if we can actually do this and start reading out data about what happens when we execute transactions, maybe we can build a little mini machine learning model to predict various things in the system, like how long we're gonna have to wait for a latch for each transaction or each query.

The next thing is adding support for settings. This sounds like it'd be super simple, right? I wanna have a configuration file that I load in, and then keep track of things like the size of a buffer. But if you wanna do this in the context of a self-driving database, it actually becomes quite complicated. Essentially, for this project, you'd build a new settings manager, which lives in the catalog. In Postgres it's called pg_settings, and I think it's just a giant hash map of key/value pairs: here's the setting and here's the value. So then we would add support such that when you modify this, it's not just flipping the value in that hash map; you then wanna fire off a trigger that is responsible for going and updating the component of the system that's using that value, so it reconfigures itself to reflect the change you made. For example, if I have a bunch of garbage collector threads, and I've defined in my settings that I'm allowed five garbage collector threads, and I now change that to four, that would fire off a trigger that goes to the garbage collector and says: you now have one less thread, reorganize. There's a little sketch of this trigger idea below. So it's not just flipping the values in pg_settings; it's actually writing the runtime code that reconfigures the system. And again, for the self-driving part, the goal is that we wanna do all of this automatically: have something modify the values in pg_settings, fire off the triggers to reconfigure the system, and then observe what happens when we do that.
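A minimal sketch of that settings-manager idea, with made-up names (this is not our actual code): a key/value map plus registered callbacks that fire when a value changes, so the owning component actually reconfigures itself:

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

class SettingsManager {
    std::unordered_map<std::string, int> values_;
    std::unordered_map<std::string,
        std::vector<std::function<void(int, int)>>> callbacks_;
  public:
    // A component registers to be told (old value, new value) on change.
    void Register(const std::string &key, std::function<void(int, int)> cb) {
        callbacks_[key].push_back(std::move(cb));
    }
    void Set(const std::string &key, int new_val) {
        int old_val = values_[key];
        values_[key] = new_val;            // flip the value in "pg_settings"...
        for (auto &cb : callbacks_[key])   // ...then fire the triggers
            cb(old_val, new_val);
    }
};

// Hypothetical usage: the garbage collector registers to resize its pool.
//   settings.Register("gc_threads", [&gc](int, int n) { gc.Resize(n); });
//   settings.Set("gc_threads", 4);  // GC drops from five threads to four
```

So changing gc_threads isn't just a map update; the garbage collector's registered callback is what actually resizes the thread pool, and that runtime reconfiguration code is the real work of the project.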
Next: checkpoints and recovery, as we talked about last class. Right now we can do write-ahead logging: if we make changes to a table, we can correctly record the redo entries and write them out to disk. But we cannot take checkpoints, and we cannot fully recover from the log. For checkpoints, what that means is obvious: we don't have them at all. For recovery from the log, it means we can repopulate the tables, but we can't restore what the catalogs were to say, here's what the schema was. So for this project, you'd start off with checkpoints on just the catalog tables, which we're adding in a week or two. Then you'd implement recovery that loads that checkpoint in, re-instantiates the catalog, and then replays the write-ahead log to populate the tables again. The next step would be to extend your checkpoints to take a consistent snapshot of the database, store that out into the checkpoint along with the catalogs, and be able to recover that as well. Again, this is not gonna be easy, but it'll touch a lot of different parts of the system and you'll get a good understanding of what the storage manager is actually doing.

All right, I'm gonna go through the rest quickly; sorry, I'm way over time. Next is unified garbage collection. Right now we have one garbage collector for the tables and another for the Bw-Tree; we want them to be a single epoch management system. After that is interval garbage collection, the same technique you read about in the HANA paper. Right now we only do minimum-timestamp garbage collection, but we wanna be able to identify a whole region of versions that we know we don't need anymore, because no transaction can see them, and excise them all at once. For this, it's not about copying the exact implementation from the paper, because how they organize transactions is different from how we do things; they have those version groups, or whatever they're called, where they cluster things, and we don't have that.

Then, if you loved project one, which is due today, you can go much deeper into that: additional performance analysis and optimization in really any part of the system. Basically, running Callgrind and perf, identifying bottlenecks, and identifying ways to re-architect the system to make it more efficient. The idea would be that you go off and run perf and Callgrind on the other workloads we didn't have you examine, identify what the bottlenecks are, and then propose solutions. The one I know about in particular right now is that for inserts, we always go to the same block: every thread will always try to insert into the same block, but we could easily parallelize that.

All right, the last two I wanna quickly talk about are probably gonna be the two hardest ones. I haven't described what query compilation is, and I haven't described LLVM, but the basic idea is this: the query plan shows up as a tree structure, and instead of traversing that tree and executing the query by pulling tuples up the plan tree, I convert the query plan into an intermediate language, which I then compile with LLVM into machine code, the way Clang or GCC would compile a program. Then I basically have machine code that's hard-coded for my query, and I can run that way more efficiently than I can by interpreting the query plan. We'll have a whole lecture on this; the performance difference is quite significant. The old system had this, but it was a pain to work with, because if your generated code crashes, you don't land in a stack trace like you do in an interpreted version. If you're in GDB, you don't have a stack trace; you land in x86 assembly, so it's really hard to figure out what the hell is going on. We have a new engine that doesn't have this problem, because it converts the plan to a DSL that looks like C, compiles that into opcodes that we can interpret, and then you can actually step through, see what's going on, and match that up with the source code that generated those opcodes. Then, once that works, you can fire it off to LLVM, compile it, and not worry about having to debug x86. My PhD student Prashanth has a new engine that does this technique, but right now it's in a separate code base. So for this project, the goal would be to bring over his new engine, because we want to do this anyway, and integrate it into the full system. Right now we have the top layer, we're working on networking, we have a query parser, we're working on the binder and the optimizer, and we have a storage manager. But the thing in between, the execution engine, is missing, and that's what this would be. So this would be working with the PhD student, helping him bring over the new engine and integrate it into the system. You will touch everything to make this work.
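To make the compilation idea concrete: for a query like SELECT b FROM t WHERE a = 1, the engine effectively emits a specialized function equivalent to the hand-written C++ below. The real pipeline goes through the DSL and LLVM rather than C++ source, so treat this purely as an illustration of what the generated code boils down to:

```cpp
#include <cstdint>
#include <vector>

struct Row { int64_t a; int64_t b; };

// The constant 1, the column offsets, and the predicate are all baked in
// at query-compile time, so the hot loop has zero interpretation overhead.
std::vector<int64_t> CompiledQuery(const std::vector<Row> &t) {
    std::vector<int64_t> out;
    for (const Row &r : t)
        if (r.a == 1) out.push_back(r.b);
    return out;
}
```

All the interpretation machinery, walking operator nodes, virtual calls per tuple, evaluating predicate trees, collapses into that one tight loop; that's where the big performance win comes from.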
And I'm cautioning you that this one is hard, because you have to understand LLVM and other things, but we can help. Related to this, the last one is to extend the existing engine to support index scans. Right now we can only do sequential scans, so you would have to modify the actual LLVM compiler stuff that he has, his domain-specific language and the opcodes, to know about indexes and then run queries on them. We've tried this twice now in the old system and we failed; third time's the charm. In the new engine it should be much easier, because we cleaned up the index API and it's easier to debug. So this is not impossible, but it's not trivial. Okay? Any questions? I went through these very quickly; I'll post all of them online. All right, how to start: form your team, meet your team, figure out what you want to do. Sign up and mark which project you want to work on, look over the source code, figure out what you need to implement, and then when you come back from spring break, you'll propose it. I'm around during spring break if you want to meet and discuss things, and if you're interested in the LLVM stuff, the recovery stuff, or the optimizer stuff, there are students still around here at CMU who built those things, and we can get you in touch with them to help you figure out what you should be looking at in the code. Okay? All right, next class we'll talk about how to actually execute some queries. Okay? All right guys, have a good weekend. See you.