All right, so this lecture I'm giving today is actually new for the semester, but it's something I've been interested in for a while, because we've certainly been struggling with this in our own system as we build out the networking layer. Just real quick, a reminder of the major things coming up: in class on Wednesday this week, two days from now, we'll have the midterm, and it's designed to take the full hour and twenty minutes. Then during spring break, on March 12th, the skip list will be due at midnight. And on the first Monday after spring break, when you come back, you'll be presenting your proposals for the project.

For today's lecture, I'll spend the first half talking about the paper you read on the networking protocol, and then I'll finish up by going through a bunch of different topics you can choose from for project number three. This is obviously not an exhaustive list, but these are the ones I think are interesting and doable within the time we have left in the semester. If you have any questions about the things I'll talk about, send me an email; I'll have time to meet with your group next week or early this week to discuss things, otherwise we can correspond over email. And I'll send out a link on Piazza tonight to a Google Docs spreadsheet where you can list your project group and your project topic, because we can't have two groups picking the same thing.

All right, so for today's class the focus is going to be on the networking protocol for the database system: how do we actually have our client interact with the database, send it messages, and get back data? First we'll talk about database access APIs, then we'll expand that and talk about what the actual network protocols look like, which was in the paper you read, and then we'll finish up with optimizations to minimize the overhead of going through the operating system to send network messages. These are called kernel bypass methods.

So the database access API is essentially how a program is going to interact with the database. In all the demos I've given in this class so far, when I open up a terminal and connect to MySQL or Postgres, that's all through the terminal. Essentially what's happening is that I'm writing SQL queries on the command line, I hit enter, it sends a network message, the database system executes it and sends back the result, and we print it out to standard out on the terminal. But obviously nobody writes programs that go through the terminal, because it would be really, really slow to parse text from the terminal output. So instead, real programs that use a database will access it through some kind of API. This gives us a programmatic way to send queries, get back results, and incorporate them into our program logic to do whatever it is our application needs to do.

There are essentially three ways to do this. The first is to use a direct access API, which is going to be very database-system specific. Think of SQLite: you can open it up inside your application and make calls to its C API, or whatever language binding you're using, to invoke queries and get back results from the database system.
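Just to make that direct-access style concrete, here's a minimal sketch using Python's built-in sqlite3 binding; the file name, table, and values are made up for illustration.

```python
import sqlite3

# Open the database file directly inside the application process --
# there is no separate server and no network round trip.
conn = sqlite3.connect("app.db")
conn.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", ("andy",))
conn.commit()

# Results come back already converted into native Python types.
for row in conn.execute("SELECT id, name FROM users"):
    print(row)   # e.g. (1, 'andy')

conn.close()
```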
And then there are two other standards, ODBC and JDBC, which are designed to be sort of universal APIs that every database system can support, so that if you write your application against ODBC for DB2, then in theory, with very little change, that application can be ported to MySQL or Postgres. And I say in theory because, as we'll see in a second, the ODBC API essentially just defines what commands you want to execute at a high level, like I want to execute a query, I want to connect to a database. It doesn't say anything about what the query itself looks like, which is going to be SQL, and that may differ based on what kind of database system you have. So it's a high-level, programmatic way to do this.

It started with ODBC in the early 1990s. Prior to this, all the different database systems had their own libraries that you would use to invoke commands on your database system. That made all your code really unportable, because the DB2 library would be totally different from the Oracle library. So there were some attempts in the 1980s to come up with a standard API, and there was a competing proposal at the same time from other database vendors. But for whatever reason, the proposal from Microsoft and a company called Simba Technologies, which is still around, for the Open Database Connectivity API, ODBC, ended up being the standard that all the different database companies adopted. So pretty much every major relational database, and in some cases even some of the NoSQL databases, will have an ODBC driver: an implementation of the ODBC API that you can link into your program to invoke queries and interact with the database system.

The basic model of ODBC is a device driver model. That basically means the database vendor has to provide an ODBC driver with all the logic needed to convert whatever commands you have in your application code into requests sent over to the database system, get back the result, and present that result in your programming language. Think of it as the Python library or C library you would use in your imperative program to open a connection to the database, send a query, and get back a result. The ODBC driver itself knows how to take the standard ODBC calls and convert them into the database-system-specific wire protocol that the database system expects. So at a high level, ODBC is just a programming API, but the part we're going to focus on today is what we call the wire protocol, and that is always going to be database-system specific.

The things you can invoke with an ODBC driver are the things you would expect: system discovery, connection, disconnection, and then you can send any SQL queries you want over that connection. You get back the results from the database server, and the driver knows how to take whatever format those results are in and convert it into the format your application expects. So if it's Python, it needs to produce Python integers or Python strings; with C it would be C integers or C strings. The driver does all of that for you. One other thing an ODBC driver can do is emulate certain features or functionality that are defined in the ODBC API but that your database system may not actually support.
An example would be cursors. With a cursor, you run a select statement, and instead of getting back all the results at once, you basically get an iterator and you can say get next, get next, get next. What could happen is that the cursor is stored on the database system, especially if it's a really large result, and any time you say get next in your application code against the ODBC driver, that causes another request to go over and you get back another batch of results. But if your database system doesn't support cursors, what the driver can do instead is just send the request, get back all the results, hold them in memory in the driver, and pretend it has a real cursor. This is all hidden from the application; it's all handled inside the ODBC driver.

Now, in the Java world they don't have ODBC; they have something called JDBC. Sun in the 1990s recognized that if they wanted Java programs running in the enterprise, those programs needed to connect to databases, so they proposed their own standard programming API for connecting to database systems, which they called JDBC, and it's heavily modeled after ODBC. The way to think about it is that ODBC is for C programs and JDBC is for Java programs, but at a high level they're basically doing the same thing.

What's interesting about JDBC, and worth mentioning, is that it supports a bunch of different ways to actually implement the driver. The first way is that the JDBC implementation is just a wrapper around ODBC: through JNI, the Java code invokes the C ODBC library, which knows how to send the messages over the wire protocol to the database system. The alternative is a native API driver: Java code that, instead of invoking ODBC, invokes database-system-specific commands directly on the database system. Whether that goes over the network or not depends on the implementation. The third approach is to use middleware: some service running on another machine, your JDBC driver connects to that, and the middleware knows how to take your request and convert it into the appropriate database-system request. The first two would be embedded in the same JVM as your running program, whereas the middleware is an outside process. And the last one is a protocol driver implemented entirely in Java: it runs native Java code and speaks whatever wire protocol the database system needs. I think that last one is the most common for the major systems. But at a minimum, every database system, if you actually want it to be usable, needs ODBC; whether you also have a driver written entirely in Java depends on whether you have the time to implement it.

Yes? What's the difference between number two and number four? So the question is, what's the difference between number two and number four? Number four would be invoking the wire protocol, whereas number two would be calling directly into the database system; if you were on the same machine, you could send IPC requests to the database process. Does that imply that number two can only run on the same machine? For the example I gave, yes.
I don't know whether that's actually required, though. Again, I think the first one and the last one are the most common, although I don't think the first one is supported anymore. So the last one is definitely the most common in the major systems.

All right, so the thing we want to talk about is: what is this vendor-specific database protocol? The way to think about it is that all the major database systems support their own proprietary wire protocol that they use to communicate between the client and the server over TCP/IP. As far as I know, no database, especially one that wants to support transactions, uses UDP, because you have no guarantee that the packets will actually show up, and if you want that guarantee you have to do the work yourself at the user level to make sure it happens. So everyone connects over TCP.

A typical client-server interaction looks like this: the client connects to the database system, there may be an SSL handshake to establish a secure connection, and then you begin the authentication process. Then the client sends a query request. The database system takes that query, executes it, takes the results, serializes them into the format the wire protocol expects, and sends that back to the client. The client deserializes that and emits it to the application in the format the programming language or programming environment expects. All of this back and forth is the wire protocol portion of what we're talking about in today's class: what does the request look like when it comes over, and how do we take our result, package it up, and send it back?

The paper you read was really focused on that last part: how do you actually send back the results? Because that's the most expensive thing. Sending a single query isn't that big a deal; most SQL queries are a few kilobytes. Sometimes they're really large, but even then you're just sending the SQL itself, just text, just strings, so there aren't many optimizations you can do there. So the focus today is really on the results.

As I said, all the major database systems have their own wire protocols and they're not compatible with each other at all. This is what ODBC solves: ODBC masks the complexity of the wire protocol so you can write your application against ODBC and not worry about what the actual packets look like; the driver takes care of all of that for you. One thing I do see, though, is that newer systems don't actually implement their own wire protocol. It's not easy to do, and then you have to support it, and you have to support it in your drivers. So it's very common now for newer systems to pick one of the open source systems and just implement whatever wire protocol it has. The advantage is that you basically get the entire client-side ecosystem, all the drivers, for free. Postgres has drivers for Python, for C, for any programming language you can think of; they have bindings or libraries for it. So if you implement the Postgres wire protocol, which is what we did, you don't have to implement and support all those drivers all over again. It's very common for startups and newer systems to go this route.
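To make the idea of a wire protocol concrete, here's a rough sketch of how the Postgres protocol frames a simple query message: a one-byte type code, a four-byte big-endian length that counts itself, and a null-terminated payload. This is only an illustration of the framing, not a complete client.

```python
import struct

def encode_simple_query(sql: str) -> bytes:
    """Frame a SQL string as a Postgres-style 'Q' (simple query) message."""
    payload = sql.encode("utf-8") + b"\x00"   # null-terminated query text
    length = 4 + len(payload)                 # length field counts itself
    return b"Q" + struct.pack("!I", length) + payload

def decode_message(buf: bytes):
    """Split one framed message back into (type, payload)."""
    msg_type = buf[0:1]
    (length,) = struct.unpack("!I", buf[1:5])
    payload = buf[5:1 + length]
    return msg_type, payload

msg = encode_simple_query("SELECT 1")
print(decode_message(msg))   # (b'Q', b'SELECT 1\x00')
```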
But the one thing to point out is that just because your database system speaks the wire protocol of another open source system doesn't mean you're automatically compatible with it. You can send packets back and forth, but what actually goes inside those packets can be totally different or totally unsupported in your new system. For example, if you speak the Postgres protocol, a lot of tools will inspect the Postgres catalog to figure out what tables and data you actually have. If the catalogs in your new system don't look like the Postgres catalogs, then an application that connects to it will be able to send requests, because you speak the Postgres wire protocol, but it won't get back the results it expects. Same thing with SQL: there may be a SQL feature that Postgres or MySQL supports that your new system, which speaks their wire protocol, doesn't support.

I poked around a little over the weekend, and the paper talks a bit about this, but as far as I can tell the two prominent protocols that everyone implements are MySQL and Postgres. For MySQL: MemSQL, Clustrix, ActiveDV, and TiDB out of China all speak the MySQL protocol. In the case of Postgres: Redshift, Greenplum, and Vertica speak the Postgres protocol, but that's no surprise because they're essentially derivatives of Postgres. Greenplum took Postgres 8 and rewrote the bottom portion of the system to make it parallel and a column store; same thing with Vertica. So it's no surprise they speak the Postgres protocol, because they were Postgres to begin with. Same with Redshift: Redshift was originally ParAccel, which is also based on Postgres. Amazon bought a license for it and repackaged it as Redshift, so it's based on Postgres and you get the Postgres protocol; they've rewritten a lot of it since then, so I don't know how much of Postgres is still in there. In the case of HyPer, CockroachDB, and our system, Peloton, these are all written from scratch. We looked at the Postgres wire protocol; my students who are here opened up Wireshark, figured out what the packets were, and we implemented that. There's actually a lot of good documentation for Postgres. It's not trivial to do, but there was enough documentation for us to implement the Postgres wire protocol fairly easily, whereas for MySQL I think it's not as clear.

So again, the advantage is that you get the drivers for free, but you still have to support all the extra stuff in the system, like the catalogs and the SQL features, to make it actually appear as if it's Postgres. This is one of the things MemSQL did right when they first started their company: they went out of their way to make it look as much like MySQL as possible so it could be a drop-in replacement for MySQL. You didn't have to change your application code; you'd point it at your MemSQL cluster and it looked like MySQL, and it ran much faster. From a business perspective, I think that was really smart.

All right, so for today's class, from the paper you read, I want to discuss four aspects of the wire protocol that affect performance a lot. I really like this paper; it's from the MonetDB guys at CWI in Europe. The reason I love it is that it's a paper I wish I had written.
This is something I've been thinking about for a while, and man, I'm disappointed I didn't write it, but I'm glad somebody did, because there's no other paper that goes into this kind of detail and breaks down what's going on with JDBC and ODBC and how it affects performance. I think this is one of the best papers I've read in recent memory, so hopefully you enjoyed it as much as I did.

All right, so the things we want to discuss are row storage versus column storage, compression, data serialization, and string handling. One thing I'll say is that a lot of these topics sound a lot like the things we talked about before for database storage: we talked about compression, we talked about row storage versus column storage. A lot of this is exactly what we had to deal with when we talked about executing queries in our database, but now we're talking about how to send data back to the client.

The first issue is whether you want a row-store or column-store layout in the results you send back. ODBC and JDBC are inherently row-oriented APIs. What does that mean? It means you execute a single query and then you have this cursor or iterator where you call fetch-row or get-next-row, and you have a while loop that keeps repeating until you run out of rows. So you're processing things inherently on a row-by-row basis. That made sense at the time these APIs were developed, in the 1990s, because back then databases were primarily used for business applications and transaction processing workloads where you're clearly operating row by row. This was before people really started doing big data and machine learning on large data sets; back then databases weren't that big, so the programs you wrote against them were business applications that went over things row by row.

But as the paper discusses, now people want to do data analysis on the data they've collected, and all these data analysis frameworks don't run inside the database system. Things like Spark, TensorFlow, and Torch are external to the database system, which means we need to get the data out and put it into those frameworks. And in many cases the underlying storage these frameworks use to represent data or matrices is a column store. So getting the data out row by row, repackaging it into columns, and then putting it into TensorFlow is going to be slow. That's the problem they're trying to solve: how do you do bulk data extraction out of the database and put it into a form these programs can decipher quickly? The solution they propose is exactly what we saw when we talked about query processing models: do it in a vectorized way. Instead of sending things back a single row at a time or a single column at a time, you send back batches of rows that are internally represented as columns. That makes it easier for the ODBC driver and the data analysis program to take the data out and put it into these other frameworks. So again, this looks a lot like what we talked about before with query processing and query execution.
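Here's what that row-at-a-time access pattern looks like with Python's standard DB-API cursor interface, using sqlite3 purely as a stand-in for any driver; fetchmany() is roughly as close as the API gets to the batched, vectorized style the paper argues for.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b REAL)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(i, i * 0.5) for i in range(10000)])

# Classic ODBC/JDBC-style loop: one row per call, lots of per-row overhead.
cur = conn.execute("SELECT a, b FROM t")
while True:
    row = cur.fetchone()
    if row is None:
        break
    # ... per-row application logic ...

# Batched alternative: pull vectors of rows, then pivot each batch into
# columns for an analysis framework that wants columnar data.
cur = conn.execute("SELECT a, b FROM t")
while True:
    batch = cur.fetchmany(1024)
    if not batch:
        break
    col_a = [r[0] for r in batch]   # column-wise slices of the batch
    col_b = [r[1] for r in batch]
```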
The column-store approach is better for large-scale data analysis, but because we want to send things over the network, they propose doing it as vectors.

The next issue is compression. Again, we had this big trade-off when we talked about compression a few weeks ago. We can use naive compression, where we just take the block of bytes we want to send back, run Snappy or gzip on it, and send that over the wire; on the client side you just unzip the entire thing. Alternatively, we can do specialized encoding on the columns of data, things like run-length encoding, delta encoding, or dictionary encoding, where we take advantage of the skew and the repetitive nature of the data we're sending back and get a better compression ratio. What the paper shows is that the naive compression scheme actually works out to be the best approach, because it's agnostic to the layout of the data and you don't have to do any extra work to figure out which compression scheme to use. So they argue you just take whatever you have, compress it with Snappy, and that'll be good enough. They do discuss that if your network is slow, then maybe you want gzip or XZ, something more heavyweight, and if the network is much faster, then something lighter like Snappy or Zstandard is better. And obviously you get better compression ratios the larger your chunk or block sizes are, so they discuss the trade-offs between these settings. The main takeaway is that naive compression is probably going to be the best approach.

All right, another big issue is how we're actually going to serialize our data and send it back. There are essentially two ways to do this. The first is a binary encoding, where you take the data you have, encode it in binary form, put that in the packet, and send it back. The tricky thing is that if your database server is running on a CPU with one endianness and your client runs on another, then on the client side you need to make sure you flip the bytes around so the value comes out in the correct form. Most of the time everybody's running on Intel CPUs, so this isn't really an issue, but you could be running on ARM or POWER and it could be different, so you need to be able to handle it. I forget whether they said the client driver always handles this or the server side does; I think the client driver always handles it, but you need to know what you're getting and what you're returning.
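As a tiny sketch of these ideas, and not any system's actual format: encode the values in a fixed binary layout in network byte order, so a client with a different endianness knows to byte-swap, compare the size of the text alternative, and then apply naive compression to the already-serialized buffer.

```python
import struct, zlib

values = [123456, 7, 987654321]

# Binary encoding: 4-byte signed ints in network (big-endian) byte order.
# A little-endian client would byte-swap these on the way in.
binary = b"".join(struct.pack("!i", v) for v in values)   # 12 bytes

# Text encoding: the same values as characters, plus delimiters.
text = ",".join(str(v) for v in values).encode("ascii")   # 18 bytes here

# Naive compression is applied *after* serialization, on the whole buffer.
compressed = zlib.compress(binary)
restored = struct.unpack("!3i", zlib.decompress(compressed))
print(len(binary), len(text), restored)   # 12 18 (123456, 7, 987654321)
```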
Another thing they talk about is that the closer the serialized format of the data in your packets is to the form the database stores things in internally, the less work you have to do, because you can take the bytes from your result and put them into the packet essentially as-is. We have this problem in our own system: we need to support the PostgreSQL wire protocol, but we don't store data exactly the way PostgreSQL does, so we have to do an extra transformation step to put it into the correct form. I don't think it's a major overhead, but it's something we have to do to make it work. And if we had modified the PostgreSQL wire protocol so we could put our native format directly into the packets, then we would have had to go change all the client drivers to handle that different format, and we obviously don't want to do that, because that's a lot of work to support everything.

There are basically two ways to do binary encoding. The database system can write its own serialization API to put the data into the messages, or you can rely on existing libraries like Google's Protocol Buffers or Facebook's Thrift: you take the data you want to send back, package it up into some kind of message structure, the library's serialization routine converts that to bytes, you put those into your packets and send them back, and on the client side you do the reverse. They argue that implementing your own is always the better way to go, because things like Protobuf are general-purpose libraries that do a bunch of extra stuff you may not actually need when sending data back over your wire protocol. For example, Protobufs keep track of a version ID for the message you're sending, so the other side knows what schema to use when it deciphers the bytes. That's extra overhead you may not need if you control both the client side and the server side: you know exactly what the wire protocol should be, and you can conform your data to that.

The alternative is to take all the binary values you want to send back and just convert them into text: call to-string on them, package that up in your packets, send it over to the client, and the client does the reverse and puts it back into binary form. Say I have a four-byte integer with the value 123456. When I want to send it over the network, I literally make it a string with the characters one, two, three, four, five, six, send that over, and the client knows how to reverse that and put it back into binary form. The advantage is that you don't have to worry about endianness at all, because it's just characters; whatever converts the text string back into binary form on the client side, something like atoi, will handle the proper ordering of the bytes. The downside, of course, is that this is going to be much larger for long integers. In this case I have to store at least six bytes for six characters, and I probably also need a length, so I know how many bytes there are, or a null terminator, so it's probably more than six bytes. Even though it's the same value, I have to send more bytes.

Most systems implement the binary encoding. I think Postgres can also switch into this text mode, is that correct? Yeah, it does both, and MonetDB does this as well. You get better performance with the binary encoding, but you have to do more work.

Question? What's the relationship between serialization and compression? So the question is, what is the relationship between serialization and compression? If you're doing naive compression, just running Snappy or gzip on it, you serialize first, then you have your byte blocks and you compress those. So the bytes for serialization come from compression?
The bytes for serialization come from compression? No, it's the other way around: you serialize first, then you compress. The other order doesn't make sense; if I compress the string first, I have a bunch of random bits and I can't make any sense of them. The student asks whether it matters which one comes first. So think of it this way: there's more going on than the single value I'm showing. Say you have a row with ten attributes and you want to send that row back. How do you actually represent it, here's the first value, here's the second value? You need delimiters, or you need a way to say that at this offset I have this and at that offset I have that. In our database system we know what those offsets are because we have the schema, but we're not going to send the schema back over the network, so we need a way to break things up; the schema in our catalog for deciphering the bytes in our tables is not going to be exactly what our query is generating. So the compression happens after serialization? Compression happens after serialization. For naive compression, you have a giant byte array, you compress it, and you send it back as a blob. For the columnar encodings, if you're trying to use a compressed encoding like RLE, that's part of the serialization process itself. Okay.

All right, so the last thing is how we handle strings, and again these are exactly the same issues we have with storage inside the database system. There are three approaches. The first is null termination, which is how it's done in C and C++: we have our string, a bunch of ASCII or byte characters, and at the end we store a null character to say this is the end of the string. What happens on the client side is that you don't actually know where the end is; you have to scan and look at every single character until you find the null, and then you know you're done. The second is to prefix the string with its number of bytes: the first four bytes, or however you encode it, say here's my length, and I know that from that point to that offset is the actual string I'm sending back. The third is a fixed-width approach: I know the max size of the attribute I'm serializing on the server side, I make every string for that attribute in every tuple I send back be that size, and I pad out the suffix with spaces so everything is nicely aligned; then I don't need to store a length prefix or a null terminator. I think the paper said that in some cases one was faster and in some cases the other; the fixed-width approach was faster or better if the max size was small, but that's not always going to be the case, because it depends on how things are stored in the table.
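Here's a minimal sketch of those three string layouts, just to make the trade-offs concrete; these are illustrative encodings, not any particular system's wire format.

```python
import struct

s = "andy"

# 1. Null-terminated: the reader scans until it hits the zero byte.
null_terminated = s.encode("utf-8") + b"\x00"

# 2. Length-prefixed: a 4-byte length, then exactly that many bytes.
length_prefixed = struct.pack("!I", len(s)) + s.encode("utf-8")

# 3. Fixed-width: pad every value out to the column's max size, so
#    offsets are predictable but short strings waste space.
MAX_WIDTH = 16
fixed_width = s.encode("utf-8").ljust(MAX_WIDTH, b" ")

print(len(null_terminated), len(length_prefixed), len(fixed_width))   # 5 8 16
```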
All right, so I want to show two graphs. I know the paper proposes this vectorized, compressed way of sending data back; I don't want to cover that, I just want to show what the overhead, or the performance, is for the existing protocols that are out there. We got this data from the MonetDB guys. They compared eight different implementations: MonetDB, which is the system they built; MySQL with and without compression; Postgres; Oracle; DB2; Hive, which is a SQL front end for HDFS and Hadoop; and MongoDB, because MongoDB actually has an ODBC driver.

For the first experiment, they just measure how much time it takes to send one tuple: a single select statement that grabs one tuple, and how long it takes to send it back. On the ODBC client side they didn't actually do anything with the result; they just threw the data away. So it's measuring how much time it takes for the server to serialize the result and send it back, with the client and the server running on the same machine. We'll see in a second that you may still have to go through the operating system for that, and that's going to be slow. What the graph is really measuring is the overhead of the serialization approach for the different protocols.

What you see going across is that the fastest one actually ends up being MySQL. These measurements are in milliseconds. The other thing I'll say is that the way they minimized the overhead of query execution was to run the query multiple times so everything was cached on the server side; when they ran the query again to take their measurements, it wasn't going through the whole parsing, planning, and optimization phase, it was just getting the result and sending it back over the wire. The two surprises for me were how much slower DB2 was, and Hive. For Hive, I don't know whether it's because it's Java or because of the HDFS and Hadoop layers, but the protocol was definitely more heavyweight than all the other ones. And Mongo did decently. One other thing to point out is that in this experiment MonetDB was doing the text encoding whereas MySQL and Postgres were doing the binary encoding, so there are some cases where the binary encoding was much faster and other cases where it was actually slower.

All right, the next experiment: now they transfer a million tuples, and this is again the problem they're trying to solve, how to do large-scale data extraction efficiently while going through a client protocol like ODBC. Think of it as running a select-star with no where clause and seeing how fast you can stream the data out. I wanted to break up the presentation so I could show you the compression numbers first and then everything else, but I can't do that because MonetDB comes first. Along the x-axis they increase the amount of latency in the network, and the y-axis is how long it actually took. Of course, no surprise, for MonetDB and for all the systems, as you increase the latency, the time it takes to get the data out goes up, because the network is slower.
The one I want to point out, though, is compressed MySQL: by using a heavyweight compression scheme, you end up amortizing that cost, so even though the network got slower, the total time stayed essentially the same. To me that's because you're simply sending less data, and the computational overhead of doing the compression ends up being the dominant cost. For everyone else, as the network gets slower, the performance gets worse. The other surprising one is Oracle, which starts off being the second fastest on a fast network but ends up being slower as you increase the network latency, and I forget why they said that was the case. All right, any questions about this?

All right, so the wire protocol implementation inside the database system is not going to be the only source of slowdown when we want to move data in and out. As I've said multiple times this semester, the operating system is our frenemy: it causes problems, and one big problem we're going to hit is the TCP/IP stack. When we want to send messages over the network, we have to go through the kernel, and that turns out to be really, really slow. The reason is that the operating system implements the TCP/IP protocol internally using interrupts and context switches. When a message shows up, you get an interrupt that says, hey, this NIC now has a packet, do a context switch to some other thread so it can go process it. Then, as the data moves up through the different layers of the system, from the actual NIC, the hardware device, to when it shows up in our beloved database process, that data is going to be copied multiple times. That means memory allocation, and memory allocation means you take latches or locks inside the OS to get memory. And in general, because the kernel is multi-threaded, there are going to be all sorts of latches inside the kernel to protect its various data structures, and there are context switches because some thread goes off to handle the interrupt and take our packet. So we need the OS to survive, but we want to avoid it as much as possible when we send our network messages.

This is what the kernel bypass methods do. The idea is that with kernel bypass, we allow our database system to get packets on and off the hardware device, the NIC, the Ethernet card, without having to go through the kernel. The advantage is that you don't have any context switches, because your database process's threads can go down to the NIC and get data directly. You don't do any unnecessary data copying, because you literally take the buffers off the NIC and put them directly into your database process, and you can share memory between the two. And because we're not going through the OS TCP/IP stack, we don't have to worry about the multi-threading issues that can happen inside the kernel.

There are essentially two ways, or two approaches, to do this. The first is what's called the Data Plane Development Kit. This is an actual library, DPDK, and it's an implementation of a kernel bypass method to get to the NIC.
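For contrast with kernel bypass, here's what the conventional path looks like in ordinary socket code: every recv is a system call, and the kernel copies each chunk of the TCP stream from its own buffers into the buffer our process provides. The port number and the process_request handler here are just placeholders; DPDK's whole point is to skip this path and hand the NIC's packet buffers to the process directly.

```python
import socket

# A conventional TCP server loop that goes through the kernel's network stack.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 15721))
srv.listen(1)

conn, _ = srv.accept()            # syscall; the kernel did the TCP handshake for us
buf = bytearray(64 * 1024)
while True:
    # Each recv_into() is a context switch into the kernel plus a copy
    # from the kernel's socket buffers into our user-space buffer.
    n = conn.recv_into(buf)
    if n == 0:
        break
    process_request(buf[:n])      # hypothetical handler inside the database
conn.close()
```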
The other approach is remote direct memory access, RDMA. This is more of a technique or a concept for accessing memory on another machine. So DPDK is a library you can download, whereas RDMA is more of a technique; only the first is usually labeled kernel bypass, but RDMA is technically kernel bypass too, so I didn't know what else to call the two of them.

DPDK is a set of libraries, I think originally written by Intel, that has since been open sourced and is now part of the Linux Foundation. Essentially these libraries allow your program to access the NIC directly. You have to do a bunch of extra work in your database system code, which is okay because that's what they pay us to do, to essentially manage memory ourselves. We pass buffers to DPDK, DPDK puts the raw packets it gets off the NIC into those buffers, we bring them up to our database process, and we can process them just as if we got them from the OS. So you're skipping the TCP/IP stack in the operating system entirely. The advantage is that you don't get any data copying, you don't have any system calls, and you don't have to handle any interrupts in the kernel; you basically go around the kernel to get data on and off the NIC.

As far as I can tell, there's only one system that actually uses this approach, and that's ScyllaDB. ScyllaDB is a C++ reimplementation of Cassandra: they re-implemented the Cassandra wire protocol and built their own system that looks a lot like it, but everything runs in C++, and they have this framework called Seastar that uses DPDK to access the NIC directly.

So the question is, how portable is DPDK? As far as I know, it runs fully featured on Linux, and there's a partial implementation for FreeBSD; I don't know if there's an equivalent for Windows. The other question I don't have the answer to is whether you can use DPDK in a virtualized environment, like on EC2. Think about it: if you get an instance on EC2, you're running multiple VMs on the same box, and I don't know how the kernel bypass method works for that. I asked the Intel guy and he didn't know. I suppose we could try, but it runs on Linux, what else would you want? It also only runs on certain NICs, because it's sort of like ODBC: you have to have a DPDK driver for your NIC that supports the library. So think of DPDK as like ODBC, but going at the bare-metal hardware. The ScyllaDB guys have used their DPDK network stack for ScyllaDB, and they've also built a memcached version of it, and they show it being something like 2x faster. And 2x faster just by avoiding the OS is quite significant.

Yes? The question is, is this still reliable transmission like TCP? With DPDK, I think so, yes, it's still reliable, because I don't think it sits at the lowest layer; the retransmission happens before you get to all this other stuff. All right, the other kernel bypass method is RDMA.
So with RDMA, basically, say you have a distributed system and you want to read and write memory on another machine. You still have to go over the network to make that happen, but when the request lands on the other side, the server, the NIC knows how to go into memory and read or write the value you want without having to send the packet up into the operating system and have the operating system do the write. The hardware can do the read or write to memory while avoiding the kernel entirely. To make this work, you need to know on the client side what address you're reading from or writing to in order to make that request on the server side. The other thing is that the memory access is completely transparent to the machine being accessed: if I do a read or write into its memory from the client side, there's no notification, no callback, no event triggered on the server side to say, hey, by the way, someone wrote to you at this memory location. So the tricky thing is that you now need to manage this in your system; in the same way we handled different transactions updating the same tuple at the same time without coordinating with each other, we essentially need to do the same thing here. And for RDMA I don't know at what granularity the writes can be atomic.

Question? The question is, how would you actually use this in a database system? On the client side, you basically need to know what the address space is. Only the address still has to be transferred through the network layer? Correct, yeah. There will be a library, an RDMA library, where you make calls like read and write and pass the memory address, and that's what gets sent over the network. The hardware has to support those commands and know, I need to jump to this address and read stuff. Does it have specific hardware requirements? Yes, your hardware has to support this. There's this thing called InfiniBand, sold by Mellanox, and it's very expensive; you have to have Mellanox NICs on both sides and a router that can handle this.

The next question is related to Tianyi's earlier question: how do you guarantee reliability if you're getting rid of the TCP/IP layer entirely? I think the low-level networking substrate takes care of that for you. Does that mean you have to take care of it yourself? No, I think the protocol and the hardware take care of it for you; you have to run specialized hardware for RDMA to work. RDMA over Ethernet does exist now, I think, but traditionally with InfiniBand you had to buy the hardware. And his statement is, and he's right, that raw DPDK does not support TCP: if you want reliable transmission there, you have to do it yourself. RDMA might be the same thing, I don't know.
As far as I can tell, there's no major commercial database system that supports RDMA. Oracle might do this for RAC and Exadata, those really expensive million-dollar machines, but if you download MySQL, it's not going to support this. Microsoft has a really interesting project called FaRM, where they basically showed how to do transaction processing entirely over RDMA, and I think you end up having to do something like a four-phase commit, because again, you don't know when other transactions read and write data that's on your machine, so there are a bunch of extra steps you have to take to make sure everything runs correctly.

Okay, so just to finish up: I think the networking protocol is an often overlooked performance bottleneck. Think about the Looking Glass paper I had you read early in the semester: they talked about the buffer pool, they talked about logging, they talked about transactions, but they completely ignored networking, and they completely ignored what it takes to actually write data out to disk. I think those are actually major bottlenecks worth looking at. And for the kernel bypass methods for networking, you get vastly improved performance, but it requires more bookkeeping and more work by us, the database developers, inside our system to use these techniques and optimizations. I also suspect that, at least in the short term, things like DPDK and RDMA are going to be mostly useful for internal communication between database nodes in a distributed environment; meaning you wouldn't write a new client driver for Postgres that uses DPDK, that would be handled on the server side.

All right, any questions? In the back. For the serialization part, there are actually many parameters, like whether compression is on and the amount of data being sent; is there any existing implementation that's adaptive? So the question is: as the paper shows, there's no one-size-fits-all answer for these different decisions in the design space, so is there any network protocol implementation in a database system that's adaptive, that can recognize, oh, my data looks like this for this result so I want to use compression, and my data looks like that so I want to do something else? I don't think anybody does that; I think everyone just picks one and runs with it. But you can absolutely imagine what you said: I know my data looks like this, so maybe I pass along a little flag in the header that says I'm encoding it this way, and the client can handle that. Though think about what it would take to actually support that: every single client driver would have to support every possible way you can encode things. That's why I don't think anybody implements something that complicated; you pick one and just use it.

Okay. Project three. All right, so project two is due next week during spring break, but everyone should start thinking about project three. This is going to be a group project, and you'll probably just use the same group you had for project two. You're going to implement some substantial feature, component, or concept in our database system.
The goal is for you to incorporate the topics we've already discussed this semester, or will be discussing later, into your project, along with whatever your own interests are. So if you're really interested in networking, because that's what you do for your own research or that's what you like, then you should probably pick a project that does networking stuff. I like it when students come along with ideas I haven't thought of, from outside the context of databases, that can help us improve the database. As I said, there will be a sign-up sheet on the Google Docs spreadsheet I send out later tonight; just list the members of your group, and make sure you pick a project topic that's different from everyone else's. You can also go back and look at what was done in the last two years I've taught this course, and you should not pick anything those groups did unless I say, yes, go ahead and do that. You should not pick something that's been done before, even if that code never got merged into the system, because not all the projects make it into the full system.

So what's expected for project three? You have to do a proposal, and that'll be due on the Monday we come back from spring break; you and your group come up here and give a five-minute presentation about what you want to do, and we'll talk about what should be in that proposal. I guarantee that for these talks everyone will have Mac laptops, and the Mac laptops never work with the projector; you'll see that every single time. Then you'll do a project update as we get closer to the end of the semester. Also during the semester you'll do code reviews with each other, and I'll explain what that is in a second. Then, on whatever date the university schedules for our final, we'll have the final presentations; we'll have pizza and soda, and you'll come give a demo of what you actually did. And you don't get a grade until you actually do the code drop: you submit a PR on GitHub and it has to merge cleanly into the master branch without any conflicts, okay?

All right, so the proposal. We have about 12 or 13 groups in the class, so everyone gets five minutes. You come up here and say, this is what I want to do. Then you discuss things like how you're actually going to do it, meaning what files you think you need to modify or add, how you're going to test whether your implementation is correct, and how you're going to evaluate your work to see whether it actually made sense. This is why I had you write in your synopses for the readings what workloads the papers used to test their systems, so now you know what benchmarks exist. We actually have most benchmarks implemented and ready for you to run; not all of them work, but we can fix that over time. Then the status update is a few weeks before the end of the semester: you give another presentation to the class and say, here's how far we've gotten, here's what we had to change in our proposal because this feature didn't exist or this thing was broken.
And then anything that surprised you during the process, like, oh my gosh, this one piece of the code was amazing and we could totally use it, and another piece was terrible and we had to avoid it or rewrite it.

A big part of the course grade for project three will be your code reviews. What will happen is that every group gets paired up with another group, you submit PRs to each other, and you use GitHub to do a code review. The idea is to get feedback on what you're doing, as well as to understand what other people have done in the system. This will be a big part of your career working at software companies: it's not just writing code and throwing it over the wall for someone else to pick up and polish. You want to learn how to write high-quality code, and that matters in a database system because we want things to be correct and functional. There will be two rounds of reviews. Last year people split this up, and I don't want you to do that: I don't want one group member doing the first round and another doing the second round. Everyone should be involved in both code reviews, and Prashant and I will look through them and provide feedback, like, this is a good idea, this was good, this is bad, what about this, what about that. And it's not me forcing you to do this just because I want you to; I want you to learn how to do it, because reading other people's code, making sense of it, and writing meaningful, helpful code reviews is going to come up throughout the rest of your career.

All right, and then for the final presentation: this will be ten minutes during the scheduled final exam slot, and there will be food and prizes. All the database companies send me their t-shirts, so everyone will get a database t-shirt at the end of the semester; pick your favorite company. Ideally, if you have a demo of your work, that would be a big deal. Last year some students did UDFs and showed how they could run UDFs; there are also always performance numbers for the things people implement. So having a demo on the last day, or even for the project update, would be really cool and gets people excited. These won't be recorded and won't be on YouTube, so don't worry about embarrassing yourself, okay?

At the end you have the code drop, which means you've addressed all the issues and concerns from your code reviews and you have test cases that prove your thing is actually correct. When you open a PR, we have Coveralls set up to compute the code coverage of your tests, so if you write a lot of code and the coverage goes down, that's bad; coverage should always be stable or going up. You also need to provide documentation of what you did in a separate markdown file. The order in which we merge these PRs will be random, because some PRs may conflict with others; to make it fair for everyone we just pick the order at random, and whoever comes before you, you have to merge their conflicts into your PR before you get a final grade, okay? So, any questions about what's expected of you for the project itself? Yes? In addition to the source code and the documentation, is there any kind of report to write? That's the markdown file, the documentation.
Like, what did you do? And how long should that be? I'm not looking for pages and pages; we actually have a template for you to use already that shows how to document what you did. And in some ways, the final presentation is the report. Question? Is there an actual hard deadline by which the code has to be in? Yeah, it's whatever it says on the website; it's a couple of days before I have to turn in grades. Some of you are graduating, and without grades you can't graduate, so we need time to make sure we get everything done. It's whatever it says on the website.

Okay, so I want to talk about some project topics you can look into. It's a fairly large list; I'll post it on the website, and I'm happy to discuss anything in more detail. We'll just go through all of these, okay?

All right, so the first one is to work on our query optimizer. I had a plucky master's student last year write a brand new query optimizer from scratch, and it's very impressive. He wrote one based on the Cascades model, which you'll read about after spring break: it's a way to organize and apply transformations in your query optimizer to generate an optimal query plan, essentially going from the SQL plan to the actual plan the database can execute. The Cascades model, although it's from the 90s, is sort of the state of the art for doing this; it's actually what SQL Server does, and you'll end up reading a paper showing that SQL Server has one of the best optimizers out there. So we have one of these as well. The idea here is to expand the optimizer we have: we want to support things like outer joins, expression rewriting, for example taking a BETWEEN clause and rewriting it into simpler predicates, and nested queries. And I'll say, if you work on this project, and I highly encourage you to, one, we have students around who can help you get started: Gus has been working on this, and Boway worked on it last semester and is still here. And two, you're going to have to send me your CV, because companies are banging on my door to hire people who know how to work on optimizers. I'm dead serious. Here's one email I got in October from a very famous database person who has a database company: he's asking whether I know any query optimizer folks in the San Francisco Bay Area who might be a bit restless, because he's trying to hire a query optimizer person. And another guy sent me a more vitriolic, profane email that says, I hate you with all my passion, but before you die, I need to hire somebody who knows query optimization. So again, you will have no problem finding a job if you work on query optimizers. I encourage you to consider this, but it's not easy; it's a complicated thing. But there are enough people around here that we can help you.

All right, the next thing we want to support is schema changes: think of add column, drop column, alter table, those kinds of commands. We don't have these, and we want them.
All right, the next thing we want to support is schema changes. Think of commands like ADD COLUMN, DROP COLUMN, ALTER TABLE; we don't have these, and we want them. For this, the idea would be to add support in the SQL parser and the planner to take an ALTER TABLE command and actually do something with it. You'd start easy, like changing a column name; that's just changing the catalog, that's easy. But adding and dropping columns and changing a column's type are more complicated, and we want to add support for those as well. And I would say there's potentially a research paper in this, if you're interested in research: we want to see how we can do this efficiently using multi-versioned catalogs. The idea there is that I could add a column and not have to update all my existing tuples the way MySQL and Postgres do; I can do it in the background or lazily, and just know that I have an old version of a tuple that corresponds to an old version of the catalog and convert it on the fly as I need it. But again, the very first steps are to add support in the SQL parser and the planner and then support changing a column name; those should be the first two things, and they're pretty easy to do.

Related to this, we also want to be able to support adding and dropping indexes. We can drop indexes now; that's easy, you just remove it from the catalog, it disappears, and you clean it up later in the garbage collector. But we cannot add an index transactionally, meaning if we have a million tuples and we start adding an index, but we then modify the table while we're building the index, we're going to miss those changes. So the idea is that we want to do this correctly. I think I talked about this before when we covered the catalogs: the way to implement this safely is to prop up a little delta storage area that absorbs all the changes made to the table while you're building your index, and then you go back and reconcile those changes by locking the table and updating the index; there's a rough sketch of this pattern after this paragraph. What would be really tricky and really cool is if you could also add support for building the index in parallel: say I want to use one thread, two threads, four threads, or any arbitrary number of threads, have them split up the work, do the scans, and build the index. So that's one project.
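Here is a rough sketch of that delta-storage pattern for building an index without blocking writers. All of the types (Table, Index, DeltaBuffer) are hypothetical stand-ins rather than Peloton's real classes, and it ignores deletes, updates, and duplicate handling so the shape of the idea stays visible.

```cpp
// Rough sketch of building an index without blocking writers: scan the table,
// buffer concurrent changes into a side "delta" log, then take a short lock
// to replay the delta. All types here are hypothetical stand-ins.
#include <mutex>
#include <vector>

struct Tuple { int key = 0; /* ... */ };

struct DeltaBuffer {
  std::mutex latch;
  std::vector<Tuple> inserts;  // changes made while the build is in progress
  void Record(const Tuple &t) {
    std::lock_guard<std::mutex> guard(latch);
    inserts.push_back(t);
  }
};

struct Index {
  void Insert(const Tuple & /*t*/) { /* add key -> tuple location */ }
};

struct Table {
  std::vector<Tuple> tuples;
  DeltaBuffer *build_delta = nullptr;  // writers call Record() here during a build
  std::mutex table_lock;
};

void BuildIndex(Table &table, Index &index) {
  DeltaBuffer delta;
  table.build_delta = &delta;          // 1. start capturing concurrent inserts

  for (const Tuple &t : table.tuples)  // 2. long-running scan, no table lock held
    index.Insert(t);

  {                                    // 3. brief lock to reconcile the delta
    std::lock_guard<std::mutex> guard(table.table_lock);
    std::lock_guard<std::mutex> dguard(delta.latch);
    for (const Tuple &t : delta.inserts)
      index.Insert(t);                 // real code must de-duplicate vs. the scan
    table.build_delta = nullptr;       // future inserts go straight to the index
  }
}
```

One way to extend this to a parallel build would be to split the scan in step 2 across worker threads and keep the short reconciliation step at the end unchanged.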
I'm also very interested in pursuing the Cicada model of storing our indexes as data tables themselves. Right now we have the Bw-Tree, and everything in it is stored in the heap; the skip list you guys are building, that's all stored in the heap too. And as you've seen, you're doing all this extra work to handle garbage collection and concurrency in your skip list because it's disconnected from the regular tables, which already have these features. So the idea here is: can we do what Cicada does and store the index nodes directly inside a data table? We can get a basic B+Tree from the HyPer guys in Germany, and the idea is that we take that, implement it in our system, and see how it works. There's a bunch of stuff you'd have to change: you need the index factory to support constructing an index that is backed by a table, and you need to add support for fixed-length binary attributes that are inlined in the table itself, because otherwise you're going through the varlen pool, which is yet another indirection layer. So there are a bunch of things you'd have to fix in the higher-level system to make this work, but the core piece would be taking the B+Tree and putting it directly inside a table. And this is another one where I'm very interested in a possible research paper, so if you can be around for a bit longer than a semester, this is something we can talk about.

The next one is sequences, or auto-increment keys. A sequence is a global counter that's stored in the catalog; you say "get me the next ID," it adds one to the counter, and it hands the value back. And as I talked about before, you have to be careful with this because it doesn't fall under the same transactional protections as regular tuples do when you modify them: you want two transactions that are running at the same time to both be able to update this counter without colliding with each other and without one aborting just because the other wrote to it right before. So to do this, we're going to add a special case in the transaction manager so it can recognize, "oh, this is a sequence, not a regular data table tuple," and allow multiple transactions to modify it. And then, once we have the new write-ahead log manager that we're implementing in place, we make sure this update gets logged: before any transaction that has read a value from the sequence is allowed to commit, we make sure the sequence update has been written out. So you'd want to add support for the nextval function, which we can do now because we store pointers to functions in the catalog, and then ideally also add support for the SERIAL attribute type, which is just syntactic sugar for defining a sequence on a table column. Again, this is another one where I think there could be a paper; I can't guarantee that, but it would be interesting to see, once you have the basic sequence implementation, what we can do to speed it up, make it run faster, and compare against what other systems do. There's a minimal sketch of a sequence counter after this paragraph.
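As a rough illustration of why the sequence counter sits outside normal tuple versioning, here is a minimal sketch of a nextval-style counter. The class and method names are made up for the example; the real version would be wired into the transaction manager and the write-ahead log as described above.

```cpp
// Minimal sketch of a sequence counter. Unlike a normal tuple update, calls to
// Next() are never rolled back on abort, so two concurrent transactions can
// both draw values without conflicting. Names here are hypothetical.
#include <atomic>
#include <cstdint>

class Sequence {
 public:
  explicit Sequence(int64_t start = 1) : next_(start) {}

  // Hand out the next value. fetch_add makes concurrent callers safe without
  // any locking, and the value stays "used up" even if the caller aborts,
  // which is why sequences can have gaps.
  int64_t Next() { return next_.fetch_add(1, std::memory_order_relaxed); }

  // The next unissued value: a durable log record covering at least this value
  // also covers every value handed out so far, so a transaction that read from
  // the sequence must not commit before such a record is written out.
  int64_t HighWaterMark() const { return next_.load(std::memory_order_relaxed); }

 private:
  std::atomic<int64_t> next_;
};
```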
Next, views. A view, again, is basically a virtual table: you can have a regular view that's always computed on the fly, or you can have a materialized view that's incrementally updated. So one interesting project might be to pursue either one of these. For materialized views, we can rely on the fact that we have basic support for triggers, so that when you update a table you can force an update to the materialized view. The dumbest thing to do for a materialized view is to rerun the query and cache the result every single time you update the table, but there are probably smarter things you can do.

The next thing we want to handle is pre-compiled queries for our LLVM engine. Right now, any time you create a table and execute a query on it, the first time the system sees that query it has to compile it and then run it. But there are obviously some queries that we know we're probably going to run, like insert queries, delete queries, or basic SELECT * queries, that we could pre-compile ahead of time, cache in our catalog or statement cache, and then when the query comes along we don't have to do that expensive compilation every single time. Where this actually matters a lot is the catalog, because the catalog is basically running the same queries over and over again: give me the row that has this table name. So we want to pre-compute these things when we boot the system up and store them. For this you would have to touch the LLVM code, the catalog code, and possibly the query planner as well. And one cool thing to consider: instead of compiling the entire query, can we compile bits and pieces of it and stitch them together on the fly? Maybe I only compile the iterator or the WHERE clause of a SELECT statement, and then I can cache that and use it as a drop-in for any select statement that comes after it.

Next is tile group compaction. Right now we never free memory, and it's a big problem. If I insert a million tuples, I have to allocate all that memory to store them, there's no way to get around that; but if I then delete a million tuples, Peloton never gives the memory back. So what we want to do is compact our tile groups and free up space. The easiest thing to do is: if you notice you have an empty tile group and there are already ten or more empty tile groups, it's okay to go ahead and just free that memory entirely. What would be really tricky and really cool is that if you notice you have two tile groups that are each half full, you can combine them into a single tile group and then free up one of them. You can do this because we can mark tile groups as immutable, meaning we won't ever insert new values into them, we'll only delete values or delete old versions. So when half of the versions are deleted in one tile group and half are deleted in another, we can go ahead and compact them. For this you'd have to implement a new background compaction thread in our system, which we currently don't have.

All right, multi-threaded queries. Prashant is currently working on adding support for basic multi-threaded queries. This is something we'll cover after spring break, but right now, when a single query shows up, one thread executes it from beginning to end. What we can do instead, when we know we have to touch a lot of data, is take that one query, split it up into subtasks, have them run on separate threads, and then combine the results at the end. For this you'd be working with Prashant to expand his support for intra-query parallelism with multiple threads; we want to add support for index scans, mark joins, and other things in our engine. And if you really want to go buck wild, we'll see this in the morsels paper from HyPer, one thing that would be really cool is to have our threads run tasks on data that's local to them. So if you know your table is split up between two sockets, all the threads that run on one socket access the data that's on that socket, all the threads on the other socket access the data that's local to them, and you never have to go over the bus between the two sockets. Again, this is something we'll cover more after the break.

We've already talked about data compression this semester. Last year we had students implement delta encoding, so it would be interesting to see whether we can maybe do dictionary encoding (there's a toy sketch of dictionary encoding after this paragraph). We didn't end up actually putting the delta encoding stuff in place because it was too hacky and had other issues, but it would also be interesting to see whether we can support true delta encoding this time.
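To give a sense of what dictionary encoding would involve, here is a toy sketch of a dictionary-encoded string column; the class is purely illustrative, not Peloton's actual compression code.

```cpp
// Toy sketch of dictionary encoding for a string column: store each distinct
// value once and keep a small integer code per row. Names are illustrative.
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

class DictionaryColumn {
 public:
  // Append a value, assigning a new code the first time we see it.
  void Append(const std::string &value) {
    auto it = dictionary_.find(value);
    uint32_t code;
    if (it == dictionary_.end()) {
      code = static_cast<uint32_t>(values_.size());
      dictionary_.emplace(value, code);
      values_.push_back(value);
    } else {
      code = it->second;
    }
    codes_.push_back(code);
  }

  // Decode row i back to its original string.
  const std::string &Get(size_t i) const { return values_[codes_[i]]; }

  // Note: predicates like "col = 'xyz'" can be evaluated on the codes
  // directly, comparing one small integer per row instead of one string.

 private:
  std::unordered_map<std::string, uint32_t> dictionary_;  // value -> code
  std::vector<std::string> values_;                        // code -> value
  std::vector<uint32_t> codes_;                            // one code per row
};
```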
Another project I'm interested in is supporting temporary tables. In SQL you can do CREATE TEMPORARY TABLE; it's basically an ephemeral table that gets blown away, deleted, as soon as the client disconnects. It ends up having to go in the catalog, because you have to treat it like a regular table and the query planner needs to know about it, but then when the client disconnects you want to throw it away. So for this you'd have to add support in the catalog, the binder, and the planner to recognize this and then generate the temporary table on the fly as needed. I don't know how difficult this would be, but I suspect there are a bunch of changes you'd have to make in the upper levels of the system to make it work.

Next, we're also interested in adding support for the enum type. An enum, like in C++ or Java, lets you define labels for some fixed values, and they just get mapped to integers. For this you'd have to add support in the catalog to handle the enum type and add support in the SQL parser and the planner to handle it. A student here in the front has been working on arrays, and you'd store the enum labels as an array in the catalog, so this would rely on his code. You'd also have to add support in the LLVM engine's expression evaluation so it can handle enums.

And the last one is alternative networking protocols. Right now, again, we speak the Postgres wire protocol, but we've tried to architect our networking layer so that different wire protocols can be supported natively; they all hit the traffic cop layer, which is the standard, uniform API that those protocol handlers talk to. So I've been interested in adding support for Kafka or Memcache, so that you could write Memcache commands that interact with the database system, and they just get converted into SQL statements that operate directly on the database. For this you'd need to overhaul the client communication handler code; we've already sort of done that, though I don't know how well it actually works. And then for Memcache, you'd basically map the get and set commands onto prepared statements; there's a toy sketch of that translation after this paragraph.
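As a very small sketch of the Memcache-to-SQL idea, here is a toy translator. The kv_store table name, the command subset, and the SQL strings are all assumptions for illustration; a real protocol handler would bind parameters to prepared statements instead of pasting strings together.

```cpp
// Toy sketch of translating Memcache-style text commands into SQL that the
// traffic cop layer could execute. Table name and SQL dialect are assumptions.
#include <sstream>
#include <string>

// Translate "get <key>" / "set <key> <value>" / "delete <key>" into SQL.
// (The real memcache 'set' command carries flags/expiry and sends the value on
// a separate line; this is simplified. A real handler would also use prepared
// statements with bound parameters rather than string concatenation.)
std::string TranslateMemcacheCommand(const std::string &line) {
  std::istringstream in(line);
  std::string cmd, key, value;
  in >> cmd >> key;

  if (cmd == "get") {
    return "SELECT value FROM kv_store WHERE key = '" + key + "';";
  }
  if (cmd == "set") {
    in >> value;
    // Emulate memcache set as an upsert (Postgres-style ON CONFLICT syntax;
    // what the engine actually supports may require a separate UPDATE/INSERT).
    return "INSERT INTO kv_store (key, value) VALUES ('" + key + "', '" +
           value + "') ON CONFLICT (key) DO UPDATE SET value = '" + value +
           "';";
  }
  if (cmd == "delete") {
    return "DELETE FROM kv_store WHERE key = '" + key + "';";
  }
  return "";  // unsupported command
}
```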
All right, so again, I'm going through these very quickly; I'll post this online, and you can send emails if you have questions. The idea is just to show you what I expect as the scope of a project. Some are obviously harder than others, so if you have any questions like "is this really hard?" or "is this too easy?", send an email and we can talk about it. Our goal is also to expand the SQL-based regression suite that Chen Yu wrote so that we can make sure that whatever you implement doesn't break any of our high-level functionality. Easier said than done; this is still a work in progress, but my goal is to have something running before you guys actually start making serious progress on your projects. And then everyone is going to have to write their own test cases, in Java or SQL, to make sure that their implementation works correctly.

For computing resources, you should use the same MemSQL machines you used in the class projects. They should have an adequate amount of DRAM and CPU cores for you to test your implementation. If you need special hardware, please let me know and we can see what we can do about getting it for you. For example, one year the students were implementing parallel logging and they needed a bunch of SSDs, and one of these machines has three or four SSDs you can use. So if you think you need something, let me know and we'll get it for you, okay?

For testing your implementation, we have a benchmarking framework called OLTP-Bench that already has a bunch of built-in benchmark implementations, all described in the paper you guys read, that you can use. Now, unfortunately, a bunch of these are broken. I know that YCSB works and TPC-H works, and we're fixing TPC-C and TATP now, but these are what you can use for testing your projects. We already have scripts in place where you set a few parameters to say where your server is, and it'll run the complete benchmark for you, right? So no one should be implementing TPC-C from scratch. I know there's a micro-benchmark in the code that you could use; do not use that. Everyone should be going through the entire SQL stack using OLTP-Bench, okay? I did not update the date on this slide; it's wrong, ignore it. But again, everyone should be doing a five-minute presentation after we come back from break. And I should be around next week, so send me an email if you want to meet on campus or discuss things over Hangouts or Skype, okay? Any questions about project three?

All right, so I'm dead serious: there's a midterm on Wednesday. One year a student thought I was joking; I make jokes all the time in class, and he said, "oh, I thought you were joking about the midterm," and he didn't show up. There really was a midterm that Wednesday. So please come, okay? And then after spring break you'll do your project three proposals, and project two, again, is due on Monday next week. All right, any questions about the midterm? All right, ignore this slide about constraints; that's old stuff. I will have office hours on Wednesday, right after the midterm, but if you want to meet before then, send me an email and we can discuss, okay?