Okay, so what's missing here on the slides is that I'll send an announcement on Piazza. Given that people are struggling to finish the second project, I bumped a bunch of the other deadlines to give you more time for the later stuff. So homework four is now due about a week after it was originally due, and the same goes for projects three and four, okay? I'll post this on Piazza. For the database talks coming in the next two weeks: this Thursday the folks from BlazingDB down in Texas are coming up to give a tech talk, and then Brytlyt, out of London or Norway I think, is coming on November 1st to give a talk. So again, I'll send an announcement about that on Piazza.

All right, so today we're talking about parallel execution. Everything we've talked about so far this semester has mostly focused on single-threaded operation. We did talk about doing concurrent operations on the B+tree, but now we want to talk about how we're actually going to execute queries in parallel and what kinds of optimizations we can do with this. It sort of goes without saying why we want to run things in parallel: we get performance benefits. That's the key thing. We'll be able to run queries more efficiently because we can break them up across multiple cores and multiple storage devices, and we'll get lower latency and better response times. The system ends up looking more responsive. And we'll see this when we talk about transactions: if we have multiple threads updating the database and one of them needs a page that isn't in the buffer pool, it has to stall and go out to disk, and we want to let other things keep running and still make forward progress. So this is the state of the world we live in now: Moore's law is over, our CPUs are getting more cores, we're getting more storage devices, cloud storage is cheap. We want to take advantage of all these things inside our database system.

For this lecture we're going to focus on parallel databases. We'll have two lectures at the end of the semester on distributed databases, and I'll cover the difference between the two. But the high-level way to think about parallel and distributed databases is that, to the application programmer, the person writing SQL and submitting it to the system, a parallel database or a distributed database is going to look like a single database instance. You open up a terminal and start writing queries; you don't know that the database is split up across multiple cores or multiple machines. It all looks the same to you. And this is the beauty of SQL as a declarative language: you just say, hey, run this query, and whether it's a single node or 1,000 nodes, the database system can take that query and figure out the best way to execute it for you. So that's the beauty of SQL, and we want to take advantage of it inside our system.

So the distinction I like to make between parallel and distributed databases is this: a parallel database is one where the nodes, and I'll use the term node loosely (a node could be a CPU socket or a whole separate machine), are physically close to each other.
But the idea is that these machines are close to each other and can communicate very efficiently. And because we're assuming they're close to each other, we don't have to worry about unreliable communication. The simplest parallel machine could be a single-socket CPU with two or four cores; that's a parallel database system. A distributed database is typically one where the nodes can be far away from each other. Far away could mean in the same data center, across town, or in a different geographic region. In these environments you're typically connecting over a public network; if you're inside Amazon's data center it's a private network, but it's still not the same as two sockets in the same box. So in a distributed database we're not going to be able to ignore the cost of sending data from one node to the next, and we also can't assume that the communication is reliable. We'll see later in the semester, when we talk about transactions over a distributed database, that this changes how you set up some of your algorithms, because you can't assume that if I send messages to somebody they'll arrive in the order I expect. For today's class we're focused on parallel databases, so we don't have to worry about that problem at all. The question is really: how can we take advantage of additional resources to execute our queries more efficiently?

The two types of parallelism we're going to talk about are inter-query parallelism and intra-query parallelism. Inter-query parallelism is essentially what most people think of in parallel databases: I can run multiple queries at the same time and they'll run on different threads or different cores. We do this to improve throughput and reduce latency; we don't want everyone blocked behind one thread executing queries one after another. Intra-query parallelism is where we take the operations of a single query, think of a single query plan, and execute those operations in parallel. This can either be horizontal parallelism, where we take the same operator and execute it in parallel on multiple threads, or vertical parallelism, where we execute two different operators on two different threads at the same time; we'll see both in a second. This is more useful for analytical or long-running queries where we have to scan a lot of data and crunch over a lot of tuples, because you're usually I/O bound or CPU bound in those environments, and additional parallelism gets you better performance.

All right, so to begin we're going to talk about different process models and ways to design a parallel database system. Then we'll talk about how you execute operators in parallel, and then we'll finish up talking about how we can build our system to get I/O parallelism, okay?

All right, so the database system's process model defines how the system is designed to take advantage of additional resources and execute multiple requests from applications in parallel. Instead of using the terms thread or process or core, I'm going to try to use the word worker, because a worker can really be any of a bunch of different entities executing something in our system.
So the idea is that we're going to have some worker in our database system that's responsible for executing tasks for our queries, and I'll define what tasks are in a second. And we want our workers to be able to run in parallel. The three ways you can architect a database system are: a single OS process per database worker, a process pool where you dispatch work to those workers, or a single thread per database worker. There are trade-offs for all of these.

So the first one is process per worker, and the idea is that a single OS process corresponds to one worker. Typically it's a single thread in that process, and we rely on the operating system scheduler to decide which process runs next or which one has higher priority over another. So say this is your application: you have something in the front called a dispatcher or coordinator; Postgres calls this the postmaster. The application sends a request to the dispatcher and says, hey, I want to execute this query, tell me who's going to do it for me. The dispatcher hands the request off to a free worker, and then that worker comes back and tells the application, hey, I got your request, here's my socket if you want to talk to me, here's where you can send all the requests you want me to execute. And now it's the worker's responsibility to interact with the database, execute your query, and send you back results. This was the approach most systems used back in the old days, the '70s and '80s. This is what IBM DB2 does, what Oracle does, what Ingres does, what Postgres does. Take a guess why this was the most common approach back then. He says the CPU only had one core; but think about why you would want to do this instead of using threads. Think of the 1980s. Was Linux around? No. Was POSIX as prevalent as it is now? No. So there were no pthreads; there wasn't a standard threading library. All these different versions of Unix had their own threading packages, so if you wrote a multi-threaded application for one operating system, like BSD, and then wanted to run it on a VAX, you probably couldn't use the same threading package. But the basic concept of a process existed in all these different operating systems. That's why most of these systems from the old days were implemented this way.

An extension of this is to use a process pool. The idea is that you still have a dispatcher in the front, but it just hands things off to whichever worker is free, and that worker goes ahead and executes the request. This is a better implementation of the one-worker-per-process model. It's used by IBM DB2, and as of 2015 the latest versions of Postgres use this sort of model as well: one worker can communicate with another worker to hand off work.

The most common approach used today, and if you're building a new database from scratch this is probably what you would use, is to have a single process for your database system with multiple threads. Internally you know how to dispatch work to those threads, and you do the scheduling and the other things that the process-based approaches would have to rely on the operating system for.
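To make the thread-per-worker idea concrete, here's a minimal C++ sketch of a worker pool inside a single process. The names (WorkerPool, QueryTask, Submit) are made up for illustration and aren't from any particular system; the point is just that a fixed set of threads pulls query tasks off a shared queue that lives in one shared address space.

```cpp
// Minimal sketch of a thread-per-worker model (hypothetical names, not a real system).
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

using QueryTask = std::function<void()>;  // e.g., "execute this plan fragment"

class WorkerPool {
 public:
  explicit WorkerPool(size_t num_workers) {
    for (size_t i = 0; i < num_workers; i++) {
      workers_.emplace_back([this] { RunLoop(); });
    }
  }

  ~WorkerPool() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      shutdown_ = true;
    }
    cv_.notify_all();
    for (auto &w : workers_) { w.join(); }
  }

  // The dispatcher calls this to hand a query (or sub-task) to any free worker.
  void Submit(QueryTask task) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      tasks_.push(std::move(task));
    }
    cv_.notify_one();
  }

 private:
  void RunLoop() {
    for (;;) {
      QueryTask task;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return shutdown_ || !tasks_.empty(); });
        if (shutdown_ && tasks_.empty()) { return; }
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task();  // run the query task; sending results back to the client happens elsewhere
    }
  }

  std::vector<std::thread> workers_;
  std::queue<QueryTask> tasks_;
  std::mutex mutex_;
  std::condition_variable cv_;
  bool shutdown_{false};
};
```

Notice that handing work to a worker is just a function call on a shared in-memory queue, not inter-process communication.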
And everything's in the same address space, so you don't have to worry about using shared memory or inter-process communication to talk between processes; one thread can just write to memory and any other thread can read it. As I said, this is the most common approach. Pretty much every new database system written in the last ten years uses it, with the exception of anything based on Postgres, because Postgres is based on the process model. There are a lot of systems that took Postgres, hacked it up, and put out an optimized version of it, and they inherit the process model from Postgres. When we first started building our own system here at CMU, we did what everyone else does: we started with Postgres, ripped out the bottom half and kept the top half. But to get away from the process model, we actually ended up rewriting Postgres to use multiple threads, the threading model. And I've seen posts on the Postgres mailing list where people discuss whether it's worth looking at switching to a multi-threaded model, because a context switch inside a multi-threaded application is much cheaper than switching across different processes.

So again, in my opinion the multi-threaded approach is the better way to do this. You have less overhead from context switches and you don't have to worry about using shared memory. The obvious downside is that if you have a rogue thread that fucks things up and crashes, it takes down the whole system, whereas in the process-per-worker approach, if one worker crashes, only that process is killed and everything else can stay up.

The other thing I would point out is that just because you use the thread-per-worker model doesn't mean you automatically get intra-query parallelism. MySQL uses multiple threads, but at least as of 5.7 it can still only execute one query with one thread; it can't take a query and break it up across multiple threads. Postgres actually uses the process model, and it now supports intra-query parallelism.

There's a bunch of other stuff we have to deal with, as you can imagine, in a multi-threaded or multi-process environment in terms of scheduling: who to hand off tasks to, how many resources those tasks should get, which CPU cores we assign workers to. We're not going to cover these things in this class; we cover them in the advanced class. But the way to think about it is that the same kind of scheduling an operating system does for its processes and threads, we want to do inside our database system, because we know what the threads are actually doing, we know what data they're touching, and we can be smarter about how we organize things. Again, the database system always knows more. Yes?

So the question is: for the process model, what is the advantage of the worker pool versus process per worker? The advantage is that with process per worker, when a query shows up at a worker, that worker is the only one that can execute it.
With the worker pool, you could have free workers: the query shows up at one worker, but it can say, all right, I know these other workers are free, so let me hand off a piece of my work to them, they'll execute it and then send the result back to me. So one is one worker equals one query, and the other is one query that can be spread across multiple workers. And obviously you can do the same thing in the threading approach.

Okay, so as I said, the two types of parallelism we can have are inter-query parallelism and intra-query parallelism. Inter-query parallelism just means that multiple queries run at the same time in our system. If everything is read-only in this environment, this is super easy to do, because we don't have to worry about the workers coordinating with each other: this worker can run its query, I can run my query, and we don't have to worry about them modifying anything. It's when you have queries updating the database at the same time that the trouble starts, and that's where you need something like a concurrency control protocol, which we'll discuss next week, to make sure that each worker only reads the data it's supposed to be reading. We potentially don't want one worker to read data from another worker before that worker has actually saved anything. So the idea is that a concurrency control protocol gives each worker the illusion that it's running on the database by itself, even though it really isn't. Again, this is really hard; it's the thing that excites me the most in databases, and we'll discuss more of it next week.

For intra-query parallelism, the idea again is: for a single query, can we break its work up into subtasks and run those in parallel? The two approaches are intra-operator parallelism and inter-operator parallelism, and these techniques are not mutually exclusive: just because you're using the first approach doesn't mean you can't also use the second, though most systems actually only use the first. And for all the algorithms we've talked about so far, joins, sorting, scans, there are parallel versions of all of them. Think about the hash join: instead of having one thread scan the table and build a hash table, you could have different threads scanning at the same time and building the same hash table together. Of course, now you need latches to protect them from each other. So again, we'll go through these two approaches one by one.

So intra-operator parallelism is also sometimes called horizontal parallelism, and the idea is that for a single high-level logical operator in our query plan, like a scan of table A, we're going to instantiate multiple instances of that operator and execute them in parallel. Then we introduce a new type of operator called the exchange operator, which acts as a sort of barrier that checks whether all the subtasks for our single operator have completed, and if so, passes the results along to the next operator in the query plan. The exchange operator was introduced in the late 1980s, early 1990s by the same author who did the Volcano work we talked about before with the iterator model, so this is also sometimes called the Volcano model.
This all comes from the same paper, and most parallel databases have something like the exchange operator that they use to combine things together; the approach is used in distributed databases as well.

So let's say we have this query: we want to join A and B, and we have a filter on A and a filter on B. The query plan I'm showing here is the logical view of the plan; we're not saying anything about which algorithms we're using or where the data is actually stored. But when we actually go to execute it, we can instantiate these different operators in parallel with each other. Take the scan on A: we can break it up into three subtasks, A1, A2, A3, each assigned to a single worker, like a single thread, and they scan disjoint subsets of the table and pass the results up to the next operator in the query plan. But we can actually do some pipelining here, because we see that immediately after the scan we're going to apply the filter where A's value is less than 99. So in addition to doing the scan, each worker also feeds its output into a parallel instance of the filter, which then feeds the data further up. And since we know there's a join above, each of these workers can then build the hash table; we build a single global hash table shared by all the threads, and just use latches to protect it.

This then gets fed up into an exchange operator, and again, think of this as just a barrier. It's a physical construct inside the query plan execution that has no corresponding logical operator in relational algebra; it's just something we use internally to synchronize the different threads and say we can't go further up the query plan until we have the results from all our subtasks. Think of it as a counter inside this thing that says, I know I have three subtasks, and every time one of those subtasks completes and tells the exchange operator, hey, I'm done, I don't have any more data to give you, you decrement that counter by one. When the counter hits zero, you can take whatever this thing has collected and feed it up to the next operator.

We can do the same thing for B. Say B is a smaller table, so maybe we only want to give it two cores, two workers, and they do the same thing: scan in parallel, apply the filter, and, say we're doing the grace hash join and want to partition, they partition the tuples into separate buckets. Then we have an exchange operator that says, once all the subtasks are done, we can complete this step. So when this exchange operator is done we have a hash table, when that exchange operator is done we have our buckets, and once these two exchange operators are both done we can actually do the join. And again, we do this join in parallel, because we've split the data up over here, so we can do our probe against the hash table on different threads and produce our output. And the same thing applies above: there's an exchange operator with a counter that says, when my four subtasks complete, then I have my final result.
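To make that concrete, here's a rough C++ sketch of intra-operator parallelism for the build side of this join. Every name here (Tuple, Exchange, BuildWorker) is made up for illustration, and a real system would use a partitioned hash table with finer-grained latching rather than one big mutex. Each worker scans a disjoint chunk of A, applies the filter, inserts into the shared hash table under a latch, and then tells the exchange operator it's done; the exchange just counts completions.

```cpp
// Sketch only: parallel scan + filter + hash-table build, with an exchange
// operator acting as a barrier/counter. Names are hypothetical.
#include <atomic>
#include <cstdint>
#include <mutex>
#include <thread>
#include <unordered_map>
#include <vector>

struct Tuple { int32_t key; int32_t value; };

// The "exchange" operator: waits until all subtasks report completion.
class Exchange {
 public:
  explicit Exchange(int num_subtasks) : remaining_(num_subtasks) {}
  void SubtaskDone() { remaining_.fetch_sub(1); }
  bool AllDone() const { return remaining_.load() == 0; }
 private:
  std::atomic<int> remaining_;
};

// One worker: scan its chunk, apply the filter (value < 99), build into the shared table.
void BuildWorker(const std::vector<Tuple> &chunk,
                 std::unordered_map<int32_t, Tuple> &shared_ht,
                 std::mutex &ht_latch, Exchange &exchange) {
  for (const auto &t : chunk) {
    if (t.value < 99) {                             // filter pushed into the scan
      std::lock_guard<std::mutex> guard(ht_latch);  // latch protects the shared table
      shared_ht.emplace(t.key, t);
    }
  }
  exchange.SubtaskDone();                           // tell the exchange we're finished
}

int main() {
  // Three disjoint chunks of table A, one per worker (toy data).
  std::vector<std::vector<Tuple>> chunks = {{{1, 10}}, {{2, 200}}, {{3, 42}}};
  std::unordered_map<int32_t, Tuple> shared_ht;
  std::mutex ht_latch;
  Exchange exchange(static_cast<int>(chunks.size()));

  std::vector<std::thread> workers;
  for (auto &chunk : chunks) {
    workers.emplace_back(BuildWorker, std::cref(chunk), std::ref(shared_ht),
                         std::ref(ht_latch), std::ref(exchange));
  }
  for (auto &w : workers) { w.join(); }
  // Only once exchange.AllDone() is true may the probe phase above us start.
  return exchange.AllDone() ? 0 : 1;
}
```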
So is this clear? Again, the exchange operator is just a construct inside the execution of this query plan that lets us coalesce the results from these subtasks and workers, so that anything above it doesn't have to know the work was executed in parallel; it just knows it got all the data it needed. Without it, one thread might run faster than another, so I might start executing things higher up in my query plan before I've processed all the data I needed, and I could end up with false negatives or false positives. Yes? Why would we want to partition the input here? Right, so you could split up the input so that all the data one of these probe threads up here has to access falls within a single partition, so you're mostly hitting the same part of the hash table and you get better cache locality. It's like the grace hash join, or a variant of it; it's partitioned.

Okay, so that's intra-operator parallelism, and the idea is that we take an operator we would normally execute on a single thread and break it up across multiple threads. With inter-operator parallelism, or vertical parallelism, the idea is that each operator is assigned to a single worker (it doesn't have to be a single thread, but for our purposes we'll say it is) that's always running: it takes input, does whatever computation it wants on it, and emits the result to the next operator when it's done. Now we can have these different workers running in parallel on these different operators, and I think your textbook calls this pipeline parallelism.

So take our example from before, the same query. We take this join operator, and we don't care what's below it or what's actually feeding data into it; we just assign it to a single worker that iterates over the inner table and the outer table however it gets them, and whenever it has a result it emits it as output to the operator above. Then on another worker we have a thread spinning, doing our projection on any data that comes out of the join. So again, one thread spins down here, another thread spins up here, and whatever arrives as input gets its computation applied and is emitted as output. And for every single operator in my query plan I could assign a different worker.

So what's the obvious problem with this? Yes? Exactly: operator two can only go as fast as operator one. This top worker here is just going to sit and wait until it has data to crunch on. We've assigned a worker to this task but it doesn't have anything to do; hopefully it's not sitting in a busy loop burning cycles, but essentially we've allocated resources that can't be used. So as far as I know, no traditional relational database system does vertical parallelism; everyone does horizontal parallelism. Where you do see this kind of parallelism show up is in what are called stream processing systems or continuous query systems. Think of something like: I have a SQL query, I give it to my system, and it's always running, listening to a stream of updates from some outside source, like Kafka, and any time new data comes in it runs it through the pipeline and processes it.
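Here's a rough C++ sketch of that pipeline (inter-operator) parallelism, assuming a hypothetical bounded queue between operators; the names and the two-stage join-then-project pipeline are just for illustration. Each operator runs on its own thread, pulling from its input queue and pushing to its output queue, which is exactly why the downstream thread sits idle whenever the upstream one has nothing to emit.

```cpp
// Sketch only: two pipelined operators on their own threads, connected by a
// simple blocking queue. All names are hypothetical.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>

template <typename T>
class BlockingQueue {
 public:
  void Push(std::optional<T> item) {            // std::nullopt signals end of stream
    { std::lock_guard<std::mutex> g(m_); q_.push(std::move(item)); }
    cv_.notify_one();
  }
  std::optional<T> Pop() {
    std::unique_lock<std::mutex> g(m_);
    cv_.wait(g, [this] { return !q_.empty(); });
    auto item = std::move(q_.front());
    q_.pop();
    return item;
  }
 private:
  std::queue<std::optional<T>> q_;
  std::mutex m_;
  std::condition_variable cv_;
};

int main() {
  BlockingQueue<int> join_out;     // join operator -> projection operator

  // Worker 1: the "join" operator. Here it just fabricates matching keys.
  std::thread join_worker([&] {
    for (int key = 0; key < 5; key++) { join_out.Push(key); }
    join_out.Push(std::nullopt);   // no more data
  });

  // Worker 2: the "projection" operator. It blocks whenever worker 1 is slow.
  std::thread project_worker([&] {
    while (auto tuple = join_out.Pop()) {
      std::printf("projected tuple with key %d\n", *tuple);
    }
  });

  join_worker.join();
  project_worker.join();
  return 0;
}
```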
You see this approach used in Spark Streaming, NiFi, Kafka, Storm, Flink, and Heron. It doesn't make sense in an ad hoc query system, for exactly the reason he said: the operators above in the tree aren't going to have any work to do, so you're basically wasting resources. But in a continuous query system, a streaming system, there's data coming in all the time, so you're always going to have something to do. Okay, all right.

So everything we've talked about so far has been about how to take our query plan, break it up into subtasks, and run those subtasks on additional workers, the additional computational horsepower available to us. But the obvious problem is that if we ever have to get data from disk, it doesn't matter how many CPU cores we have; the disk is going to be the main bottleneck, because it's way slower than everything else. And in some cases, if we have a lot of threads trying to read from disk at the same time, we actually get even worse performance, because if it's a mechanical drive there's an arm jumping around to different points on the platter, and that's even slower. So we need to figure out how to parallelize access to our data on non-volatile storage and bring it into our buffer pool as quickly as possible. This is what I/O parallelism is attempting to solve.

The idea is that we take our database management system installation, not just the database itself but the installation, and split up the data it's storing across multiple storage devices so that we can issue requests in parallel. There are a bunch of ways to approach this: you can have a single database stored across multiple disks, you can have one database per disk, you can have one relation per disk, or you can split a single relation across multiple disks using partitioning or sharding.

The easiest way to do this, without changing any code in our database system, is multi-disk parallelism, where we just configure the hardware or the operating system to take multiple drives and have them appear as a single logical disk to the database system. Our database system's code doesn't know that when it reads or writes to its disk device there are actually multiple devices that could be servicing the request; it all looks like a single device. You can get this through a storage appliance, a box with multiple drives that handles this for you, or you can do it using RAID. Everyone here should have heard of RAID before, right? Good, okay. So the basic idea of RAID, a redundant array of independent (or inexpensive) disks, is to make a bunch of independent storage devices appear as a single device. The two most common ways to do this are RAID 0, striping, where we have three different devices and six different pages and each device is in charge of storing a subset of those pages. Now when a request shows up from the database system that says give me page one, some controller or the operating system knows which device actually has the data you want and goes and gets it. The alternative is RAID 1, mirroring, where each device has a complete copy of the database, so now when I want to read page one, any one of these three devices can service me.
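Just to illustrate the mapping, here's a tiny C++ sketch of how a controller might route a page request under striping versus mirroring. The function names are made up, and real RAID controllers work at the block level, with parity schemes (RAID 5, 6, and so on) not shown here.

```cpp
// Toy sketch of page-to-device routing for RAID 0 vs. RAID 1. Hypothetical names.
#include <cstdint>
#include <cstdio>

constexpr uint32_t kNumDevices = 3;

// RAID 0 (striping): each page lives on exactly one device.
uint32_t StripeDevice(uint64_t page_id) {
  return static_cast<uint32_t>(page_id % kNumDevices);
}

// RAID 1 (mirroring): every device has every page, so pick any one,
// e.g., rotate across devices to spread the read load.
uint32_t MirrorDevice(uint64_t request_num) {
  return static_cast<uint32_t>(request_num % kNumDevices);
}

int main() {
  for (uint64_t page = 0; page < 6; page++) {
    std::printf("page %llu -> striped device %u, mirrored read from device %u\n",
                static_cast<unsigned long long>(page), StripeDevice(page),
                MirrorDevice(page));
  }
  return 0;
}
```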
So now if one request is reading page one and another is reading page two, one device can handle the first request and another device can handle the second. Again, the key thing here is that this is all transparent to the database system: we don't know anything about the data being stored in different locations, we just know there's one file, one table heap, that we can read and write from.

If we now move into the database system itself and ask what we can do to make things run faster by controlling where the data is actually stored, this is called database partitioning. Remember we talked about the page directory, which maintains information about, for a given page of a table heap, what file or directory that data is actually stored in. This is the level where we can take advantage of different devices: when we go to get a page, we do a lookup to figure out which device actually has it, and then we can multiplex across devices and be smart about spreading data that's often used together over different devices so we're not saturating a single one. There are some tricky things to handle in this environment. We won't talk about logging just yet, but now you have the problem that if I'm modifying data on two different storage devices, where do I actually put the log that keeps track of those changes? There are a bunch of other things we'll cover when we talk about logging later in the semester.

So with partitioning, or table partitioning, the idea is that we're going to split a single logical table into disjoint subsets or segments that we can store and manage separately, in different locations, on different disks. Again, ideally we want this to be transparent to the application: we don't want users writing SQL statements that are designed to go to a particular storage device, because then we can't move things around underneath the covers without making changes to the application. Most of the time, and this is probably still common, people start off using MySQL, hit the limit of what they can do with MySQL on a single box, and end up doing what's called sharding, where they have a middleware layer that routes queries to the MySQL installation that actually has the data they want. Facebook is the most famous one for doing this; Google used to do this as well. Ideally we don't want to have to write that application code or middleware ourselves; we want the database system to do it for us automatically. Some systems can actually do this, others cannot.

So there are two types of partitioning we can do. Again, we'll cover this in more detail when we talk about distributed databases, but the high-level idea is the same. The first is called vertical partitioning: we split the table's attributes into different locations, different files on different storage devices, and whenever we need to reconstruct the whole tuple, we know how to fetch from those different locations.
Right, so in this case here, attribute four is really big and the other three attributes are small, so maybe what we do is store all of attribute four's data in one partition and the remaining three attributes in another partition, and any query I write against this table knows how to stitch these things back together. What does this sound like? We talked about this before. A column store, right, same idea. This is actually a pretty common approach; Wikipedia uses it. For all of its revisions, Wikipedia has the text of the revision and the metadata about the revision, and they actually store them in separate tables, which is essentially vertical partitioning. It's sort of like a poor man's column store.

What we mostly mean when we talk about partitioning, though, is what's called horizontal partitioning, or, if you're coming from a NoSQL system, what they call sharding. The idea is that we keep all the attributes of a single tuple together in a single location, but we split up which tuples are stored in which location. So in this case we have four tuples: we put two tuples in partition one and two tuples in partition two, and we keep some additional information inside our system so that when a query shows up and says, I want the data for this key, we know how to go to the right partition and get it. Again, we can store these partitions in different files, on different disks, on different storage devices, or even on different machines. And the way you split this up, which we'll cover more later, can be simple: take a single attribute and hash it, and that decides which partition you go to. Or you can do range partitioning, or predicate partitioning, or more complicated things. Different database systems do different things, and different workloads want different things. For OLTP, when you're doing single-key lookups, hash partitioning works great, because you just take whatever key is being looked up, hash it, and that tells you which partition to go to. For OLAP queries, that may not be the right thing to do, okay? I'm rushing through this a bit, but I think the high-level idea should be pretty straightforward.

Right, so parallel execution is important. Almost every database system does this, and there's a lot I glossed over in a big way, like how to coordinate the different threads and how to schedule their operations. Some of that we'll cover next week when we talk about concurrency control, and some of it we'll cover in the advanced class. All right, so the next class on Wednesday is sort of a potpourri lecture where we'll talk about different ways to embed logic inside our database system to make it do more complicated things than we can do just through SQL. We'll talk about stored procedures, user-defined functions, triggers, and views. The idea is that instead of having all the logic for our application inside the application, we can push some of it into the database system and have it run more efficiently there, because we're closer to the data and we don't have to go back and forth.

Okay, all right, so in the last three minutes I want to talk about the extra credit assignment. The deal for extra credit is that you can earn 10% on your final grade if you write an article for an online encyclopedia we've been working on called the Database of Databases.
So the idea is that you write a sort of academic-style technical article about one particular database system, whichever one you want, and you provide citations to explain how it's actually implemented, what's going on on the inside. I don't care about marketing claims that say, oh, it's fast. I really care about how it actually implements concurrency control, how it implements logging, or indexes, or buffer pool management; the things we talk about in this course are the kind of stuff I want in this article. The website is dbdb.io. I'm currently aware of 560 different database systems, and there's a search panel where you can say, show me all the database systems that use two-phase locking or that use B+trees, and for each of them there's an article that summarizes the major points about its implementation.

So the way this is going to work is that I'll post on Piazza a sign-up page, a spreadsheet on Google Docs; pick whatever database you want to write about. I'll have a list of ones where we already have articles from previous years, so you can't choose those. And I'll help guide you on what I expect you to fill out for a complete article. If you pick something super common, like Oracle, there will be a ton of information about it, so I expect the article to be very comprehensive and complete. If you pick an obscure one that no one's ever heard of and that's defunct, you may have a hard time finding information, and that's okay, but I want to know this ahead of time so we can set your expectations for how much work it's going to take. So it's up to you to pick whatever database system you want. As I said, I am personally aware of 560 different database systems, so you will have no problem finding one that's interesting to you. Do you care about graph databases? Distributed databases? In-memory databases? Databases written in China? Databases written in the US? I try to annotate all these different systems, so you can just go and pick whichever one is actually interesting to you, okay? And if you find a database system that I'm not aware of, please let me know, because we want to add it to the list. I spent the summer looking at old Usenet posts from the 1990s to find guys in their basements writing database systems so we could list them, okay? So if you can't find a system that interests you, something's wrong with you.

All right, and again, it goes without saying: please do not steal, please do not plagiarize, do not take a system's documentation and copy and paste it directly into the article. You have to provide citations for all the information you add. If you say this database system uses two-phase locking, I need a citation to the documentation that says they actually do this, okay? All right, so I'll post the sign-up sheet on Piazza. It's first come, first served for the different databases: pick whichever database you want, and again, I'll have the list of ones that you're not allowed to pick because we've already done them. Then I'll post further instructions on the website with information about what's expected in the article and what the different categories of features mean. Okay? Any questions?