Let's get started. How you doing? "Pretty rough." You're rough? You look down. What's wrong? You got some women problems? "Some women problems, and also some men problems." Men problems? "Yeah." What are your problems? "They've been saying that my beats are too fresh. No, they're too trill, and they just can't handle that." People are saying your beats are too fresh and they can't handle it? "Yeah." You've gotta take that day by day. I'm not qualified to help you with this. I'm sorry. Okay. All right, so let's talk about databases then.

Quick reminder: homework one is due tonight, and project one goes out today. I'll announce it at the end of class; it's actually on the website now and the source code is online, but I'll discuss what it is and what you're required to do. Just like before, you'll submit it on Gradescope and everything will be auto-graded.

I do want to spend some time talking a little bit about workloads. OLAP is where, after you've collected a bunch of data on the OLTP side, you start analyzing it to extrapolate new information, like "people in the city of Pittsburgh are more likely to buy this kind of product," so that you can push that information back to the OLTP side to get people to do the things you want them to do. And then there's hybrid transactional-analytical processing, HTAP. This is sort of a new buzzword that Gartner invented a few years ago, basically describing database systems that try to do both.

A typical setup you'll often see looks like this: you have your front-end OLTP databases, and then you have your giant back-end data warehouse. The front ends are sometimes called data silos: you do a bunch of updates into one database instance, and whether it's a single node or distributed doesn't matter, it's a single logical database and you apply your changes there, but the silos don't really communicate with each other. Each one is an island by itself. So then you do what's called extract, transform, and load, or ETL. That's the term for taking data out of these front ends, cleaning it up, processing it, and putting it into the back-end data warehouse. The example I like to give is Zynga, the FarmVille people. They buy a lot of gaming startups, and when they buy them, each one runs its own front-end OLTP database. When Zynga wants to put everything into their giant back-end data warehouse, so they can analyze things to make you buy crap on FarmVille better, they hit problems like: in one database the first name of a customer is stored as fname, in another it's first_name. It's the same concept, the same entity, just with different names. The ETL process cleans all that up. You shove the data into your data warehouse, do all your analytics there, and whatever new information you derive, you push back to the front. When you see things like "people who bought this item also bought this item," that's computed on the OLAP side and then shoved to the front end to expose through the OLTP application. HTAP basically says: let's also run some of the analytical queries that we could normally only run on the OLAP side directly on the front-end data silos.
You still want the giant data warehouse, because you want to be able to look at all your data silos put together. But now, instead of waiting for things to be propagated to the back end, you can do some of the work on the front end. That's basically what HTAP is. The front ends could be MySQL, Postgres, MongoDB, whatever you want; the back-end data warehouse would be the Hadoop stack, Spark, Greenplum, Vertica and the other large enterprise data warehouse systems, or cloud offerings like Redshift or Snowflake. Okay, so that's clear.

The main topic today: we've already spent two lectures deciding how we're actually going to represent the database on disk. Now we want to talk about how we take that database, those files and pages on disk, and bring them into memory so we can operate on them. Remember that the database system can't operate directly on disk; we can't do reads and writes without bringing the data into memory first. That's the von Neumann architecture. There is some new hardware coming out that lets you push execution logic down to the disk, but we can ignore that for now. So we're trying to figure out how to bring pages from disk into memory, we want to support a database that exceeds the amount of memory we have, and we want to minimize the slowdown from queries having to touch data on disk. We want to make it appear as if everything is in memory.

Another way to frame the problem is in terms of spatial versus temporal control. Spatial control is: where are we physically going to write this data on disk? If we know certain pages are used together often, possibly one after another, then when we write those pages out we want to write them sequentially, so that when we go read them again they're physically close to each other and we don't have to do long seeks to different spots on the disk. Temporal control is about when we read pages into memory, and, if a page has been modified, when we write it back out. And again, the overarching goal is to minimize the number of stalls we have because a query tried to read data that wasn't in memory and we had to go fetch it from disk.

So here's the overall architecture of the storage manager I showed in the beginning. We've covered the bottom part already: we know how to lay out the database file or files on disk, we know how to use the page directory to find the data we need, and we have a bunch of pages out on disk, slotted pages, log-structured pages, it doesn't matter, and we know how to jump to them. Now we're talking about the part above that: the buffer pool. When something else in the system, like the execution engine running queries, comes along and says "I want to read page two," we have to fetch the page directory into memory, figure out what's in there, and then go find the page that we want and fetch that into memory.
And the tricky part is going to be: if we don't have enough free memory to bring in the page we need, we have to make a decision about which page to write out. That's what we're trying to solve today. The other parts of the system don't need to know or care about what is or isn't in memory. They just wait until the buffer pool gets the thing they asked for and hands back a pointer, and then they do whatever it is they wanted to do. Okay?

So today is essentially about what a buffer pool manager actually does. I'm going to use the term buffer pool manager; some systems call this the buffer cache. It's the same thing: memory managed by the database system. We'll talk about the different policies for deciding which pages to write out when we need to free up space, the additional optimizations we can do to minimize that impact, and then we'll finish up with the other pieces of the database system that may need memory. Okay?

So, again, the buffer pool is essentially just a large memory region that we allocate inside our database system. We call malloc, we get some big chunk of memory, and that's where we put all the pages we fetch from disk. This is entirely managed by the database system; the only thing we need the operating system for is the allocation itself, since we can't conjure memory on our own. Then we break this memory region up into fixed-size, page-sized chunks called frames. Frames might seem like an unusual term: why don't I just say page or block? There are many different terms in database systems for roughly the same thing. Frames are the slots, so to speak, in the buffer pool's memory region that we can put pages into, while slot is the term for where tuples go within a page. For the buffer pool it's frames; within a page it's slots.

So what happens is, when the database system requests a page, we look to see whether it's already in our buffer pool. If not, we go out to disk and copy it into memory. This is a straight one-to-one copy. We're not doing any deserialization, and we can ignore compression for now; however the page is represented on disk is exactly how it's represented in memory. We're not doing any marshalling of the data. We just take it from disk and put it directly into memory, and we keep doing this for all the other pages we may need. So in my earlier example, the execution engine said "hey, I want page two," and the buffer pool manager magically figured out where page two was. If we're just organizing things as frames, pages can go into the frames in any order. In this case here, even though it's pages one, two, three in order on disk, in my buffer pool it's page one and then page three; it's not in the same order that it is on disk. So we need an extra indirection layer above this to figure out, if I want a particular page, which frame has the one I want.
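To make the frames idea concrete, here's a minimal C++ sketch; this is not BusTub's actual code, and PAGE_SIZE and NUM_FRAMES are made-up illustrative values. The point is just that the buffer pool is one allocation carved into page-sized chunks.

#include <cstddef>
#include <cstdlib>

static constexpr std::size_t PAGE_SIZE  = 4096;  // must match the on-disk page size
static constexpr std::size_t NUM_FRAMES = 1024;  // a 4 MB buffer pool in this sketch

struct BufferPoolRegion {
  char *region;  // one malloc up front; the DBMS manages this memory itself afterward

  BufferPoolRegion() {
    region = static_cast<char *>(std::malloc(NUM_FRAMES * PAGE_SIZE));
  }
  ~BufferPoolRegion() { std::free(region); }

  // Frame i is just a fixed offset into the region. A page fetched from disk
  // is copied into a frame byte-for-byte, with no deserialization.
  char *FrameData(std::size_t frame_id) { return region + frame_id * PAGE_SIZE; }
};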
Pages won't be in the same order as they are on disk, and this is what the page table is for. The page table is just a hash table that keeps track of which pages we have in memory: you ask for a particular page ID, and it tells you which frame the page is located in. The database system also has to maintain some additional metadata to keep track of what's going on with the pages currently in its buffer pool. The first thing is the dirty flag: a single bit that tells us whether the page has been modified since it was read from disk. Did some query, some transaction, make a change to it? The other thing is a pin counter, a reference counter, which tracks the number of running threads or queries that want this page to remain in memory, meaning we don't want it written out to disk. For example, say I'm going to update a page: I fetch it into my buffer pool and then modify it. I don't want that page to get evicted or swapped back out to disk between the time it's brought in and the time I actually do my update. Pinning also prevents us from evicting pages that have not yet been safely written back to disk. So I pin a page to say: don't remove this thing from the buffer pool for now.

Then say I want to read a page that's not currently in memory. I put a latch on that entry in the hash table so that I can go fetch the page and then update the page table to point to it. I have to do this because multiple threads could be running at the same time. I can't assume I'm the only one looking at the page table, so I want to prevent somebody else from grabbing that entry and, while I'm fetching the page I need, stealing it and putting something else in.

We'll see this again later in the semester, but there's a bunch of extra bookkeeping to track which pages have been modified; the dirty bit is just one piece of it. We also need to keep track of who actually made the modification, because we want to write a log record describing the change, and we have to make sure that log record is written before the page is. This is another example of why mmap is a bad idea: I can't guarantee the operating system won't write my page out to disk before I want it to, because mmap doesn't let you prevent that. FreeBSD will let you control this, but Windows and Linux won't.

So, is it clear what we're doing here? We're basically managing our own memory, we're keeping track of how transactions and queries modify pages, and we protect the page table to prevent anybody else from evicting or overwriting things before we're done with what we wanted to do. Any questions? Okay.
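Putting the page table, dirty flag, and pin counter together, here's a minimal sketch of the lookup path, assuming the buffer pool region above. The names and structure are illustrative, not any real system's API; the latch here is just a std::mutex protecting the map.

#include <atomic>
#include <cstdint>
#include <mutex>
#include <unordered_map>

using page_id_t  = int32_t;
using frame_id_t = int32_t;

// Hypothetical per-frame bookkeeping: a dirty flag and a pin count.
struct FrameMeta {
  std::atomic<bool> is_dirty{false};  // modified since it was read from disk?
  std::atomic<int>  pin_count{0};     // threads that need this page kept in memory
};

struct PageTable {
  std::mutex latch;  // protects the map itself (this is a latch, not a lock)
  std::unordered_map<page_id_t, frame_id_t> map;

  // Returns the frame holding the page and pins it, or -1 if not resident,
  // in which case the caller has to go fetch the page from disk.
  frame_id_t LookupAndPin(page_id_t pid, FrameMeta *frames) {
    std::scoped_lock guard(latch);
    auto it = map.find(pid);
    if (it == map.end()) return -1;
    frames[it->second].pin_count++;  // pinned: the eviction policy must skip this frame
    return it->second;
  }
};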
So now I need to make a very important distinction between locks and latches. This will come up later on, and you'll have to deal with it in the first project as well. If you're coming from an operating systems background: what the OS world calls a lock is what we call a latch. Let me describe both of them in the context of databases, and I'll describe how they map to the OS world. A lock in the database world is a higher-level, logical primitive that protects the logical contents of the database, like a tuple, a table, or a whole database. A transaction holds a lock for its entire duration while it's running, which could span multiple queries, and that could mean multiple milliseconds, multiple seconds, even minutes or hours for a really long-running query. Locks are something the database system exposes to you as the application programmer: you can see what locks are being held as you run queries. Latches are the low-level protection primitives we use for the critical sections inside the internals of the database system, protecting data structures and regions of memory. A latch we hold only for the duration of the operation we're making: if I go update my page table, I take a latch on the entry I'm going to modify, make the change, and then release the latch. And we don't need to worry about rolling back changes the way we do with locks, because it's an internal thing, updating a physical data structure: if I can't actually get the latch I want, I just abort the operation and retry, and there's nothing to roll back.

Yes? "What does rolling back mean in this context?" Okay, he asks what rolling back changes means. This will come later on when we talk about concurrency control, but basically, say I want to take money out of my bank account and put it in your bank account. We take the money out of my account, but then the system crashes before the money goes into yours. I want to roll back the change made to my account because I don't want to lose that money. That's what I mean. We'll spend a whole lecture on concurrency control; for now we're focused on this.

So, again, in the operating system world, a latch would be something like a mutex, and we're actually going to use mutexes in our database system to protect critical sections. I'll try to be careful and always say latch when I mean latch, but occasionally I'll slip up and say lock; if it's an internal thing, I mean latch. It's extra confusing because the mutex implementation you'd commonly use for a latch is called a spin lock. But it's really this thing, a latch, and not that thing, a lock. Okay?
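To make the difference in scope concrete, here's a tiny sketch reusing the PageTable from above. A latch in code is just a mutex held for one short critical section; a lock, by contrast, would be tracked by a lock manager and held by a transaction for its whole lifetime, with support for rollback. This is illustrative only.

// Latch: acquired and released within a single internal operation.
void UpdatePageTable(PageTable &pt, page_id_t pid, frame_id_t fid) {
  std::scoped_lock guard(pt.latch);  // acquire the latch
  pt.map[pid] = fid;                 // critical section: mutate internal state
}                                    // latch released here, microseconds later

// Lock (conceptually): held from BEGIN to COMMIT/ABORT of a transaction,
// possibly for seconds or minutes, and tied into rollback machinery.
// There is no real lock manager here; this comment just marks the contrast.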
All right, the other distinction we want to make is between the page directory and the page table. Remember, the page directory is what we use to figure out where to find pages in our files: I want page 123, and it tells me which file, at which offset, has what I'm looking for. All the changes we make to the page directory have to be durable, written back to disk, because if we crash and come back, we need to know where to find our pages. The page table, on the other hand, is an internal, in-memory map from page IDs to the frames they occupy in the buffer pool. This thing can be entirely ephemeral; we don't need to back it by disk, because if we crash and come back, our buffer pool is blown away anyway, so who cares? The page directory has to be durable; the page table does not. That means you can use whatever your favorite hash map or hash table implementation is. For project one, std::map is fine, because again, we don't have to worry about this thing being durable. We do have to make sure it's thread-safe, certainly, but not durable. All right.

Now, when we start thinking about how to allocate memory for the buffer pool, there are two ways to go about it. The first is to choose what are called global policies, where we try to make decisions that benefit the entire workload we're executing: we look at all the queries and transactions going on in the system and ask, at this point in time, what's the right thing to keep in memory versus not? The alternative is a local policy, where for each single query or transaction we ask: what's the best thing to do to make my one query go faster, even though for the system as a whole that might be a bad choice? Neither one is strictly better. There are obviously optimizations you can only make with a global view, but with a local view you can be more tailored to what each individual query needs. As with most of the optimizations we'll see, most systems try to do a combination of the two. What you'll implement in the first project is considered a global policy, because it just evicts the least recently used page, even though that may be bad for one particular query.

So that's basically all you need to know to build a buffer pool: you have a page table that maps page IDs to frames, you look at that offset in your allocated memory, and there's the page you were looking for. Seems pretty simple, right? Now we want to talk about how to make this thing super tailored to the application or workload we're running inside our database system. This is going to let us do things the operating system can't do, because the OS doesn't know what queries are running, what data they're touching, or what they'll touch next. We'll talk about four things: multiple buffer pools, prefetching, scan sharing, and buffer pool bypass.

In my examples so far, I've referred to the buffer pool as a single entity: the database system has one buffer pool. In actuality, you can have multiple buffer pools: multiple allocated regions of memory, each with its own page table, each with its own mapping from page IDs to frames. The reason you'd do this is that each buffer pool can then have its own local policy, tailored to the data you're putting into it. For example, I could have one buffer pool per table, because maybe some tables get a lot of sequential scans, while on other tables I'm doing point queries, jumping to single pages at a time.
And I can have different caching policies, different replacement policies, for those two workload types. I can't do that easily if it's just one giant buffer pool. Or I could have a buffer pool for indexes and a buffer pool for tables: they have different access patterns, so each can have its own policy.

The other big advantage is for the different threads trying to access the buffer pool. When I do a lookup in the page table, I have to take a latch on the entry I'm looking at, because I want to make sure nobody else swaps that frame out between the time I do the lookup and the time I actually get the page I want. That means a bunch of threads can all end up contending on the same latch because they're all accessing the same page table. So no matter how many cores my brand-new machine has, I'm not getting good scalability, because everything contends on those critical sections. But if I have multiple page tables, threads can be accessing different page tables at the same time, so they're not contending on the same latches, and I get better scalability. I'm still bottlenecked on disk speed, which is always the big problem, but at least internally the threads aren't fighting over the same latch.

This is something you see mostly in the enterprise, expensive database systems: Oracle, DB2, Sybase, Informix, and SQL Server all support multiple buffer pools. DB2 can do all sorts of crazy things: you can create multiple buffer pools, assign them to different tables, give them different caching policies, even set them to different page sizes. MySQL, even though it's open source, actually has this as well; it's just not as sophisticated. You set how many buffer pool instances you want, and it does simple hashing to decide, given a page ID, which buffer pool has the data you're looking for.

So there are two ways to map the thing you're looking for to the buffer pool that has it. One thing to note: if you have multiple buffer pools, a page can't be in buffer pool one at one moment and a different pool later. It always maps to the same location, so you know how to find it quickly. The first approach is to extend the record ID to include additional metadata about which database object the page belongs to. If you recall, when we looked at the record IDs of Oracle and SQL Server, they had extra fields that Postgres didn't have: Postgres had just the page and the slot number, while they also had an object number, so it's object number, page number, slot number. We can use that object number with another map that says: for object X, the data lives in this buffer pool or that one. So when the system says "give me record 123," I know how to split that up, figure out which object it corresponds to, and which buffer pool maintains that data. The second approach, which I think is what MySQL does, is pretty simple: you take the record ID, hash it, and mod by the number of buffer pools you have, and that tells you where to go get the data you want.
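Here's a minimal sketch of that second, hash-and-mod approach; the structure names are illustrative, not MySQL's actual implementation. The key property is that a page always maps to the same instance, and threads touching different pages mostly hit different latches.

#include <functional>
#include <vector>

struct BufferPoolInstance {
  // Each instance has its own page table, frames, and latch (omitted here).
};

struct BufferPoolManager {
  std::vector<BufferPoolInstance> pools;  // e.g., one per table, or just N instances

  BufferPoolInstance &PoolFor(page_id_t pid) {
    // Hash the page id and mod by the number of pools. The same page id
    // always lands in the same pool, so lookups know exactly where to go.
    return pools[std::hash<page_id_t>{}(pid) % pools.size()];
  }
};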
Either way, this is a really fast, inexpensive operation.

All right, the next optimization we can do is prefetching. The idea here is, again, that we want to minimize the stalls caused by having to go to disk to read data. Say we start a scan with an empty buffer pool, and the query wants to read page zero. Page zero is not in memory, not in our buffer pool, so we have to stall that thread while we go out to disk, fetch the page, and put it into our buffer pool. Once it's there, we hand a pointer back to the upper levels of the system and say: the page you wanted is now here in memory, go do whatever it is you want to do. You can think of the arrow in the diagram as a cursor. Internally, the system keeps track of this cursor as you iterate over every page your query needs, so you know where you left off; when you ask for the next page, it doesn't start from the beginning, it jumps to where you were. So in this case, I get page zero, I'm done with it, I start reading page one, and the same thing happens: I stall because it's not in memory, the disk goes and gets it, we put it in the buffer pool, and then I can proceed.

Now say this query wants to scan the entire table, all the pages shown here. At some point the system can recognize: oh, I know you're going to end up scanning the whole table, so rather than waiting for you to ask for each page one after another, let me jump ahead. I think you're also going to need pages two and three, so let me prefetch those into the buffer pool; by the time you finish processing page one and ask for page two or three, it's already there, and you don't stall. And based on how I laid these pages out on disk, that might be a sequential read, which is super fast, so by prefetching ahead of time I'm also minimizing the amount of random I/O I'm doing. Keep going down like this and prefetch everything, and you've minimized the impact of these disk stalls.

Now, this example is simple enough that the operating system can figure it out too, and mmap will actually do this for you: you can give it a hint (madvise with MADV_SEQUENTIAL) that you're going to read these pages on disk sequentially, and it will prefetch a bunch of them ahead of time. So mmap can do this without knowing anything about what the query is trying to do. But the database system knows what the query wants and can prefetch exactly the pages it needs, ahead of time.
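As a sketch of what a DBMS going through the file system might do, here's sequential prefetch using posix_fadvise with POSIX_FADV_WILLNEED, which is the read()-path analogue of the madvise hint mentioned above: it asks the kernel to start the reads early without blocking. A system using direct I/O would instead issue its own asynchronous reads into free frames. PREFETCH_DEPTH is a made-up tuning knob.

#include <fcntl.h>

static constexpr int PREFETCH_DEPTH = 4;  // how many pages to read ahead (illustrative)

void PrefetchAhead(int fd, page_id_t current, page_id_t last_page) {
  for (int i = 1; i <= PREFETCH_DEPTH && current + i <= last_page; i++) {
    off_t offset = static_cast<off_t>(current + i) * PAGE_SIZE;
    // A hint, not a read: the kernel starts pulling these pages in so the
    // scan doesn't stall when it actually asks for them.
    posix_fadvise(fd, offset, PAGE_SIZE, POSIX_FADV_WILLNEED);
  }
}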
But there are going to be some queries where the operating system can't know what to do, and we do, because we know what the query wants. An example is an index scan. Say I want to find all the tuples where the value is between 100 and 250, and I have an index on that value. I haven't explained what an index is yet; for now, think of it like a glossary in your textbook that lets you jump to the particular page that has the data you want. So instead of doing a sequential scan, I jump through the index and find exactly what I'm looking for.

Say the index pages know ahead of time what key ranges they cover. When my query starts the scan, I always read the first page of the index, the root, so I have to jump there. Then I do a lookup: my query is between 100 and 250, so all the values greater than or equal to 100 start on this side of the tree, and I jump down into index page one and read it. That's still sequential at this point, so the operating system can probably keep up. But now I branch, go down here, and scan across the leaf nodes: index page three, then index page five, and those are not contiguous with each other on disk. The operating system may end up prefetching pages two and three; I don't need page two, so that read is wasted, and I do need page five, which it didn't prefetch. Because we know what the query is going to do, we can prefetch exactly the pages we want into our buffer pool, because we understand the context: what these pages actually represent. The operating system just sees pages; it doesn't know what's in them. We do, because we wrote this code: we know these are index pages connected together, so we know how to do the traversal. This doesn't come for free. There's extra metadata we have to keep in these pages, like sibling pointers and each page's key range, so I know to scan across over here; and strictly speaking, I can't know I need page five before I've looked at page three, so I'm not saying this is super easy to do. But you can see how we may jump through pages in an order the operating system has no way to predict. To me, this is the classic example of what we can do in our database system that the operating system cannot, because the OS doesn't know anything about the data; it just sees a bunch of reads on a region.
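Here's a sketch of that index-aware prefetch along leaf sibling pointers; everything here is hypothetical scaffolding, including the FetchLeaf helper. The key point is in the loop: we only learn each next page ID after reading the current leaf, which is exactly the knowledge the OS doesn't have.

static constexpr page_id_t INVALID_PAGE_ID = -1;

struct LeafPage {
  page_id_t next_leaf;  // sibling pointer stored in the page header
  // ... keys and record ids would live here ...
};

// Hypothetical helper: pins the page via the buffer pool and returns its
// in-memory representation (declaration only in this sketch).
LeafPage *FetchLeaf(BufferPoolRegion &bp, page_id_t pid);

// Walk the sibling chain and pull the next few leaves in early, even though
// they are scattered arbitrarily across the file on disk.
void PrefetchLeaves(BufferPoolRegion &bp, page_id_t first_leaf, int depth) {
  page_id_t pid = first_leaf;
  for (int i = 0; i < depth && pid != INVALID_PAGE_ID; i++) {
    LeafPage *leaf = FetchLeaf(bp, pid);
    pid = leaf->next_leaf;  // follow the logical chain, not the disk order
  }
}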
The next optimization is called scan sharing. The idea is that queries can piggyback off each other, reusing the data that one query is already reading from disk. This is different from result caching. Result caching is: I run exactly the same query, compute some answer, and cache that result, so if the same query shows up again, I hand back the answer I had before rather than rerunning it. Scan sharing is at a lower level, in the buffer pool manager in the storage layer: we have a cursor accessing pages and putting them into the buffer pool, and we can let other threads reuse the pages that cursor is bringing in. The way it works is that we allow multiple queries to attach to a single cursor that's scanning through our pages. It's almost like a pub/sub system: you subscribe to "a new page arrived," and any thread that's waiting gets notified, even though it's not the one that actually did the read. And depending on the implementation, the queries do not need to be exactly the same; typically in result caching they do, but in our world here they don't. I just need to know that we're reading the same pages. In some cases, if queries are computing a similar result, we could even share those intermediate results across different threads, almost like a materialized view; we'll cover those later in the semester, but for our purposes here, we're just sharing page accesses.

So again, the way it works is that if a query starts a scan and recognizes there's another query already doing the same scan, it just attaches itself to the first one's cursor, and as pages come in, it gets notified and can access them as well. The important thing is that we have to keep track of where the second query came along, where it got on the train, so to speak, because when the cursor finishes for the first query, there may be other data the second one still has to go back and read. If you start halfway through, you need to know where you started so you can come back for the rest.

As far as I know, this technique is fully supported only in DB2 and SQL Server. It's super hard to get correct; it seems kind of trivial, but it can get pretty gnarly depending on what the queries are doing. Oracle supports something basic they call cursor sharing, but it only works if you have two exactly identical queries running at the exact same time, whereas DB2 and SQL Server can extrapolate from the query: I know you're reading this table, I need to read the same thing, let me jump on.

So let's look at an example. Our first query, Q1, computes a sum over column A. Its cursor starts scanning through the table, page by page. Say at this point it wants to read page three, and we don't have a free frame in our buffer pool, so we run our replacement policy to decide which page to remove. Doing something simple, page zero is the oldest since it was last accessed, so we replace page zero with page three and continue scanning. Now say that right after we swap out page zero for page three, a second query, Q2, shows up that also wants to do a sequential scan on this table. Without scan sharing, Q2 starts at the beginning like the first one and scans all the way down. But that's actually the worst thing for us, because the first thing it reads is page zero, which we just threw out. So we can end up thrashing: Q2 can't proceed until page zero is back in, so it stalls to fetch a page I had in memory moments ago but got rid of. That's bad. With scan sharing, Q2 just hops along for the ride, reads the same pages Q1 reads, and produces whatever intermediate results it needs for the part of the data it sees. Then at some point Q1 is done and its cursor goes away, and Q2 starts over at the beginning; because it knows it joined while the cursor was on page three, it knows how far it needs to scan to get its final result.
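Here's a highly simplified sketch of the bookkeeping a shared cursor needs; it's not any real system's design (DB2's and SQL Server's implementations are far more involved), and latching is omitted for brevity. The essential idea is remembering where each late arrival attached so it can wrap around and finish.

#include <vector>

struct SharedScanCursor {
  page_id_t current = 0;             // page the scan is on right now
  std::vector<page_id_t> attach_at;  // where each late-arriving query joined

  // A new query joins the scan mid-flight and gets a ticket.
  int Attach() {
    attach_at.push_back(current);    // remember the join point
    return static_cast<int>(attach_at.size()) - 1;
  }

  // After the original scan ends, query q wraps to page 0 and keeps reading;
  // when it reaches its own attach point again, it has seen every page once.
  bool Finished(int q, page_id_t pos) { return pos == attach_at[q]; }
};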
Yes? "The queries must also need some memory for storing their own data, right? And that's separate from the buffer pool?" The question is: each query is computing intermediate results as it reads this data, so it also needs a memory region to put them in, separate from this buffer pool. Yes, and we'll see an example of this in a second, but typically that memory will also be backed by a buffer pool. If I'm computing, say, a join, and the output of that join operator doesn't fit in memory, I need to start evicting those pages out to disk; so any ephemeral memory like that is still backed by a buffer pool, and whether it's the global buffer pool or a private one for the query depends on the implementation. "But we don't need to bring pages in from disk for that buffer pool?" Right, his statement is that we don't read pages from disk into that intermediate-result buffer pool, and that's correct, unless what we're storing gets too big. This example is trivial because the average is a scalar, but imagine some really complex computation: as I scan the data I'm updating my intermediate results, they may overflow memory, and those pages get swapped out to disk. So I'm writing to memory and pages get written out to disk as needed, but it's not like I read base data through it. Anything you need to read from the low-level data pages, you put in the buffer pool that everyone can see. And that's another good point: this is a shared data structure. Q1 was reading pages and putting them into the buffer pool, and any other thread that needs those pages is allowed to go read them. The pin just says: hey, I'm operating on this, don't swap it out to disk. It does not prevent anybody else from reading the page at the same time. There are higher-level things, the locks, that keep track of which database objects you're allowed to read and write. Does that answer your question? Okay.

Here's another example of what's awesome about the relational model: it's unordered. That means that for some queries, I can actually have Q2 start anywhere, and the answer I produce may differ depending on when I execute it, but it's still considered correct. Say I change the query to compute an average with a LIMIT of 100, meaning I only want the average over 100 tuples. The query doesn't say I have to look at the first 100 tuples. So I can start at page three with my scan sharing on this cursor, see my first 100 tuples in those pages, and that's enough for me to compute the result. If I'd started at the beginning, I might get a different result; according to the relational model, that's still fine, because the database is unordered. Yes? "Would it also be valid to not even fetch from disk, but check what's already in the buffer pool and compute the aggregation over whatever is in memory?" Perfect, yes. He asks: rather than having the cursor go fetch my disk pages, could I check the buffer pool, figure out what's actually in memory, and compute this query's aggregate from that? Absolutely, and the smarter systems can do exactly that. It doesn't matter where the tuples come from; as long as I see 100 tuples, this query is still correct. Now, you wouldn't want to write this query, but it's valid.

All right, the last optimization I'll talk about is buffer pool bypass. It's sort of related to the earlier question about intermediate-result memory.
Say I have some queries doing sequential scans, and I don't want to pay the penalty of looking up every page in the page table and checking the buffer pool to figure out whether the page I want is in memory. Furthermore, I also don't want to pollute the buffer pool by reading a bunch of data I'm not going to need again in the near future. With buffer pool bypass, or buffer cache bypass depending on the system, you allocate a small amount of memory to the thread running your query, and as it reads pages from disk (if a page isn't in the buffer pool, it has to go to disk to get it anyway), rather than putting them in the buffer pool, it puts them in its local memory. When the query is done, all of that just gets dropped and thrown away. You do this to avoid the overhead of going through the page table, which is a hash table with latches; it's not super expensive, but it's not free. In Informix these are called light scans, and pretty much every major database system supports something like this. I don't know whether MySQL 8 does; I don't think 5.7 does. And note that you only really want to do this if the thing you're scanning, or the intermediate result, is not huge. If you're doing a sort over terabytes of data, you want it backed by the buffer pool, because that can page out to disk as needed.

The last thing to understand is what's actually going on below the database system: what the operating system is doing as we read pages. All our disk operations go through the OS API, the low-level fopen, fread, and fwrite calls; we're not going to access the raw disk ourselves. And because we're going through the operating system, by default the OS maintains its own separate cache for the file system, called the OS page cache. That means that as I read a page from disk, the OS keeps a copy of it in its page cache, and then I have another copy in my buffer pool. Most database systems do not want this, so when they open a file they pass the O_DIRECT flag, direct I/O, which tells the OS not to do any caching of its own; the database manages what's in memory by itself. When you go read the manual for pretty much any major database system, it will tell you to make sure this is turned on. The only major database that relies on the OS page cache, as far as I know, is Postgres. The reason they give is an engineering one: it's one less caching layer they have to manage. Postgres still has its own buffer pool, but it's not going to be as big; it won't grab all the memory on the system the way MySQL or Oracle would. They let the OS do some of the memory management itself. So from an engineering perspective it's less for them to maintain, at a minor performance penalty.

I like using Postgres for demos because it's almost a textbook implementation of a database system, and it exposes a lot of the important concepts we're talking about pretty easily.
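For reference, here's what opening a file with direct I/O looks like on Linux; this is a generic sketch, not any particular system's code. O_DIRECT requires the buffer, file offset, and transfer size to be suitably aligned (typically 512 bytes or 4 KB, device-dependent), which is why frames get allocated with posix_memalign instead of plain malloc.

#include <fcntl.h>   // open, O_DIRECT (with glibc in C, this may need _GNU_SOURCE)
#include <cstdlib>   // posix_memalign

int OpenDatabaseFile(const char *path) {
  // O_DIRECT bypasses the OS page cache entirely: every read and write goes
  // to the device, and the DBMS's own buffer pool is the only cache.
  return open(path, O_RDWR | O_DIRECT);
}

void *AllocAlignedFrame() {
  void *buf = nullptr;
  if (posix_memalign(&buf, 4096, PAGE_SIZE) != 0) return nullptr;
  return buf;  // safe to pass to read()/write() on an O_DIRECT fd
}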
Okay, so this demo is running on a machine back in the lab; let me turn on the lights. I'll type on this laptop because it's a pain to type on the Surface; I hate the keyboard. All right, this is htop, a better version of top, and the thing I want to focus on is the memory readout up at the top. The green bars show the resident set size of the processes running on this machine, the memory they've malloc'ed. The orange bar is the file system page cache, the operating system's page cache: as processes on this machine read pages from files without direct I/O, the OS caches them as well. We can blow all of this away. This is Linux, so we run a command that syncs the file system and writes 3 into /proc/sys/vm/drop_caches, which forces the operating system to flush its page cache. Now if we go back to htop, the total amount of memory in use went from 32 gigs down to about three gigs; we blew away the file system cache entirely.

Okay, now let's bring up Postgres. The first thing I want to do is restart it, which blows away its buffer pool. Now reconnect, turn on timing, and turn off the parallel query workers. We're going to use the same table I showed last class: 10 million rows of a bunch of decimals. I'm going to run a query with EXPLAIN, passing two flags: ANALYZE and BUFFERS. ANALYZE actually runs the query and shows you the query plan and what happened; BUFFERS tells you how much data it read, and what fraction of the pages came from the buffer pool versus disk. Because we blew away the file system cache and restarted the database system, the hit count should be zero, and indeed it says the buffer pool had to read 44,248 pages, the whole table from disk, and it took about 1,300 milliseconds, 1.3 seconds. If I run the same query again, it now says the hit was 32: it found 32 pages already in the buffer pool, and the rest it had to read from disk. The reason it wasn't all of the pages is that Postgres maintains a sort of small buffer ring per query, 32 pages, so this run was allowed to reuse 32 pages from the last run. If I run it again, it should go to 64. It keeps growing as I execute the query over and over, because Postgres recognizes the data I need isn't in the buffer pool and grows the ring for that query.

So now we can force the database system to put everything into memory. Postgres has an extension that comes by default when you install it, called pg_prewarm, which is just a function you invoke that says: go take all the pages for this table and bring them into the buffer pool. It reports that it read 44,248 pages. Remember, when I ran the query the first time, it said it had to read 44,248 pages from disk; that's exactly the number of pages in this table, even counting the 64 pages that were already in memory. So this is forcing it to read everything.
Actually, I don't think it even looks to see what's already in memory; it just reads everything, because if I run it again, it reads the same number of pages. So now if I run that query again, I'm doing a little bit better: about 16,000 of the pages I needed were hits in the buffer pool, but I still had to read a bunch from disk. Anyone want to guess why? Yes: it depends on the size of the buffer pool. Postgres has a setting called shared_buffers, and it tells me it's currently set to 128 megabytes, while the table was 44,248 pages. You can use the database as a calculator (I love databases): SELECT 44248 * 8 / 1024 gives me megabytes, so the table I'm reading is about 345 megabytes. shared_buffers is 128, but my table is 345. So I go into the Postgres configuration, find that particular parameter, and lo and behold, it's 128 megabytes; let me be generous and set it to 360 megabytes. Now we restart Postgres, and we also blow away the operating system's file system cache, because as we read the table in (if you look back at htop, the orange bar got a little bigger; that's our table being read) the OS cached it too. So blow away the file system cache, reconnect, turn on timing, turn off the parallel workers, check shared_buffers... oh, I'm an idiot. Server 10, client 11; too many Postgres installations, sorry. Go back, it's still at 128, set it to 360, restart Postgres, reconnect: 360, okay, good. Timing on, parallel workers off, pg_prewarm again, 44,248 pages, and now when I run that query, my hit is 44,248. I gave the system the right amount of memory, I prefetched everything, and now everything hits the buffer pool; I didn't touch the disk at all for this query. For every page I access I'm still doing a lookup in the page table and finding the page's frame, but everything is in memory.

So how can we prove that Postgres is relying on the file system cache? Let's turn off the EXPLAIN and just see how long the query actually takes. The first time was 1,250 milliseconds, it got a little faster, and then it's 733; so it takes about 700 milliseconds with everything in the buffer pool. Now let's restart Postgres, which blows away the buffer pool, and reconnect, which I think I need to... yep, reconnected, timing on, parallel workers off. Earlier, when everything was out on disk, this query took 1.3 seconds, and with everything in the buffer pool it took 700 milliseconds, so this run should be roughly in between... the timing was off. Sorry. Well, that ruined the demo. Fuck. Okay: go back, restart, reconnect, timing is on (now it's on, yeah, I got it), parallel workers off again. So I'm going to run this query. I restarted the database system, which blows away the buffer pool, but the operating system still has its file system cache. So if I run this query, we're going to get a bunch of buffer pool misses, because nothing is in the buffer pool, but it's still not going to take the full time.
Right: it took 800 milliseconds instead of 1.3 seconds, because the data it needed was in the file system cache. If I run it again, I should get about 700 milliseconds... there it goes. As for the run that was slower even though it was still reading from the cache: I think it's because it was running EXPLAIN ANALYZE, and it slowly gets faster as the per-query ring buffer grows; it's that query-local cache rather than the global one. But the main takeaway is what we showed: with everything in the buffer pool, we get full-speed performance.

Any questions? Yes? "You pre-warmed twice, and the second pre-warm was like 30% faster. What is that?" Oh, that's the file system cache, the OS cache. "And the first time, when you put the entire table in the buffer pool...?" So the very first time I did this, the buffer pool size was 128 megabytes and the table is 345 megabytes; it didn't all fit, and that's why I still had lookups that read from disk. But it said 44,000 right at the beginning, so let's walk through it again: blow away the file system cache, restart Postgres. And that bar in htop isn't all necessarily attributable to Postgres, there are other things running on the system, but I blew away the file system cache and restarted Postgres, so now there's nothing in memory. Go back to Postgres, reconnect, turn off the parallel workers, and run the query: the first time, nothing's in memory, and I had to read 44,248 pages. Okay, that's expected. pg_prewarm tells the database system to go read everything on disk for that table and bring it into my buffer pool: the entire 44,248 pages. I can run it again, and again it reads 44,248 pages. Now I run the same query, and my hit is exactly 44,248: everything I was looking for was found in the buffer pool. I forced the database system to bring everything into memory; in the first example I only had 128 megabytes, so I couldn't fit everything.

Yes? "You said at the beginning that Postgres is the only system that relies on the OS page cache. Why doesn't everybody else do this?" Great question. Because then I'm going to have two copies of every single page, potentially: a page in the OS page cache and a copy of that page in my buffer pool. And if I modify my copy, it's not an exact copy anymore; the OS has the old one and I have the new one. So it's redundant data, and you're more efficient in terms of memory usage if you manage everything yourself. Furthermore, think about different operating systems: most database systems support Linux now, but the major ones also have to support Windows, BSD, all these different OSes where the page cache may have different performance implications or different policies. To guarantee consistent performance and consistent behavior across OSes, you manage everything yourself. "Is that number pages?" This is the number of pages, but they're 8-kilobyte pages: take the number, multiply by 8, divide by 1024, and that gives megabytes. I set shared_buffers to that size, and now I can guarantee everything fits. Yes? "How does the database's buffer pool interact with the OS page cache? Are there different options for how to use it?"
No, it's transparent to the program. I call fread to go read a page from disk; if the OS has it in its page cache, it serves me that page, otherwise it goes out to disk and gets it, and all of that is transparent to me. If I pass the direct I/O flag, that tells the OS not to cache anything, and it will always go to disk. So the OS page cache sits in between the disk and the database. It matters a lot for writes, too: if you write a C program and call fwrite, does that actually get written right away? No, it goes into the page cache, and at some later point the disk scheduler says, all right, let me write this out. It's only when I call fsync that it actually gets forced to disk. So if I want complete control over how everything gets written to disk, I use direct I/O, and most database systems do. "When you had a buffer pool of 128 MB and brought all 360 MB into it, the first 128 MB would have been overwritten. So when you ran the query, which starts from the beginning, shouldn't you have gotten misses, since only the later 128 MB was resident?" I want to get through the material for the project, so let's talk about that afterwards.

All right, so the thing we're going to talk about now, quickly, is the buffer pool replacement policy. We know how to find the page we want from its page ID via the page table, and in my examples we mostly had enough memory; now, what happens if I need to bring a page in and don't have space for it? The things we care about in a replacement policy are, first, correctness: we don't want to evict data that someone has pinned before they're actually done with it. We care about accuracy: we want to evict pages that are very unlikely to be used again in the near future. We want the replacement policy to be fast, because while we're doing a lookup in the page table we're holding latches, and we can't afford to run some NP-complete algorithm to figure out which page to evict; that might take longer than just reading the page from disk. And of course, we don't want a lot of metadata overhead for tracking all this; we don't want the metadata estimating how likely a page is to be used to be larger than the page itself. Replacement policies are another good example of what distinguishes the high-end, very expensive enterprise databases from the open-source ones. The high-end systems have very sophisticated replacement policies: they track statistics about how pages are being used and extrapolate from what the queries are actually doing to make the best decision. The open-source systems and the newer systems, not that they're bad, just haven't had the decades of engineering and the money spent making this run as fast as possible, so they do something simpler, which is what we'll talk about here. This is one of the oldest problems in CS; everybody and their uncle has published a caching paper over the years. I have one.
It really is one of the oldest problems in computer science, so there's a long history here. All right, the easiest technique, and the one pretty much everyone uses first, is LRU, least recently used. All we do is keep track of the timestamp of when each page was last accessed, and when we have to figure out which page to evict, we look for the page with the oldest timestamp and remove that one. To speed this up, instead of keeping just a timestamp per page, which would force a sequential scan across all the pages in the buffer pool to find the lowest one, we can maintain a separate data structure, like a queue sorted by timestamp: any time somebody reads or writes a page, we pull it out of the queue and put it at the end.

What you're going to implement in the project is an approximation of LRU called clock. Quick show of hands: who here has heard of clock before? Nobody? Awesome. Who's heard of LRU? I should hope so. Okay, good. LRU is exact least recently used; clock is an approximation where you don't have to track an exact timestamp for every single page. The only information you keep is a single reference bit per page, which tells you whether that page was accessed since the last time you checked it. You organize your pages in a circular buffer, like a clock, and a clock hand goes around doing sweeps, checking whether the reference bit is zero; if it's zero, the page hasn't been accessed since the last check, and therefore it can be evicted.

So say I have pages one through four, each with its own reference bit, all initially zero. Some query accesses page one, so I flip its reference bit to one; no matter how many times the page is accessed, it stays at one, because it's a bit, not a counter. Now I need to evict a page because I'm out of space. My clock hand starts at the first page: its reference bit is one, so it's been accessed and I shouldn't evict it, but I reset the bit to zero and move on to the next one. If I sweep all the way around and come back and it's still zero, then I know I can evict it. The next page's bit is zero, so we evict it, replace it with a new page, set the new page's reference bit to zero (not one), and move on. Now say pages three and four get accessed. The hand checks page three: bit is one, reset it to zero; checks page four: reset to zero; and comes back around to page one, the first one we checked. Its reference bit has been zero since we last checked it, so it can be evicted. The reason this is an approximation is that we're not evicting exactly the least recently used page; we're just saying that within some time window, this page hasn't been used, so we can go ahead and evict it. And the intuition is that if a page hasn't been used in a while, it's probably not going to be used again in the near future, so it's safe to evict.
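Here's a minimal sketch of the clock sweep; it's illustrative, not the project's required interface. Note the pin check: pinned frames are never victims, and this sketch assumes at least one unpinned frame exists (a real implementation would return an error instead of looping forever).

#include <vector>

struct ClockReplacer {
  std::vector<bool> ref_bit;  // set whenever the frame's page is accessed
  std::vector<int>  pins;     // pinned frames must be skipped
  std::size_t hand = 0;

  void RecordAccess(std::size_t f) { ref_bit[f] = true; }  // a bit, not a counter

  // Sweep until we find a victim; may go around the clock more than once.
  std::size_t Evict() {
    while (true) {
      if (pins[hand] == 0) {
        if (!ref_bit[hand]) return hand;  // bit already zero: evict this frame
        ref_bit[hand] = false;            // accessed recently: give a second chance
      }
      hand = (hand + 1) % ref_bit.size();
    }
  }
};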
But clock and LRU are susceptible to what's called sequential flooding. When we have a sequential scan that reads every single page, it pollutes the buffer pool, and we can end up evicting pages that we really do want, pages that will be used again in the near future, because the scan gave a bunch of pages newer timestamps than the page we actually care about. In this case the most recently used page is actually the one I want to evict, not the least recently used. This is another good example of why, if you can have different buffer pools for different tables based on how queries access them, maybe for one I want most-recently-used eviction and for another least-recently-used.

Let's look at an example. Say I have one query doing a point lookup, WHERE id = 1, and it reads page zero; I fetch that into my buffer pool and I'm fine. Now another query does a sequential scan, ripping through all my pages, and when it needs to make space for page three, least-recently-used decides page zero is the oldest, evicts it, and puts in page three. But my workload executes queries like the first one over and over again. If I run that query again, I need page zero, which I just evicted; now I'm screwed, because I have to go out to disk and get it. What I really should have evicted was page one or two, because the scan is going to keep moving through the data and it's unlikely anybody else will come back and read those.

There are three ways to get around this, and we've sort of covered some of them already. The first is LRU-K, where instead of one timestamp you keep the last K timestamps of when each page was accessed. Now, when you decide what to remove, you don't just look for the lowest timestamp; you look at the intervals between those timestamps, and the page with the longest gap from one access to the next is the one least likely to be used again. Because we're using the history to estimate when a page will be accessed next, we can make a better decision about which pages to evict. LRU-K is what the more sophisticated systems do; I think MySQL might use something like this, I don't remember. (A small sketch of the idea follows below.)

The next optimization, which we already touched on with multiple buffer pools, is localization per query. Rather than dumping everything I scan into the global buffer pool, I set aside a small group of pages in the buffer pool that are specific to my query. Anybody can still read them, but I keep track of how I'm using pages, so when I decide what to evict for my query, I evict the ones least recently used by me, not by the global view. We saw this in Postgres: that little ring buffer it keeps of the pages a query is accessing, to make its own eviction decisions.
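Here's a hedged sketch of the LRU-K idea, my own illustration rather than any particular system's implementation: keep the last K access timestamps per page and evict the page whose K-th most recent access is furthest in the past (pages with fewer than K accesses count as infinitely far back). The linear scan in Victim() is a simplification a real system would avoid.

```cpp
#include <cstdint>
#include <deque>
#include <limits>
#include <unordered_map>

class LRUKSketch {
 public:
  explicit LRUKSketch(size_t k) : k_(k) {}

  void RecordAccess(uint64_t page_id, uint64_t now) {
    auto &hist = history_[page_id];
    hist.push_back(now);
    if (hist.size() > k_) hist.pop_front();  // keep only the last K stamps
  }

  // Assumes at least one page has been recorded.
  uint64_t Victim() const {
    uint64_t victim = 0;
    uint64_t oldest_kth = std::numeric_limits<uint64_t>::max();
    for (const auto &[pid, hist] : history_) {
      // Fewer than K accesses: treat the K-distance as infinite,
      // i.e., as if the K-th access happened at time 0.
      uint64_t kth = (hist.size() < k_) ? 0 : hist.front();
      if (kth < oldest_kth) {
        oldest_kth = kth;
        victim = pid;
      }
    }
    return victim;
  }

 private:
  size_t k_;
  std::unordered_map<uint64_t, std::deque<uint64_t>> history_;  // page -> stamps
};
```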
All right, the last one is priority hints. This is what we talked about before: when we have indexes, we know how queries scan data and what pages they're going to access, so we can use that information to make eviction decisions. Say we have our B+tree, or whatever tree data structure we want, and a bunch of queries inserting data where there's a global counter for this table, incrementing by one and inserting over and over again, like a serial or auto-increment key. If the index is sorted on ID from min to max, we know that every insert's ID value is always one more than the last one we inserted. That means we're always going down the right side of the tree and touching those pages, so we should pass hints up to the buffer pool manager saying: try to keep these pages in memory; I don't care so much about the other ones. Likewise, any query that does a lookup on this index always goes through the root page, because that's how you enter the index. So I want to make sure the root is always pinned in memory, because if I get to the bottom, need space, and pick the root because it's the least recently used, that's a bad idea: I know the very next query is going to go right through that page. Again, this is what the commercial systems can do: provide this extra information from up above.

The last thing to talk about is how we actually handle dirty pages. Remember there's a dirty bit on the page that says whether a query has modified its contents since it was brought into the buffer pool. When we decide which page to evict so we can bring a new page in, the fastest thing we can do is find a page that's not marked dirty and immediately drop it, reusing its frame for the new page. The slower path is that if a page is dirty, we have to write it back out to disk safely before we can reuse that space. So there's a trade-off our replacement policy has to make: there may be a bunch of clean pages I could drop super easily, but they may be needed in the near future, so maybe I don't want to drop them; instead I pay the penalty of writing out a dirty page, flushing it, removing it from my buffer pool, and reusing its space. Balancing this is super hard, because evicting a dirty page costs two disk I/Os: one I/O to write out the dirty page and remove it from the buffer pool, and then another I/O to read the page I actually want. Dropping a clean page costs just the one I/O to read the page I want. How you figure that out is, again, super hard, and this is something the commercial systems, in my opinion, do better than the open source ones.

One way to avoid the problem of having to write a page out at the moment I need free space in my buffer pool is background writing. Periodically the database system has a thread that walks through the buffer pool, figures out which pages are marked dirty, and writes them out to disk, so it can flip them to clean. That way, when I run my replacement policy to decide what page to remove, I have a bunch of clean pages I can drop right away. (A small sketch of this follows.)
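Here's a minimal sketch of that background writer, under stated assumptions: Frame and WriteToDisk are hypothetical stand-ins, not BusTub's API.

```cpp
#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>
#include <vector>

struct Frame {
  std::mutex latch;
  bool dirty = false;
  int pin_count = 0;
  // ... page contents would live here ...
};

void WriteToDisk(Frame & /*frame*/) { /* stand-in for the disk manager call */ }

// Periodically sweep the pool, flush unpinned dirty pages, and mark them
// clean, so the replacement policy later finds cheap victims.
void BackgroundWriter(std::vector<Frame> &pool, std::atomic<bool> &running) {
  while (running.load()) {
    for (auto &frame : pool) {
      std::lock_guard<std::mutex> guard(frame.latch);
      if (frame.dirty && frame.pin_count == 0) {
        // CAVEAT (discussed next): the log records for this page must be
        // flushed to disk *before* this write, or recovery breaks.
        WriteToDisk(frame);
        frame.dirty = false;  // now a cheap eviction candidate
      }
    }
    std::this_thread::sleep_for(std::chrono::seconds(1));
  }
}
```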
You've got to be careful when you do this, though: you don't want to write out dirty pages before the log records that correspond to the modifications that made them dirty. You want to make sure the log records are written out to disk first, before you write out the dirty pages. We'll have a whole lecture on why that's the case later in the semester, but the point is I can't just blindly write out any page I want; I have to take some extra protective steps to make sure I'm writing things in the right order. And this is something that mmap cannot do.

All right, I'm going to skip this part on the other memory pools; we've already sort of covered it. It's more than just the pages from tables and indexes: when we run queries we also need to generate some intermediate information.

So again, the whole point of this lecture was to show how we can manage memory better than the OS, because we know what the queries are doing, we know what's in the pages, and we know how things are being accessed, so we can make better decisions. Essentially we use information about the queries for all the different optimizations we talked about, to make this work better. Any questions about the buffer pool?

All right, here's what you really care about: project one. For the first project you're going to build your own buffer pool manager and replacement policy. This will all be done in our new database system called BusTub, which is an open source, disk-based system. You'll see there are stub files in the code you download from GitHub that clearly show: here's the function you need to write, and here's how to implement what we're asking you to do. The project write-up is available online; Gradescope isn't set up yet, but we'll do that later today. And if you can finish the project in a single day, come talk to me, because we want to give you other things to do.

We already provide the disk manager and the page layouts, so you don't have to worry about that. We give you a block of pages, and it's up to you to decide how to store them in memory and to invoke the disk manager to write them out as needed. For the first part there's a separate class called ClockReplacer, and you'll implement the clock policy I talked about today: again, an approximation of LRU where we just sweep the hand and flip the reference bits. That means you need to track pages as they're being accessed; you'll see in the buffer pool API that when I say read a page or write a page, you have to update the reference bit inside your ClockReplacer. One important thing: if you do a sweep and all the pages have been referenced, then you just pick whichever one has the lowest frame ID, and if all the pages are pinned and you can't free one, you pick the one with the lowest page ID, because otherwise you'd just spin forever. This will be in the write-up.

The major effort is the buffer pool manager. You'll implement the clock replacer algorithm first and then hook it into your buffer pool manager. For that part it's up to you to decide how you actually want to maintain your memory: what internal data structures you keep to track which pages are available, which pages are dirty, and which pages are pinned. You can do whatever you want, but you have to implement the API that we expose to you. (There's a hedged sketch of the pin/unpin bookkeeping after this.)
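For intuition, here's a hedged sketch of the kind of pin/unpin bookkeeping involved. The real signatures come from the stub files; every name here (PageId, FrameMeta, and so on) is an illustrative assumption, not BusTub's actual API.

```cpp
#include <cstdint>
#include <unordered_map>

using PageId = uint32_t;
using FrameId = uint32_t;

struct FrameMeta {
  int pin_count = 0;
  bool dirty = false;
};

class BufferPoolSketch {
 public:
  // Fetching pins the page: a pinned page must never be chosen as a victim.
  FrameMeta *FetchPage(PageId pid) {
    auto it = page_table_.find(pid);
    if (it == page_table_.end()) {
      // Real code: ask the replacer for a victim, flush it if dirty,
      // then read the requested page into the freed frame.
      return nullptr;
    }
    FrameMeta &frame = frames_[it->second];
    frame.pin_count++;  // the caller is now using this page
    return &frame;
  }

  // Callers must unpin when done, saying whether they modified the page.
  void UnpinPage(PageId pid, bool is_dirty) {
    FrameMeta &frame = frames_[page_table_.at(pid)];
    frame.pin_count--;
    frame.dirty = frame.dirty || is_dirty;  // never clear the bit here
    // Once pin_count reaches zero, the frame is eligible for eviction.
  }

 private:
  std::unordered_map<PageId, FrameId> page_table_;
  std::unordered_map<FrameId, FrameMeta> frames_;
};
```

Getting the ordering right (pin before handing the page out, unpin only after the caller is finished, never evict while pin_count > 0) is exactly what the graded tests probe.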
So, the thing that always trips up students every year is getting the ordering of the pin operations correct. We'll run multi-threaded graded tests where we try to read a page and pin it at the same time, and make sure everything turns out in the right order. This will be clearer when you look at the write-up and see what we're asking you to do.

How to get started: again, everything is available on GitHub. If you don't have a GitHub account, sign up for one; it's free, and I think there's also an educational tier that gets you extra stuff. Go to the GitHub page for the database system, hit the fork button, fork it into your own repo, and mark it as private, then do all your changes in there. If you sign up for a GitHub account you can get free private forked repos. Keep it private, because if you put everything in public, other students can see what you're doing.

The very first thing you should try to do, today or tomorrow, as soon as possible, is get the software to build on whatever machine you're going to do your development on; it will be super helpful. It works on Ubuntu, it works on OS X, and it works on Windows with the Windows Subsystem for Linux, which is a package you can download and install. The catch with OS X is that it's not going to support the Clang formatting stuff I'll talk about in a second; Gradescope will run that for you, or you can run it in Docker. If this is a problem, we can also give you a VM image, but you'll have to figure that out on your own, and we'll have instructions to help. It does not compile on the Andrew machines; we tried it, it doesn't work, the software on there is too old. If this is a problem, if you don't have your own laptop, please email me and we'll figure something out.

Things to note: do not change any file other than what you must hand in. There are four files you have to turn in; we blow everything else away, plop your code on top of the latest version of the system, and run all the tests. The projects are cumulative, meaning if you bomb this one you're going to have problems later on, because the next project is going to use the buffer pool manager that you build now. We're also not going to provide solutions, and we're not going to help you debug your code on Piazza.

Another thing we're doing here is requiring you to write good-looking code. Normally people write messy code, so now we have a bunch of checks to make sure it conforms to a good style guide. We follow the Google C++ style guide, and we also follow the Javadoc style guide for comments. We have checks in place for all of this: if you call make format, it makes sure your code looks pretty per the C++ style guide, and for the other things, like how you allocate memory and how you set up your for loops, we use clang-tidy and clang-format to enforce more detail. You'll run commands like check-clang-tidy, check-censored, and check-lint; they won't correct anything for you, they'll just throw errors and say your code looks crappy, here's how to fix it. We run those in Gradescope, so when you turn your project in, if you write crappy code you'll get a zero score because you'll fail those checks. As I was saying, this works on Linux and Windows.
For OS X, we can provide you a VM and you can do all your development in there.

Last thing: don't plagiarize. We will run your code through Moss. There are some people in China who have taken the code and already implemented some of this stuff; it's all crap, we've run it, it doesn't work. And again, don't put your stuff in a public repo, because if someone copies from you because your repo was public, we run both of you through Moss and you both come up as duplicates of each other. I don't know who stole from whom, but we're going to fail you both. So don't make any of your code public. You can do that after the semester; I know you want to go on the job market and say, here's what I did in this class. Truth be told, no one's actually going to care, because everyone's implementing the same thing; it's not like an independent study where you made some real breakthrough. Employers don't care that much that you have your class project online. But if you want to put it up after the semester, we're fine with that.

Any questions? Thank you.