All right, let's go for it. This is the last lecture of the class. Everything we talked about before was all hardware you can buy today and actually use, so if you wanted to go build a database system doing the stuff we talked about, you could do that. Today is the only sort of futuristic one, where we're talking about storage devices that don't actually exist yet. But we have some early prototypes that we've been using here at CMU, so we can talk a little bit about how we think this is all going to work out. So for today, we're going to talk about the background of non-volatile memory. Then we'll talk about the paper that you guys were assigned to read, which Joy and I wrote last year, about developing different storage and recovery methods inside of a database architecture that are designed to use NVM correctly. And I'll show what I mean by "correctly" as we go along. And then we'll finish up by talking about what will be expected from all of the groups in the class when you do your code reviews of another group. So I'll talk about what I expect you guys to do and give a basic outline of the kinds of questions you should be asking or thinking about when you look at people's code.

For non-volatile memory, the basic way you can think about it is that these are future storage devices that are coming along that are going to have almost the same read and write speeds as DRAM, but they're actually going to be persistent and durable like an SSD and have larger capacities. DRAM is obviously a transient memory: you pull power, you lose everything. SSDs are durable, but they're much, much slower. So the idea is to bridge that gap and be able to do low-latency reads and writes very quickly. The thing that makes non-volatile memory kind of difficult or confusing is that there's no standard terminology that all of the different companies and industries use. Sometimes they'll say storage-class memory. Sometimes they'll say persistent memory. Sometimes they'll say NVRAM. But generally this is all the same idea: it's going to be as if it was DRAM, but you can ensure that your writes are durable.

So what's going to happen is the first devices that come out are going to be essentially like the Fusion-io cards or the new high-end Samsung flash cards: a PCI Express interface, and then you use this standard API that a bunch of different manufacturers developed, called NVMe, to do block-based reads and writes to the device. Again, this is what's confusing about the terminology: the API or protocol you're going to use to access these things is called NVMe, but you can use that today with flash devices that I wouldn't consider to be non-volatile memory. NAND flash is different than what we're talking about here, even though, again, you use the NVMe API. Then, a few years later, the next set of devices will come out that are byte-addressable and fit into the actual DIMM slots where DRAM goes today, which puts them much, much closer to the CPU. Therefore, you don't have to pay this huge latency of going back and forth between the CPU and the PCI Express bus. And hopefully, everything will be durable. I'm not really going to talk about what is needed to make this last one a reality from a software standpoint in the operating system.
You can imagine there's all this history, years of people writing code assuming memory is volatile, and now you throw in memory addresses that are non-volatile. So how do you actually program to that? I'm going to ignore all that for now, and we're just going to focus later in the class on how we actually build a database system to use it, assuming that we have the operating system support to make this work. Before I can get into how we do this from the database system side, I want to spend a little bit of time talking about what this technology actually looks like, because I find the story of how it came about in recent years very fascinating. Now, I'm not an electrical engineer, and I've never taken an electrical engineering course, but through my travels on the internet and Wikipedia you learn a little bit about these things.

If you took an electrical engineering course today, they would teach you about the fundamental elements of passive electronic circuits, and in particular they would talk about three fundamental things. The first is the capacitor. This is essentially your battery: the thing that can store some charge and then release it. This was invented, or discovered, way back in 1745, I think in the Netherlands, when they figured out you could have this kind of device. The next thing you have is the resistor, and that came about in 1827. This is basically the thing in your circuit that lets you modify or change the current or the voltage going across the wire, based on the resistance you set it to. Then soon after that came the inductor, which is basically a coil that lets the circuit store energy from the current outside the wire itself, in a magnetic field.

So for 150 years, this is pretty much all we had. Then in 1971, there was a new professor at Berkeley in electrical engineering named Leon Chua, and he wrote this paper where he did the math and predicted there was actually a fourth fundamental element of circuits that had yet to be discovered. Basically, he worked out the math and saw that the way the equations worked out, you had to have this fourth element, and you couldn't build it from the other three fundamental elements; it had to be a low-level primitive in itself. Essentially, what he predicted is that it would be a two-terminal device, an input and an output, where you can change the resistance of the device based on what voltage you apply to it. So think of it like a resistor, but you can modify what its resistance is based on the current you give it. The key thing is that if you turn off the voltage you're using to set the resistance, the device will remember, or persist, what that resistance was forever. So think of it like a resistor with a little knob you can change, but it doesn't require power all the time in order to make that knob stick. He wrote this paper in 1971, and it's very theoretical, very mathematical, and no one ever read it because no one could understand it. What he was essentially predicting is what we now call the memristor: the fourth fundamental circuit element. So again, this paper came out, no one read it, and no one really knew about it.
Then in the early 2000s there was a team at HP Labs, led by Stanley Williams, that was trying to do some far-reaching research on developing low-level, self-assembling nano-computing devices. What happened was they created this nano device that had these weird electrical properties that they couldn't actually understand. They didn't know why it was behaving the way it did: you would put a voltage in and expect to get one voltage out, but you would get something else. They had this thing for a couple of years and didn't really understand what was actually going on. Then, a few years later, they just happened to stumble upon Chua's 1971 paper, and they finally realized that what they had invented was the memristor, which is pretty crazy. So they had this big announcement in 2008 saying, we found this fourth fundamental circuit element, and we can actually manufacture it.

And just to be clear, I have a link here to the Stanley Williams HP page; make sure you get the right one. There are two Stanley Williamses. Back when we started this project, I had Joy try to reach out to this guy and talk to him, and Joy ended up contacting the other one, a death row inmate in LA who's actually one of the founders of the West Side Crips. You don't want that one. This is the real Stanley Williams. So when you Google Stanley Williams, make sure you get that one.

What was really kind of cool about the memristor stuff is that when they announced they had discovered it, they went back into the old journals and annals of scientific publications of the last century, and they kept finding examples of other memristors. This graph here is an idealized case, a simulation, of what a memristor would actually do in a circuit: you change the current, you change the voltage, and that changes what the resistance is. You see this thing called a hysteresis loop. What they did is they went back through all these old publications, and they found examples of people drawing the same kind of hysteresis loop, with the same properties, when they were doing other kinds of experiments. But they didn't know what it was. They didn't know about the memristor yet. They didn't know why it was doing these things. It just had this unusual property, and there are all these papers that show these examples. This one is testing out different metal alloys inside of vacuum tubes. There are even papers from the 1920s and 1930s where they chart this kind of loop and say, we don't know why it does this; here's just something interesting we found. In fact, they were building memristors, but they didn't know it at the time. So there's this paper here, "Two Centuries of Memristors," and the other one, too, about how the HP guys found it. I absolutely encourage you to read that; I think it's absolutely fascinating what they were doing. We'll talk a little bit about why I totally drank the Kool-Aid about how memristors were going to save the world. That was back in 2008, 2009, and now it's 2016 and it's still not here yet. But we'll talk about that later. So here's another thing that makes it tricky to understand this non-volatile memory technology. HP invented this memristor and announced that they had found it, but the memristor is not one particular type of technology. It's a class of circuits, a class of storage technologies.
So I'm going to go through the three major emerging technologies that are going to come out for non-volatile memory. All of these can be classified as memristors, even though the resistive RAM one is the one from HP, and that's what they market as the memristor. So they're all memristors, but HP uses "memristor" for their device, which you can also think of as resistive RAM.

What was touted as one of the first storage devices to come out of this technology was this stuff called phase-change memory, PCM. The basic idea of the way it works is that you have two metal electrodes at the top and the bottom, and then you have this chalcogenide in between whose state you can modify, and therefore its resistance on the wire, to represent a 0 or a 1 based on what form it's in; you change its form to be sort of opaque or transparent by how you heat it. So if you want to set this to a 0, the heater is this line going into it, and you give it a quick burst of energy, and that will reset it to 0; but if you give it a gradual increase of energy, that will change its structure to become a 1. It's obviously not a little flame in there; it's just a little wire that gives it a little shock of energy. So this was believed to be one of the first non-volatile memory devices that was going to come out, but as you can imagine, it has a bunch of manufacturing and heating problems, because you have to use a lot of energy to change this device. We're not really going to talk about the form factors of these storage technologies, but one of the things that's going to happen in the future is that they're going to start moving more and more memory directly onto the CPUs, 3D-stacking it on top, because you want to reduce the latency of having to go out through the memory controller. In the case of PCM, and certainly DRAM, you can't put these things directly on the device because they just use too much energy; they'd generate too much heat and fry the CPU. So this thing, again, IBM was pushing it, Intel was making overtures that this was what they were going to do. I think you can get like 120 megabytes of this now, but they haven't really ramped up manufacturing; you can't get it in large quantities yet.

The next one, which I find the most interesting, is resistive RAM, and again this is what HP calls their memristor. The basic way this works is that you have two layers of platinum or some metal, and then you have titanium dioxide layers in between. It's actually two layers of titanium dioxide, where the bottom layer has all the electrons it's supposed to have and the upper layer is missing a few of them. Then, based on the direction of the current you put in, some electrons move up or down, and that changes the resistive properties of the circuit above. So you run the current one way, electrons go down, you get a zero; run it the other way, electrons go up, and you get a one. That's the basic idea of how it works. What's really awesome about this, and again why I think the memristor story was amazing when it came out, is that titanium dioxide is the same stuff they put in white house paint, the same thing they put in sunscreen. It's super common, it's super cheap. And they were talking about being able to make these memristors at one petabyte per square centimeter, some crazy numbers, right?
And of course that hasn't happened yet, because they've been having trouble manufacturing it. Another thing that's really awesome about memristors, if you believe what the HP people say, and again if you read this article here, "How We Found the Missing Memristor," and watch the video from Stanley Williams, the right one, is that they talk about how you can actually turn the storage fabric itself into programmable execution logic, executable gates. So you can think of it as being able to turn half of your DRAM into an actual FPGA and program it to run the computations you want directly on data that's on the same storage device. They talk about how, yeah, we can build neural networks on this thing, we can basically simulate the brain well beyond what we can do today, all directly in memristors. And of course, now that HP has announced their big memristor machine, called The Machine, they don't talk about this part. I haven't really seen too much literature about it; they're all about just having non-volatile storage. But if this happens, I think it would be amazing. What's also really cool about it is that it doesn't use the same NAND logic gates that our CMOS chips use today. It actually uses a different type of logic, called material implication logic, that was worked out by Bertrand Russell, a famous philosopher and an awesome dude, around 1910. So you'd be using 1910 mathematics to program 21st-century memristors. I think that's amazing. Of course, again, HP says they're going to put this in their new thing called The Machine, but I think it was about a year ago that they announced that the first versions of The Machine aren't going to have any of this memristor stuff. It's just going to have regular DRAM. So who knows when this will happen.

Then the last one, which is much farther out than PCM or resistive RAM, is magnetoresistive RAM, MRAM. This is actually not a new idea; it's been around since the 1990s, but there's a new technology called Spin-Transfer Torque RAM, or STT-RAM, that is supposedly going to be how they'll implement this in the future and get really good scaling, beyond what PCM or resistive RAM can do. The basic way it works is that you have two magnets, these ferromagnetic materials. You have one at the top that's fixed, meaning its polarity is always going in one direction, and one at the bottom that you can switch: you send a little current and flip its polarity back and forth, and that gives you the zero and the one. What they claim is that this STT-RAM will be able to scale to much smaller sizes than what you can do with PCM and the resistive RAM stuff. They also claim that it'll have latencies closer to SRAM, your L1, L2, L3 caches, instead of DRAM. So it'd be like you get rid of maybe L3, get rid of DRAM entirely, and have your entire system built out of this stuff. As far as we know, this is way farther out, like maybe 10, 15, 20 years. Who's "they"? I think there are a couple of smaller startups doing this; I don't know whether the big companies have anything. Yeah, I think they all have little labs, but they don't really publish exactly what they're doing. Yeah, I think he's right: Samsung is one of the major ones looking at this. Okay.
So again, this is just a high-level overview of what the technology looks like and how it actually works. Basically, what's going to happen in the future: Intel announced their new 3D XPoint technology, which, and I shouldn't say anything I'll regret on video, is an okay name. They announced that they're going to have NVMe 3D XPoint drives, the PCI Express drives, later in 2016. And then, through our various travels, we've heard rumors that in 2017 they will update the Xeon instruction set to include the ability to deal with NVM DIMMs. I'll show in a second what you have to do to make sure your writes to memory end up being durable; those new instructions will end up in the 2017 update. That doesn't necessarily mean we'll have NVM DIMMs available right away; it just means the CPU will be able to handle them when they finally come out. Samsung announced last year, in 2015, that they partnered with this other company called Netlist, and they're looking to develop their NVDIMM storage technology. They're not saying which of the three technologies, PCM, ReRAM, or the magnetic stuff, they're actually using; they're just looking at this area now. So they have NVDIMMs today that are basically DRAM backed by flash, but for a truly NVM-only storage technology, they haven't announced anything yet. And in the case of HP's memristor ReRAM, it's always two years away. In 2008, Stanley Williams came out and said, oh yeah, by about 2010 we'll have it. In 2010 it was two more years. In 2014 it was two years. Now, in 2016, they're saying two more years, right? Just to show you that I'm not kidding, this is a photo somebody took at a big HP Labs announcement, I think in 2010, and there they're saying their thing will be ready, and we'll be able to run memristors, two years later. It hasn't happened. It's always two years with these guys. So we'll see.

Okay, so how is this going to affect our database system? As I said, the first devices that are going to come out are these NVMe storage cards on PCI Express, and they're going to be block-addressable. From thinking it through, we suspect that block-addressable NVM will not be that interesting for database systems; they're essentially just going to be faster SSDs. Joy is doing some research on how you might improve logging when you still have a block-addressable device, but that's just going to be an extension of what has already been done; there aren't going to be really fundamental changes. It's when we have byte-addressable non-volatile memory that we think things are going to be dramatically different. That's going to be a big game changer for database system architectures, but we have to make sure we use NVM correctly, and I'll show you what I mean by that in a second. It is my hunch that when this byte-addressable stuff comes out, the in-memory database systems, the kind we've been talking about in this class, will be better positioned to use NVM correctly and efficiently than the disk-oriented guys.
The disk-oriented systems assume that memory is volatile and that they have a block-based storage device where they can put their slotted pages down, and they have to maintain this buffer pool to pull things in and out. But now you have this problem: I'm writing into a slot in my buffer pool, and that write could be made durable, but the database system is not going to treat it as a durable write. It's still going to go through the mechanism of writing it to a log and writing it out to disk later on. These systems cannot be changed very easily to deal with non-volatile memory and use it correctly, whereas the in-memory systems, because we're already dealing with pointers and other things, will get the better performance, as long as we make sure those pointers are consistent on restart. I'll say some things offline about what I've heard from commercial systems about this.

The other thing that's important to consider is my intuition that byte-addressable non-volatile memory will only have a major impact for OLTP applications, OLTP workloads. We saw last class, when we talked about larger-than-memory databases, that for OLAP there's not that much you can do, because you're just streaming things off of durable storage and you really can't take advantage of any aspect of the actual workload itself. So in the same way, for an OLAP system, I suspect these systems will just get faster, but you don't really fundamentally change the architecture. Yes? [Student: You're saying that for OLTP the byte-addressable NVM matters a lot. Have you considered that you can just battery-back DRAM, and that basically gives you the same thing? What's the trade-off?] Yeah, so his statement is: what if you just had battery-backed DRAM? You can buy DRAM today that has a supercapacitor on it, so that when power is cut, the system has enough juice to write it out to flash. So yeah, absolutely, this idea of battery-backed DRAM in a database system is not new. There are papers going back to 1988, 1989 that do this. For the particular paper we're talking about here, I'll show you the things you have to do to make sure that still works. But one of the things you couldn't do back in the '80s that we can do now, or at least once the Xeon gets updated, is make sure that when we write to a CPU cache, we can flush it out to storage. Because the issue is, yes, you have battery-backed DRAM, but then you come back online, and what the hell are you looking at, right?

I should also say that we're not going to be able to do this magic thing where you pull the plug, plug it back in, and everything is exactly the way it was before. There are all the low-level registers, the program counter, things like that; all that gets blown away. It's just that we don't have to pay this big penalty of replaying the log and loading in snapshots as before. So again, I think the byte-addressable stuff is going to be more significant for OLTP, because that's where we're actually doing writes, and that's where having persistent memory can make a big difference. Okay, so this leads into the paper that you guys were assigned to read. Again, this is a paper that Joy and I wrote last year that was published in SIGMOD.
Our basic goal was to try to understand how a database system would behave or interact with NVM in a system where it only has NVM. This is me as a new professor trying to be very forward-thinking: say we're 10 years from now, DRAM has gone away, and you only have NVM, regardless of which technology it's actually built on. How would you want to design your database system to use it correctly? We focused on the storage and recovery methods because that's where the main bottleneck, the main slowdown, is going to be for storage. If everything's in memory and you don't care about persistence, there isn't anything different between NVM and DRAM. But when you have NVM and you can be a bit smarter about it, we think we can speed those things up. To do this, we developed a prototype database system called N-Store, a stripped-down database testbed with a pluggable storage architecture that allowed us to implement a bunch of different engines in a single system, not worry about high-level features like the SQL parser, and just focus on measuring the performance of the storage and recovery mechanisms in the system.

So before we get to what these database storage engines are, we need to talk about what we need in our execution environment to be able to use NVM correctly and ensure that we have durability. The first issue we have to deal with is the synchronization of our writes to memory. As I said, all the existing programming models are based on the Von Neumann architecture: you have this DRAM that's ephemeral, transient, volatile; you do all your writes there; and then if you want to make sure anything is persistent, you have to write it out to disk. But now, if we have non-volatile memory, we can do a write into memory and we want a guarantee that it's durable. The problem is that the CPU has caches sitting between our process doing the writes and the NVM device, and the CPU can decide at any time that it wants to move things out to NVM. We also need to make sure that if we do a write and our transaction commits, we flush those changes out to NVM rather than leaving them sitting in the CPU cache. So it's like this: I do my store into L1, L2, and before I can tell the application that their transaction committed, I want to make sure it's out, safe and durable, on the NVM. We're going to need some way to ensure this in our system, and right now operating systems like Linux do not provide it. So we're going to have to develop our own memory allocator. I'll show a little sketch of what this flushing step looks like in a moment.

The other problem is that we need the ability to restart our process and make sure that when we start looking at our memory, our table heap and the data structures for our database system are pointing to valid pieces of data. When we start a program the first time, we're using virtual memory addresses to point to different things. Now we crash and come back, and we can restore the state of our application because it's already in NVM, but we want to make sure all our indexes are pointing to the correct things. And likewise, if we're using something like MVCC, where we have multiple versions and we embed pointers in the actual tuples themselves, we want to make sure those are also pointing to valid things, right?
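To make that first problem concrete, here's a minimal sketch of the flush-and-fence step, assuming x86 intrinsics and a 64-byte cache line. This is my illustration, not the actual allocator code from the paper:

    #include <immintrin.h>  // _mm_clflush, _mm_sfence
    #include <cstddef>
    #include <cstdint>

    // Assumed cache line size; real code would query this from the CPU.
    static const size_t kCacheLine = 64;

    // Flush every cache line covering [addr, addr + len) out of the CPU caches,
    // then fence so we know the stores are done before we report a commit.
    void NvmPersist(const void *addr, size_t len) {
      uintptr_t start = reinterpret_cast<uintptr_t>(addr) & ~(kCacheLine - 1);
      uintptr_t end = reinterpret_cast<uintptr_t>(addr) + len;
      for (uintptr_t line = start; line < end; line += kCacheLine) {
        _mm_clflush(reinterpret_cast<const void *>(line));
      }
      _mm_sfence();
    }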
So when the thing crashes, what we wanna be able to do is come back and restore all of this in some kind of smart way. We need a way to identify that this is some offset, some location in memory, and that's what we're pointing to, so that regardless of where it is in physical memory, the virtual memory addresses are all still the same. To make this work, we had to develop our own NVM memory allocator. You can think of it as a replacement for malloc, and it provides two basic primitives. The first is a synchronization function that exposes a call like msync or fsync to the application, which forces the allocator to write back any cache line in the CPU out to NVM. You use this instruction Intel added called CLFLUSH, cache line flush, and then you issue an SFENCE instruction to have the database process wait until the CPU comes back and says, yes, the data in this cache line is now durable in NVM. The second thing we provide is a naming primitive that allows us to tag virtual memory regions with a marker, so that when we come back online with that same marker, we get the same data regardless of where it is in physical memory. So this goes beyond what malloc provides you today; we have to extend it because we want to make sure we're using NVM correctly.

Yes? [Student: When you say "regardless of where it is in physical memory," the physical memory is not changing, right?] Sorry, sorry, virtual memory, right? Your process starts up and it can be anywhere in virtual memory. The physical memory is always going to be the same, and it's going to have a virtual memory address. [Student: So why do you want the virtual memory address to stay the same?] You want the mapping to stay the same: you give it a logical name, and it knows it saw that logical name before. I should back up. To do this, the allocator maintains some metadata internally to keep track of this mapping, and when you come back, it has to restore this metadata. We have to do all the same things to make sure the metadata itself is persistent, durable, and atomic in our allocator when it writes it out to NVM. So when our process starts back up, we know we were this process before, here's the metadata for our process, and here's how to map the markers to the virtual memory addresses. Everything comes back exactly the way it was before, and we can use this primitive to build more complicated data structures on top, like linked lists and B-trees, so that again, when we crash and restart, all the pointers are pointing to valid things. [Student: The page table is also stored in NVM? So the page table will be the same after a restart?] So the statement is that the page table is stored in NVM, which means the mapping will be the same after a restart. Yes, yes. Okay.

All right, so now that we have these primitives, we can build our database engines. For this paper, we built the three sort of canonical engines you would have in a real system today. The first one supports in-place updates.
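And just to give a flavor of that naming primitive before we get into the engines: here's a hypothetical sketch where the marker maps to a file on an NVM-aware filesystem. The path and the interface are made up for illustration; the real allocator also records the virtual base address in its metadata so it can hand the region back at the same place and keep any pointers stored inside it valid:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>

    // Hypothetical naming primitive: the marker picks a file, so re-opening
    // with the same marker after a restart hands back the same bytes.
    void *GetRegion(uint64_t marker, size_t size) {
      char path[64];
      std::snprintf(path, sizeof(path), "/mnt/pmem/region-%llu",
                    static_cast<unsigned long long>(marker));  // assumed mount
      int fd = open(path, O_RDWR | O_CREAT, 0644);
      if (fd < 0) return nullptr;
      if (ftruncate(fd, static_cast<off_t>(size)) != 0) {
        close(fd);
        return nullptr;
      }
      void *base = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
      close(fd);
      return base == MAP_FAILED ? nullptr : base;
    }

Okay, so back to that first engine.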
We have a table heap with tuples, and any time you want to modify something in the table heap, you just overwrite the existing tuple; we're not doing MVCC. We also update a write-ahead log with the delta that changed, and then periodically we take snapshots and write them out to durable storage as well. This is essentially the same architecture that's used in H-Store and VoltDB. The second one is a copy-on-write architecture, and this is where you don't have a log at all. You only have the table heap, and the table heap is organized as a B-tree or B+tree. We use essentially shadow paging to make additional copies of the nodes in the tree, and when a transaction commits, we just flip a little pointer to point to the new version. I'll show what that looks like in a second. The key thing about this one is that it doesn't have a write-ahead log. This is exemplified by the LMDB system. And the last one is the log-structured architecture, where we don't have a table heap; we only have a log, and all changes get appended to it. We're using the same kind of leveled architecture that's used in LevelDB or RocksDB for this. So I'm going to go through each of these engines one by one, and then I'll show you how you can modify them to use our NVM-aware allocator to get better performance and reduce the wear on the device.

In the case of the in-place updates engine, as I said, you have an in-memory index, you have an in-memory table heap, and then on durable storage you have a write-ahead log and snapshots. Say you want to modify this tuple here. The first write we have to do is the delta to the write-ahead log. The second write is the change to the actual tuple itself. And then eventually, later on, the database system makes a third write, out to the snapshot, with the updated tuple. Now think about this in the context of an NVM system, where we don't have any DRAM; everything's in NVM. If we persist our change here, we don't really need the write-ahead log, and we potentially also don't need the snapshot, because if this heap is durable, our change is there, right? So the first problem is that we have duplicate data, and we also have a longer recovery latency, because we have to follow that same protocol as before, where we load in the last snapshot and then replay the log. All this duplicate data is unnecessary if we're careful about how we do our writes into the table heap.

So to optimize this architecture, we're going to leverage the allocator's non-volatile memory pointers to only record what items were changed rather than how they were changed. Meaning, I don't need to store deltas anymore; I just need to store a pointer that says, here's the thing that got modified, and at the point when that's flushed to the log, I know my change is persistent and durable. And we only have to maintain an ephemeral, transient undo log, because we need to be able to roll back a transaction if it aborts. The reason we need this is that the CPU is allowed to flush anything out of its cache whenever it wants. So it may be the case that although our transaction hasn't committed yet, things moved out of L1, L2, L3 into our NVM device, and if we need to roll back the transaction, we need to be able to know that this thing didn't actually commit and undo our changes.
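Here's roughly that write path, built on the NvmPersist() sketch from earlier. The structures and the log-append call are made up for illustration; the point is the ordering, and that the log records a pointer rather than a delta:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    void NvmPersist(const void *addr, size_t len);  // flush + fence, from above

    struct Tuple { uint64_t id; char payload[56]; };  // toy fixed-size tuple

    // Undo record so an aborted transaction, or a crashed uncommitted one
    // whose writes the CPU already evicted out to NVM, can be reversed by
    // copying the old version back.
    struct UndoRecord { Tuple *slot; Tuple before; };

    void WalAppendPointer(Tuple * /*slot*/) {
      // Stand-in: the real engine appends the tuple's NVM pointer to the log
      // region and persists it. We record *what* changed, not how.
    }

    void UpdateTuple(Tuple *slot, const Tuple &new_val,
                     std::vector<UndoRecord> &undo_log) {
      undo_log.push_back({slot, *slot});  // save the old version first
      *slot = new_val;                    // overwrite in place, directly in NVM
      NvmPersist(slot, sizeof(Tuple));    // the in-place write is now durable
      WalAppendPointer(slot);             // then log the pointer; commit is safe
    }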
And likewise, if we crash before the transaction aborted or committed, we want to be able to roll back and know that this change in NVM actually shouldn't have been made. So it looks like this. Now everything's in NVM, and when we want to modify this tuple, we only need to store in our write-ahead log a pointer to the tuple, saying this is the thing that got modified, and then we write our change directly here. Only when that write is actually made in place and we know it's durable is our transaction safe to commit. Now, if we crash before the transaction gets to modify this, we have that undo log in NVM to go and reverse it. But if we crash after the transaction committed, then the memory allocator library just goes and checks that the pointers are all valid and can reverse anything that shouldn't have been there yet. So when we commit, we not only commit to the write-ahead log, we also commit to the memory allocator.

Yes? [Student: The CPU cache is essentially playing the role volatile DRAM played before. In the DRAM-and-disk case, it was found that writing the write-ahead log and flushing data afterwards is better, and here you're saying we can flush right away because the cache is much smaller than DRAM?] So his question is: are we flushing when the transaction commits rather than afterwards, because the CPU caches are smaller? Not before: when it commits, you flush to NVM, which is the equivalent of flushing from DRAM to disk in the earlier case. Correct, yes. [Student: And earlier it was not done that way, so you could amortize the cost of writing to disk. Is it because the latency is much lower, or because the CPU cache is much smaller than what you'd amortize over?] Yeah, it's because the latency is lower. And also because, unlike with DRAM and disk, we don't have complete control over how things get moved from the CPU caches to NVM. Going from DRAM to disk, we have complete control of moving those blocks. In our case, we can call CLFLUSH and be sure that we've moved things out to NVM, but as I said, the CPU is allowed to move stuff down at any time as well. So because it's faster, we can do it at commit, and because we want to make sure it's really there, we can do it ahead of time too. [Student: Should the database system take complete control of the hardware caches?] So the statement is: in the future, should the database system take complete control of the hardware's caches? They slowly add more and more instructions that expose different functionality, like the CLFLUSH stuff. In many cases, though, the hardware is pretty good. We didn't talk about prefetching, but hardware prefetching usually does much better than software. I don't know the answer to the question. But whatever they expose to us, we'll try it. Okay.

So for the copy-on-write engine, think of this as shadow paging from your intro class. You have a master record at the top that points to some current directory pointer, then you have some kind of tree structure with leaf nodes, and these leaves point to slotted pages where you have a bunch of tuples packed in. In code, the skeleton looks something like the sketch below.
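This is a simplified sketch of those structures, just to fix the picture; sizes and layout are made up:

    #include <cstdint>

    struct Tuple { uint64_t id; char payload[56]; };
    struct SlottedPage { Tuple tuples[64]; };     // block-oriented page of tuples
    struct Leaf { SlottedPage *page; };           // leaf points at a slotted page
    struct Directory { Leaf *leaves[2]; };        // simplified: just two leaves
    struct MasterRecord { Directory *current; };  // the one pointer commit flips

    // Commit: build a new (dirty) directory that shares unchanged leaves with
    // the old one, then atomically flip the master record to point at it. The
    // old directory becomes garbage to collect.
    void Commit(MasterRecord *master, Directory *dirty) {
      master->current = dirty;
    }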
So now if I wanna update a tuple that's in slotted page zero, the copy-on-write engine will first make a copy of this leaf and the entire page, and just modify the one thing it needs. If there are multiple tuples in here, you copy the whole thing, and then you create a new dirty directory. In this case, leaf two hasn't been changed, so both the current directory and the new dirty directory can point to it. Once we know all the changes by the transaction are done, we update the master record pointer to point to our new thing, and then we can do garbage collection and blow away all the stuff that's not accessible from the master record. The problem with this, obviously, is that it's very expensive to do these copies. Although I only wanted to update one tuple, because I'm still based on the block architecture of SSDs and HDDs, I have to copy this whole slotted page over, and that's very expensive. So in an NVM-optimized version, instead of having slotted pages, we can have this thing just hold pointers to tuples, because everything is going to be in NVM. Now when we update this one tuple, we can apply the change and only have to copy the pointers into our updated leaf; we don't have to copy the whole slotted page. Then we do all the same stuff as before: create a new dirty directory and update the master record to point to it. You still have to do garbage collection and all the other internal maintenance to make this thing work, but the key thing is that we're copying less data. And we'll see in a second that you don't get that much improvement from using a copy-on-write architecture on NVM, and other than LMDB, I don't know of any system that actually works like this, because it has a lot of overhead. This is the approach systems used in the early days, and they abandoned it because it was really slow. And I'll say that when we first started this project, Joy and I actually tried to make this one work, because the idea was that we would take 1970s database concepts, use them on 21st-century storage technology, and it would magically be awesome and just work. That didn't turn out to be the case at all.

All right, so the last one is the log-structured engine, and the canonical architecture, from LevelDB or RocksDB, is that you have a MemTable in memory, which usually has a B+tree that points down into the write-ahead log, and then you have an SSTable out on disk with a Bloom filter on top that tells you whether a key is in there or not, and an index that points to the data below. If I do a write to a tuple, I update my write-ahead log and append a new tuple delta, and then eventually, over time, this gets full and I create an SSTable with tuple deltas and compacted tuple data in it as well. So again, one modification to a single tuple may get written multiple times, because you're copying from MemTables to SSTables, you're combining SSTables, you're compacting things. There are a lot of writes, and it's unnecessary, because our MemTable will be able to fit entirely in NVM. All our changes there are persistent, and we don't actually need this other stuff at all, right?
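To make that concrete, here's a toy sketch of a MemTable living entirely in an NVM region, again using the NvmPersist() sketch from earlier. A real engine would put a skip list or B+tree on top; the part that matters is persisting the entry before publishing it:

    #include <cstddef>
    #include <cstdint>

    void NvmPersist(const void *addr, size_t len);  // flush + fence, from above

    struct Entry { uint64_t key; uint64_t value; };

    // Toy MemTable in NVM: a bump-allocated array of key/value entries.
    struct MemTable {
      uint64_t count;       // persisted last, so a crash never exposes torn entries
      Entry entries[1024];  // toy capacity; a real engine would grow/split
    };

    void Put(MemTable *mt /* lives in NVM */, uint64_t key, uint64_t value) {
      if (mt->count >= 1024) return;  // out of space in this toy version
      Entry &e = mt->entries[mt->count];
      e.key = key;
      e.value = value;
      NvmPersist(&e, sizeof(Entry));  // entry durable before it becomes visible
      mt->count = mt->count + 1;
      NvmPersist(&mt->count, sizeof(mt->count));  // publish: no SSTable needed
    }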
So we can cut down on duplicate data if we get rid of the SSTable, and we don't have to worry about compactions anymore, because everything lives in here. In the NVM-optimized version, this part basically goes away, and you only have the MemTable in NVM.

All right, so to summarize what we've done: we leverage the byte-addressability and durability of NVM, treating it as true persistent memory, to avoid unnecessary duplication, and this has two effects. One, it improves our throughput, because we're copying less data. But it also reduces the wear on the device, because we're doing fewer writes. This new NVM stuff is going to be like an SSD, where you only have a certain number of writes before a cell burns out. The endurance will be much higher than SSDs, but it's still going to be finite, so reducing the number of writes increases the lifetime of the device itself. And then for recovery: in many cases now, because we only have these pointers, and our memory allocator can look at its metadata and make sure everything is consistent when it comes back up, we don't have to replay a log, and we don't have to spend a lot of time loading back in snapshots and checkpoints. We can come back, do a quick check to make sure our pointers are valid, maybe roll back anything that shouldn't have been written out to NVM, and we're good to go. That's much faster than these traditional architectures.

So for our evaluation, again, we used the N-Store testbed system that we developed, and to avoid the overhead of transactions and other things, we used H-Store-style concurrency control, where we have single-threaded engines protected by a lock: a transaction can only execute once it acquires the lock, and it runs without contention from anybody else. So there are no low-level latches protecting internal data structures. Yes? [Student: Do you do anything to avoid writing to the same cells over and over?] So his question is: do we do anything internally in our architecture to move data structures around, to avoid doing a bunch of writes over and over again to the same cell and wearing out the device? The answer is no. We punt on this. We say that either this is something the operating system can provide, or, in the same way the flash guys provide it in the ASICs on the cards themselves, we assume someone else is going to take care of it. Whether it's hardware or the operating system, we don't care; we're just writing and letting the device do whatever it wants. But yeah, absolutely, in the future you can imagine that if you have a counter in NVM and you're updating it a million times a second, over and over again in one location, you'll burn out that cell pretty quickly. [Student: Doesn't the copy-on-write engine kind of do that for you?] The statement is that the copy-on-write stuff kind of does that for you. Yes. Well, it depends on how you implement your allocator: if you're just reusing pages over and over again, you could burn them out. And in the case of in-place updates, if you have one tuple and you're always updating that one thing over and over again, you'd burn it out too. We're assuming someone else is gonna fix that problem, which is the nice thing about databases: you can punt it down the road.

All right, so as I said, you can't actually buy NVM today, at least not in the byte-addressable DIMM form factor.
So for this paper, we used an experimental NVM hardware emulator that was developed by Intel Labs. Basically the way it works is that everything is still DRAM, but they go into the debug hooks for the memory controller and add busy loops, so that when you read and write to memory, it spins for a little bit and stretches the latency as if it were actually NVM. It's really kind of cool, because there's a BIOS or kernel setting where you can change what latency you want for these devices. For the experiments I'm showing here, it's 2x the latency of DRAM, but I think we can go up to 8x in the emulator; depending on who you talk to, the real range is going to be about 4-6x slower than DRAM. For this experiment, we're using YCSB, but with a write-heavy workload, 10% reads and 90% writes, because we want to stress the actual device and see how well our architectures perform. And we have a high-skew setting, where there's a hotspot and all of the transactions are trying to update the same thing. We don't have contention issues from transactions, because we have a single-threaded engine, so every transaction can run without having to acquire a lock on the actual tuple. So all our transactions are running at almost bare-metal speed, which is what we wanted.

All right, so in the first experiment, we just measure the runtime throughput we get from these different architectures. The gray bars represent the traditional implementations of these architectures; if you take an intro database course, it's essentially exactly what you would see in a textbook. The other bars correspond to the NVM-optimized versions. So again, the gray bars assume that memory is volatile, whereas the red bars assume that memory is non-volatile, if you use it correctly. Across the board, what you see is that the NVM-aware implementations get much better throughput; in this case here, it's almost 2x better. The copy-on-write one is always the slowest, followed by the log-structured one, and this is not really surprising; you would see the same kind of numbers in a disk-based system even today. We also measured the number of writes we're doing, because we want to know how often we write and therefore how long the device would actually last. In this case, lower is better. We see that in-place updates does far fewer writes than all the other ones, and the copy-on-write one is obviously the most expensive, because for every single update you're copying a large slice of the tree over and over again. We see a reduction of about 40% for in-place, 25% for copy-on-write, and 20% for log-structured. Again, this shows that if you treat NVM as truly persistent memory in your database system architecture, the device will last longer.

And then in this last experiment, we measured how long it takes to recover the system after a crash. The first thing, obviously: there's a giant space missing here for the copy-on-write engine, because it has no recovery protocol at all. If you crash before a transaction finishes, the master record pointer never got switched, so you're still pointing to the current directory, and therefore it's consistent.
So there's nothing to roll back; that's why there are no numbers there. Now, for the in-place updates engine and the log-structured engine, what you see is this step function going up as you increase the number of transactions you have to recover after a crash. These are the number of transactions in your log that you have to replay; obviously, recovery time increases because there are more transactions to recover and more computational overhead. But for the NVM-optimized ones, it's nearly flat, because the only thing we have to do when we come back is make sure all our pointers are still correct. It doesn't matter how many transactions are in the log; we may have to roll back the last couple of them, but that can be done really quickly. So again, this shows that not only do you reduce writes and get better throughput at runtime; when it comes time to recover the system, you get almost instantaneous recovery. Now, obviously, if you increase the size of the database, this will go up, but in terms of the number of transactions in the log, it doesn't really affect it.

All right, so the main takeaway I want you to get from this lecture is that when NVM comes along, there are things we're going to have to do to make sure our database system uses it correctly in order to get the best performance out of it. I think NVM is going to have a big impact on the design of software systems when it comes out, and people are going to scramble to figure out how to update their existing architectures and implementations to use it correctly. As I said, I think the in-memory guys are going to be better positioned, because it's not going to be a major rewrite: they won't have to pull out the buffer pool manager and all this other legacy code in order to use NVM correctly. We have to make sure we use a smart memory allocator and that we're syncing our writes correctly, but a lot of the stuff we've talked about in this class doesn't have to change. I think that's kind of cool. So, any questions about NVM? Any questions about the paper? Okay.

So in the last 15 minutes or so, I wanna spend time talking about the code reviews. As I mentioned before, part of finishing up the class for project number three is that each group is required to give a code review to another group in the class. The idea is that you look at their implementation of their project and provide them feedback on what they can do to improve their code, maybe fix some bugs, and ask questions about what they're actually doing. We want to give you experience doing this, because when you go out into the real world, it's not like you're going to write a bunch of code, throw it over the fence, and be done with it. You're going to have to sit down with other people, look at code together, try to make sense of it, and see how it fits into the larger system. So the way we're going to do this is that each development group, each group that actually worked on a project, will open a pull request on GitHub for the group that's been assigned to review their code. The reviewing group will then look at the code in that pull request and provide comments and suggestions, or ask questions and ask for clarifications on what the actual implementation does. That way, we can keep track of all of this.
I'll send an email out as a reminder that everyone should send me the URL for the pull request, so that Joy and I can look at it and make sure everyone is doing what they should be doing. Part of your grade for project three is your participation in this pull request. I'll provide a full write-up of the things I'm talking about today, along with the protocol for how we're going to do this, and post it on Piazza later in the week. The due date for this will be May 8th at 12 p.m. That's two days after the final project presentation and three days before the code drop, so smack dab in the middle. I realize this is going to be kind of tough, because not only do you have to make your code available to the group that's going to review it, you also have to review their code. That's why I'm putting this right before the deadline but after the final presentation: to give you enough time to finish things up.

It sort of goes without saying that you should try to be helpful and courteous. Don't be a dick in your code review. As far as I know, nobody here is actually graduating, so you're all going to be back in the fall, and if you're an asshole during this code review, you've got to come back in the fall and see everyone again. So it's not going to be good. Are you graduating? All right, so you can be a dick. No, again, this is meant to help you improve your coding. By seeing other people's code, seeing other ideas, and getting feedback, I think it can help everyone improve. Yes? [Student: Is there a due date for when we have to send the pull request?] Oh yeah, I probably should have said that. Let's say maybe May 6th; or maybe we can push this out to May 9th and have the pull request due May 7th. Right, you send a pull request to your assigned team, and they have two days to look at it. Then, on May 11th, when you do the final code drop, that gives you two or three days to actually apply the changes they suggested. Again, I'll document on Piazza what I expect.

But what I wanna do now is go through some general guidelines, some general tips, about what you should be doing in your code review. First, the team that wrote the code should provide the reviewing team with a high-level summary or an outline of what they should be looking at. If you refactored some code or ran it through the formatter and just changed some minor things, the reviewing team obviously doesn't want to look at that, because it's a waste of time. So you want to provide a summary that says, here are the files or the functions we spent most of our time on, and this is what we want you to look at. The general rule of thumb I found online is that you shouldn't review more than about 400 lines at a time, and you should only spend about 60 minutes at a time doing the review. I don't know whether that will work out for this project, but we'll see how it goes. The idea is that you dedicate 60 minutes of uninterrupted time to look at the code and figure out what's going on, rather than being distracted by other things. And when you start your code review, rather than just saying, hey, I'm going to read a bunch of code, you should have an outline or a plan for what you're actually looking for and what suggestions you're going to try to provide in your review.
So you wanna use a checklist of the kinds of problems you're going to look for, and I'll go through some examples in a second. The checklist can be broken up into three parts. The first is the high-level things about the code itself. Obviously: does the code work? Can you read it and understand it? Is ZQE dropping down and doing assembly in the project again so that nobody can understand what's going on? Is there any redundant or duplicate code? This is a big deal; we want to avoid it. Does it look like they took one file, copied it to another location, renamed it, and made some minor changes? That's bad; we want to avoid that. Is the code modular? It shouldn't be one giant function or one giant file that does everything; does it look like things can be broken up and reused? We want to avoid global variables as much as possible, or entirely, in our system. Obviously Postgres, because it's written in C, has global variables up the wazoo, but since the Peloton code is C++, we should be able to avoid this. Are there any large segments of commented-out code? If so, those should be removed; if you wrote a bunch of functions that you ended up not needing, just delete them rather than leaving them commented out. And lastly, are they using the proper Peloton debug functions? I think we have checks to make sure there are no printf, cout, or cerr statements; you should be using the debug and info logging stuff that we provide you. That way we don't see random writes to standard out that slow down our system and then have to figure out where in the code they're located.

The next category is the documentation of the system. It goes without saying, but we want to make sure our code is commented and explains what's going on, and not trivial stuff like "here's a for loop"; it should actually explain the hard parts. Are all the functions commented? If there are any edge cases or weird situations you had to handle, anything that's a one-off or special case in the code, is that clearly documented? If you're using third-party libraries: obviously I don't care if you don't document that you're using an STL vector, but if there's some other complicated library you're using in a special way, you should document what it actually is. And lastly, very important: if there's code that you know is incomplete, meaning there's some corner case you didn't handle or some functionality you didn't finish yet, it should be clearly documented with FIXME or TODO labels that explain what's missing and what's going on. A lot of you have asked me whether you can do a capstone project or independent study in the fall, and a lot of you said you want to pick up where you left off in this class. So if there's some feature you didn't implement correctly or didn't finish, unless you document it, when you come back in the fall, after you've spent the whole summer at LinkedIn, Facebook, and Google eating all their free food, you're not going to remember what the hell you actually did back in May when you were scrambling to finish the project. So it's in your own interest, if you wanna keep working on these things, to document where you left off. That way you can come back in the fall and pick up right where you stopped.
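Just to show the kind of annotation I mean, here's a tiny made-up example; treat LOG_DEBUG as a stand-in for whatever Peloton's logging header actually provides:

    #include <cstdio>

    // Stand-in for a Peloton-style debug logging macro. Use the real one from
    // the codebase instead of printf/cout, so output is tagged with its source
    // location and easy to find or compile out.
    #define LOG_DEBUG(fmt, ...) \
      std::fprintf(stderr, "[DEBUG] %s:%d " fmt "\n", __FILE__, __LINE__, \
                   ##__VA_ARGS__)

    // FIXME: Only handles fixed-length attributes; varlen columns are not
    // implemented yet. Documented so whoever picks this up in the fall knows
    // exactly where we left off.
    bool InsertTuple(int num_attrs) {
      LOG_DEBUG("inserting tuple with %d attributes", num_attrs);
      // ... actual insert logic would go here ...
      return num_attrs > 0;
    }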
Anyway, this documentation stuff is really important to make sure we do. And then the last part of the checklist is that we wanna make sure everyone has proper tests for their new implementations. So you wanna look to see whether the tests actually exist, and whether they're actually testing the things they say they're testing. You wanna make sure they're not relying on hard-coded answers. I'm not saying for this project, but in projects I've run in the past, I've seen students have their code write print statements that the test then captures, just to check that those print statements appeared. You wanna avoid stupid things like that, and make sure that if someone changes the low-level implementation of a feature, your tests can still check for the same behavior; I'll show a quick example of what I mean in a minute. And then the last one is that you wanna make sure you have good coverage of your code in your tests. We didn't really talk about code coverage in this class, but the basic idea is that it's a metric that lets you assess what percentage of the actual implementation code is exercised by the test code, right? And you may not have seen this, but Jenkins, when it builds your code after you commit to GitHub, actually computes a code coverage table for all of your projects. So this is for our system here, and it's kinda hard to see, but here are all the directories in our system, and this over here is the percentage of the lines of code actually executed by the test cases, and which functions are actually executed. Down here they're all green, at like near 98%, because this is the test code itself, right? Obviously the test code has to execute, so that's counted. But all this other stuff up here is parts of the code that were not fully tested. So you may ask, well, what's the right number, what should I shoot for? It obviously depends, and the more the better, but I will say that, although you can't see it here, these two lines, the yellow ones, are 84% and 88%. They correspond to the concurrency control system and the execution engine, right? All this other stuff is networking, which we're not really using yet, and the Postgres code, which we're not testing fully because we're slowly getting rid of it. So I would say you kinda wanna be up in the 80s, high 80s if possible. 100% would be amazing, but that's unrealistic. So again, when you write your tests and you run this in Jenkins, or even from the command line, it'll generate this information for you, and you can see, for the code you're reviewing, what coverage they have with their tests. And again, email us, Joy and me, if you have questions about how to access this information. This is automatically done for you on Jenkins; you don't have to do anything extra. Okay? So for pairing off the groups, I tried to pair off groups that were working on parts of the code that were somewhat similar, right? So the logging team reviews the multi-threaded guys, the constraint team reviews the collection team, and so on down the list: UDFs and memcache, query planning, concurrency control, statistics, and query compilation, right? I'll post this on Piazza again; it's actually on the Google Docs spreadsheet as well. It'll tell you who you're assigned to, and you want to contact them when it comes time to do the code review and say, you know, where is it? All right, any questions? Again, it's not meant to be very taxing.
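And here's that quick example of the testing point. I believe the test suites are built on GoogleTest, so a behavior-based test looks something like the sketch below; CountValidTuples is the same hypothetical function from the earlier sketch, not real Peloton code.

    // A behavior-based test: it asserts on return values, so it keeps
    // passing even if someone rewrites the function's internals. Contrast
    // this with capturing stdout and matching a hard-coded print
    // statement, which breaks as soon as anyone rewords the message.
    #include <gtest/gtest.h>
    #include <vector>

    // Hypothetical function under test (same as the earlier sketch).
    int CountValidTuples(const std::vector<int> &tuples) {
      int valid = 0;
      for (int t : tuples) {
        if (t >= 0) valid++;
      }
      return valid;
    }

    TEST(TupleTests, CountsOnlyNonNegativeTuples) {
      std::vector<int> tuples = {1, -2, 3};
      EXPECT_EQ(2, CountValidTuples(tuples));
      EXPECT_EQ(0, CountValidTuples({}));  // edge case: empty input
    }

And on the coverage side, reports like the one Jenkins shows are typically generated with the standard gcc tooling, gcov and lcov, so you can usually produce the same numbers locally from the command line before you commit.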
So again, it's really meant to be an exercise to help improve you as a programmer, as a software developer, as a database developer, right? You'll learn from other people, you'll learn new ideas about how they code things, and, you know, it's not meant to freak you out, okay? So that's it for lectures. In the next class on Monday, I'll start off with a quick review of what will be expected on the final exam, which is on Wednesday next week. And as I said, I'll provide you guys, over Piazza, two sample questions showing the kind of questions I'm gonna ask. There will be three questions on the exam, and they're like short answers or short essays, right? So I'm not gonna ask you multiple-choice things. I'm gonna ask you to think about and combine the different topics that we talked about in the course and, you know, use some critical thinking and apply them to these questions. And then we'll spend the rest of the time in the class with a guest lecture, a tech talk, from Ankur Goyal, who is a CMU alum, but he's also, I think, employee number three or four at MemSQL, and he's now the VP of Engineering. So I've asked him to come talk about their new query compilation engine that came out in MemSQL 5.0. I think it's really interesting, and it goes well beyond what we talked about in the course. All right, any questions? We're done. All right, thanks guys.