It's been a trying semester, but we finally got through it. So for this last discussion, we're going to talk about databases running on new, emerging hardware, or non-traditional hardware that's slightly different from everything we've talked about so far. Before we get into that material, I quickly want to go over what's remaining for you between now and the end of the semester and final grades. On Wednesday this week we'll have a guest speaker from Amazon come give a talk about the things he's been working on at Redshift. This will be live on Zoom, and it's only available for CMU students, so I'll post the details on Piazza. On May 4th, next week, you'll have the second round of code review submissions. On May 5th, you'll do your final presentation, also on Zoom and also only available to CMU, to discuss what your group has worked on. The final exam that I gave out last week will be due on Wednesday, May 13th. What's missing here is the code drop, which is posted on the website as well; that's when you submit all your information to me to say you're actually done after incorporating the second round of code review comments. And then on May 16th, on Saturday, we will have our extra credit hackathon. Again, this is optional and available to those who want to participate. It's potentially also going to be open to non-CMU students, and we'll figure out how to coordinate that. The idea is that you're not going to just keep working on what you worked on for project number three; it would be something new, like adding a new SQL function or a new feature to expand the SQL support of our system. Okay? I'll post details about this on Piazza. One additional thing is that I need everyone to also fill out the course evaluations at this URL here. This is super useful for me because, you know, the last half of the semester has been online only. General comments about the projects, the reading assignments, the cadence or pace of the class, those things are very, very useful. I actually read them and take them into consideration when tweaking the class from one semester to the next. The university and the department also read these, so don't take this lightly, and don't be like most master's students often are, where they just click five, five, five for everything. If you want to spend time and give me sincere feedback, please do. It's entirely anonymous. To give you an idea of the kind of feedback I can get, we got a comment one year where people rightly pointed out that I had a body odor problem. This has since been resolved in subsequent years with a special shampoo, so hopefully no one has been offended by any odors that my body emits. But again, this is super useful: I didn't really know I had a body odor issue until this person pointed it out. Then I went to the doctor and he said, oh yeah, you have this problem, it's a medical condition, here's the special shampoo you should use. So this is why I want you to be candid and open about all your comments for the course. I do take them into consideration. Okay. As I said, the thing we want to talk about today is running databases on new hardware, hardware that's not just CPUs and SSDs and spinning disk drives.
And so this has been an ongoing theme in databases since almost the very beginning: people have always been looking to use specialized or new hardware to make database systems go faster. In the early days, in the late 1970s, there was this movement called database machines, where the idea was you would buy an appliance, a specialized server that had custom hardware ASICs to do database operations. A common one was that you could buy a database machine that had specialized hardware to do hash joins very efficiently. This movement fizzled out in the 1980s because of Moore's law: Intel and Motorola and DEC were putting out new CPUs all the time. So if you were a database machine vendor, by the time you designed, fabbed, and sold your specialized database hardware, Intel had put out a new version of x86 that got even faster. You got diminishing returns on the amount of effort it took to build these things. So in the 1990s, for the most part, everybody was running on commodity hardware, and certainly when the cloud came along in the 2000s this was even more so. In the 2000s, though, there were some early attempts to build FPGA databases, where the idea was that the FPGA would sit between the CPU and the disk controller and you would just push predicates down onto it. Netezza was a famous system that did this. IBM bought them; I think they sunsetted them or killed them off about a year ago. But they were the first FPGA database system. And there were also a bunch of appliance databases, where, unlike a database machine that had specialized hardware just for the database system, an appliance was commodity hardware, but the operating system and the database system were tuned explicitly for the hardware they were running on. So you could buy this one rack unit that had MySQL running on it, but MySQL was already tuned for that exact hardware, so it was achieving the best performance. Again, because of the cloud this sort of fizzled out, because everyone said it's just cheaper to buy commodity stuff from Amazon. In the 2010s, FPGAs have sort of always been there, and I think in recent years they've become more prevalent, but the big thing we saw in the last decade was the rise of GPU databases. Because of the big interest in using GPU computing for machine learning, people correctly identified that, oh, I can actually do some database stuff on the GPUs and take advantage of all the advancements the machine learning folks are getting. We'll talk a little bit at the end of this lecture about what these GPU databases are and what they look like. So now, in the current decade, I'm actually very excited, because I think it's gonna be the Wild West again in terms of everybody trying everything. I think there are a lot of interesting things coming out in hardware that may not be explicitly designed for database systems, but for data-intensive applications, if you wanna call it that, which includes machine learning or data science things. Databases are a key component in that kind of stack, and I think there will be some things we can start incorporating into database systems and still have that be considered commodity hardware.
So the main thing that we're gonna talk about today is persistent memory and how you design a database system to handle it. I think this is gonna be a major change in this decade. The FPGAs and GPUs will still be around, but I think they're still gonna be niche players. I don't see every database system having to have a GPU or an FPGA accelerator component; the majority of database systems are still gonna run on Intel CPUs. Going beyond FPGAs are these things called Configurable Spatial Accelerators. Think of these as like an FPGA, programmable hardware, but instead of the gate-level logic that FPGAs do, it's more of a data flow thing. And again, it's hard to predict what else is gonna come out. Fabbing costs should be going down, especially for larger transistor sizes, maybe around 70 nanometers, so people can start fabbing stuff much more cheaply than they've been able to before. The economies of scale sort of help us. So again, we're gonna focus on persistent memory today and talk a little bit about GPUs. But I think in the next 10 years a bunch more things are gonna come out, which will be pretty cool. Okay, so as I said, we want to spend most of the day talking about persistent memory. We'll talk a little bit about how to accelerate things with GPUs. And then we'll finish up talking about hardware transactional memory, because this one often comes up with students asking, is this something I could use instead of having to do all the concurrency control or latching stuff we talked about this semester? And the answer is going to be no. We still need to do everything we've talked about so far; this may help in small cases, okay? So, persistent memory. The way to think about this: in the intro class we talked about the dichotomy between volatile and non-volatile storage and how we had to design a disk-oriented database system to account for those differences. And certainly last class, when we talked about larger-than-memory databases, we needed to be aware that our database could be writing data to a non-volatile block-based storage device that's much slower, so we had to design our algorithms and our hierarchy to account for that. With non-volatile memory, or persistent memory, the idea is that we're going to get almost the speed of DRAM and have an access interface that is byte-addressable, like DRAM, but the hardware will be able to retain all our recent writes even if the power is lost. That's why it's called persistent memory. I'm going to slip up multiple times during this lecture and keep calling it non-volatile memory, because that's what we were calling it when we first started doing this research back in 2013. The industry has standardized on calling it persistent memory, which I actually agree is the better term. Sometimes you also see it called storage class memory, but they all essentially mean the same thing. The first persistent memory devices that were available, which is sort of confusing, were PCI Express cards that were block-addressable, even though the storage medium inside them was the same thing that's going to be in the persistent memory we're going to talk about here; it was just exposed through a PCI interface. But the new ones that are actually available now from Intel are byte-addressable.
So it's going to look and smell exactly like DRAM to your application, but there's some extra stuff going on underneath the covers to make sure that everything is persistent. So let's talk about how we got to where we are, because for me this backstory is actually very interesting, and it's part of the reason I spent a few years researching persistent memory and databases with my first PhD student. If you're an electrical engineer and you take a fundamental course on circuits, they'll describe three fundamental circuit elements. There's the capacitor, invented back in 1745, which is the ability to store some charge, like a battery. Then later on the resistor was invented, which resists the current flowing through your circuit. And then a few years later they developed the inductor, which stores energy in a magnetic field. So after 1831, it was just assumed that these were the three fundamental circuit elements and there couldn't be anything else. The way to think about this is that you can't build any one of these elements out of the others; each is sort of an atomic element of circuitry. Then in 1971 there was a professor at Berkeley, Leon Chua, who was working through some equations and discovered that there should be a fourth type of element, because the way the math worked out, there was a missing component of the equations, and you had to have this fourth element in order for the math to work out correctly. He hypothesized a two-terminal device where the resistance of the device depends on the voltage that's applied to it. So it's like a resistor, but the difference is that you can actually change its resistance depending on what voltage you give it, and when you turn off that voltage, it permanently remembers its last resistive state forever. And so what he hypothesized was this fourth element called the memristor. He wrote a paper about it in 1971. It was sort of lost to time because it didn't get a lot of citations; it was very mathematical, nobody understood it, and it was essentially forgotten. Flash forward now to the early 2000s, and there was this team at HP Labs that was trying to build self-configuring nanodevices. What they were finding in their experiments was that these nanodevices had certain properties they couldn't explain: in particular, when you gave them a voltage, they would change the resistance you were seeing in the circuit they were trying to build. They looked and couldn't figure out what it was, and they kept searching the literature, and then they happened to stumble upon the 1971 paper from Chua that said, oh, there's this fourth type of circuit element that could exist, we just don't know how to build it yet. And HP Labs realized they had actually ended up accidentally building a memristor, which is super interesting. Part of the reason they figured out that what they had built was the same thing Chua hypothesized is that there is a characteristic graph of the circuit that shows a hysteresis loop, and what they were measuring exactly matched what he proposed in his conjecture for what it should look like.
So then they went back, and for this paper they wrote, "How We Found the Missing Memristor," they looked at roughly the last hundred years of electrical engineering publications. And they found a bunch of other people reporting the same hysteresis loop in their experiments, but no one could explain what was going on. So people had been stumbling upon the memristor for years and years, but nobody actually knew what they were building. So HP made this big announcement that they had discovered the memristor, that this was something they were reliably able to reproduce in the lab, that they thought they could actually manufacture it, and that this was going to be a major game changer in the field of computing. So much so that in 2008 they had this big presentation, I guess at their yearly conference, that talked about their work on memristors. And you can see here, I think this came out in 2007: they discovered it in 2006 and proved it was real, in 2007 they're at this conference, and in 2008 they claimed that memristors would be development-ready. And in the near future they were claiming that memristors were going to replace DRAM and hard drives and SSDs and transistors, and everything was going to be running off memristors. That was over 10 years ago. DRAM's not gone, SSDs aren't gone, spinning disk hard drives aren't gone. So what happened? Well, HP, as far as I know, has still not produced or shipped a memristor product. HP then eventually also split between the consumer side and the enterprise side. They had this moonshot project called The Machine that was going to run entirely off of memristors; as far as I know that was canceled, and at this point I don't know whether any memristors are going to come out, at least from HP. So let's talk about the other types of persistent memory, and let's understand a little bit about what we're going to be talking about today for Intel's device, what the memristor is and what it could have been, and what some future technologies are going to look like. What I'll say, too, is that I drank the Kool-Aid from HP, although I had no affiliation with them. I thought memristors were a big deal and I was really excited, and that's sort of why I went down this path of doing persistent memory research here at Carnegie Mellon. I was always under the impression that memristors were always two years away: every time HP had a press conference, every time HP said something publicly, it was "two years away," and then you'd get two years later and they'd come out and say the same thing, and it never happened. But Intel actually shipped a device, based on the first technology here, phase change memory, which is pretty exciting. So let's go through each of these one by one. Again, this is not specific to databases; it's just so you get an idea of what's going on underneath the covers with this technology. For phase change memory, the idea is that you have this storage cell with two metal electrodes going into it, and you put a charge into this phase-change material, a chalcogenide, and that essentially bakes or cooks the material to change the resistance of the circuit. If you give it a short pulse, then that changes the cell to a zero, because that gives you a different resistance.
If you give it a longer, more gradual pulse, then that'll change it to a one. And again, I'm showing this heater here; it's not actually a little match underneath it, but underneath the covers you're giving it either a short pulse or a longer one, and that sets it to a zero or a one. The idea of phase change memory has been around for a while, people have known about it, but nobody had been able to manufacture it at scale. And the Intel Optane DC memory that we'll talk about is, to the best of my knowledge, actually phase change memory. They haven't said it publicly, at least I don't think they have, but when the devices first came out, some guy in South Korea busted open the device, looked at it under an electron microscope, and saw that it actually was doing phase change memory. There are some downsides to this: because you're actually having to put a charge in here, this obviously generates some heat, which prevents you from potentially putting it on the CPU itself. And you can only write to it so many times before it wears out. So phase change memory is here, it's fast, it exists, and you can buy it at large capacities. Compared to memristors I thought this was the inferior technology, but of course this exists and you can buy it today; you can't buy memristors. All right, so memristors. This is sort of confusing: there's the memristor as the fundamental circuit element, which technically includes phase change memory and the spintronic stuff, and then there's the HP marketing usage, where whatever they were selling they'd call a memristor. But the scientific name for what they had actually built is resistive RAM. The way this works is that you have two layers of titanium dioxide sandwiched between two layers of platinum. The platinum carries the charge, and if you run the current in one direction you set the resistive state one way, and if you run it in the other direction you flip it back. The idea is that there are electrons floating in between these two different layers, and that's how you set it to be a zero or a one. So the cool thing about memristors, again, why I was excited about them, is that titanium dioxide is a very common material. It's the same stuff that's in white paint or the sunscreen you put on your face. So it's not some obscure material you'd have to manufacture. Platinum is obviously not super common, but there's a ton of titanium dioxide. So it was gonna be super cheap and actually super high density, petabytes per square centimeter, because the current you send through this to change the state is much less than with phase change memory. The other interesting thing that's really wild about memristors, or resistive RAM, is that HP was talking about how you could use the storage medium itself for executable logic. Like an FPGA, you could load a program onto the memory, and as data came out of the memory it would flow through your logic gates and do whatever additional processing you wanted on it. Think of it as in-memory computing: I can do a scan on a column and have some executable logic gates apply the filter. And the cost of changing that executable logic on the fly was supposedly super cheap compared to an FPGA, so you could load it per query.
And so there was all this talk about how they could build neural networks out of memristors, how they could model the brain with memristors. That was about 10 years ago and I haven't heard anything about it since. The other interesting thing about the executable logic for memristors is that it wouldn't use the traditional NAND-based logic that we use in our CPUs now. It would actually use something called material implication, which comes from the great philosopher Bertrand Russell back in the 1910s. So it was a completely different way of thinking about computing if you ran on the memristor. But of course, it never happened, or at least it hasn't happened yet. All right, so the way to think about the three mediums we're talking about here: there's phase change memory, which exists now; memristors might be in the near future; and a little bit farther out will be this magnetoresistive RAM, or spintronics. And for this one, instead of changing the storage medium by recording a charge, we're gonna move electrons using magnets. The idea is that this oxide layer moves electrons between the layers, and that's how you set the bit to be a zero or a one. And supposedly this not only uses less energy, it has a smaller feature size, so you can store it at something like 10 nanometers per bit. And the speed is almost equivalent to your CPU caches, which use static RAM (SRAM). So you could replace all your L1, L2, L3 caches with spintronics, have them be super large because it's higher capacity and much cheaper to manufacture sitting on the CPU, and you'd basically have a persistent L4 with latencies less than DRAM. So this would be super amazing, right? If it existed, it would be a big game changer. For all of these, actually, I'm not sure about memristors, but for spintronics and phase change memory, prior to them manufacturing DRAM DIMM replacements, you could buy them in small form factors for things like cell phones. I think you can now get spintronic RAM in something like 16 megabyte capacities. Certainly not enough for what we need in a database system, but it does actually exist; it's just not at a large scale. So why is this for real now? There are three reasons why persistent memory is actually something we now need to consider in our database systems; the stars have aligned such that we need to be cognizant of this technology and actually consider it when we design a new system. The first is that the industry has agreed upon standard technology nomenclature and form factors for these devices. There's this thing called JEDEC, which is basically a consortium of a bunch of manufacturers. They said, okay, if we're making non-volatile memory, here's what the form factors have to be. It's sort of like DRAM: there's DDR2, 3, and 4, and that's the consortium that decides what the form factor and the spec are, and then all the manufacturers can go off and make devices that follow that specification. The next thing that happened, around 2017-2018, was that both Windows and Linux added support for persistent memory in their kernels. This is something called DAX, direct access extensions.
This allows us to write programs that use an API where the program knows it's talking to persistent memory. There are basically syscalls through which we can access this, and we have the instructions we need to actually make sure things are flushed, which is the next point. So around 2017-2018, Intel refreshed the instruction set for Xeons and added explicit instructions to do cache line flushes to persistent memory. Again, think about how you write programs now: when I do an update to a piece of memory, underneath the covers that's a store instruction updating that memory. But my write is gonna land in my CPU cache (unless I'm doing streaming writes, but ignore that). If that CPU cache is now being backed by persistent memory instead of DRAM, my program needs a way to know that the things I wrote that are sitting in CPU caches have actually made it out to persistent memory, and therefore that my write is durable. Compare this to a disk-based system: I can call fsync, and that'll move the data out of whatever OS buffers it's in and actually persist it on disk, and I don't get a return to my application until the disk controller says that my write was successful. We need the same thing for our cache lines, and that's what these instructions give us. So this was the state of the world up until 2018 or so, but then last year is when this stuff actually became available. This is what Intel is shipping now. It's called Optane DC Persistent Memory. And as you can see, it looks like DRAM: it has a DRAM form factor, meaning it fits right where DRAM goes on the motherboard, but instead of being volatile, it's non-volatile storage. Now, how this actually works is a bit complicated; it's almost like an SSD in that there's an ASIC on the device that's doing wear leveling and garbage collection and encryption and a bunch of other things. So this is more than just writing raw bits; the device is intercepting the writes and actually doing something with them. As far as I know, you can't just go to Amazon and buy this; I tried today. This is something you have to get through a manufacturer. So it's still not widely prevalent, but they are shipping it; you can get access to it today if you have the money. Price-wise, I'd have to go look. Actually, publicly, I don't think I can discuss the prices. I think it's three or four times the price of DRAM; we should just look that up. But it exists, which is awesome. Obviously over time it'll get cheaper, and there may be other manufacturers of this technology. So to me, this is a big deal. So how are we actually gonna use this? From a database perspective, there are really two ways. There's an additional way to configure the Optane device to do writes a certain way, but we can ignore that; these two are what we care about in a database system. The first is that you just have DRAM being used as a hardware-managed cache. So what does that mean? This is our persistent memory, and whatever its size is, that's what the operating system thinks it has for the total amount of memory available to the database system.
So now as I start doing writes from my database system, they'll go through the VM subsystem. The write will first land in DRAM, because that'll be fast, and then I can return back to my application and say, yeah, we got your write in memory. Eventually this will get pushed out to persistent memory, or if I do a flush, I make sure it's actually retained down there. The idea is that since DRAM is faster than persistent memory, at least as it exists today, I can have all my writes absorbed by the DRAM and not necessarily experience the slower latencies of persistent memory. Intel calls this memory mode, where again we're just using it as if it were DRAM. There's nothing we do in our database system for this setup that is aware we're writing to persistent memory; the system thinks it's just DRAM, a larger, cheaper DRAM. And that means we still have to write a log and do a bunch of extra stuff to account for durability, because again, we think we're just writing to DRAM. The other approach is that we have the DRAM adjacent to persistent memory, and now our database system is aware that we're writing to persistent memory and that it has the durability properties we want. To do this, if I have a write, I can declare that I want it to go to some region of memory in my address space that is backed by DRAM, or I can have the write go to some region of memory that's backed by persistent memory. And I know that if I write to the persistent memory and do my flushes, then my changes are durable. Intel calls this App Direct mode. The idea here is that the application, meaning our database system, is aware of where the boundaries are, that we've allocated some memory in our address space that's in persistent memory versus DRAM, and we can do flushes as needed. As I said at the beginning, these devices first arrived on PCI Express cards and were block-addressable. From a database system perspective, that's not that interesting, because we just take the disk-oriented design choices we made last semester and build a system to use it; it just looks like a faster SSD. And actually, when we did benchmarking against some high-end Samsung devices, we really didn't see a major difference in performance for the PCI Express version; you just saw way more stable latencies, less oscillation in performance. The setup that we do care about is the second one I showed on the last slide, App Direct mode, where I know when I'm writing to DRAM and when I'm writing to persistent memory, and my database system can manage that. Because now we can design our system to account for the fact that I can do byte-addressable updates to some location in memory, some data structure or some table heap, that will be guaranteed to be durable. And Intel's device handles the case where, even if I restart my application or restart my machine, I come back and I can get access to the memory I had before. It's not gonna restart the program counters for you, so it's not like you can pull the plug and come back and everything's exactly the way it was before; we still have to do some recovery work, potentially, because our process is gonna start up all over again.
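To give a feel for what App Direct access looks like from the application side, here's a minimal sketch using PMDK's libpmem. The file path and size are made up for illustration: pmem_map_file() maps a file on a DAX-mounted filesystem directly into the address space, and pmem_persist() issues the cache-line write-backs and fence so a store is known to be durable.

```cpp
// Minimal App Direct-style access using PMDK's libpmem (link with -lpmem).
// The path and size here are invented for illustration.
#include <libpmem.h>
#include <cstring>

int main() {
  size_t mapped_len;
  int is_pmem;

  // Map (creating if needed) a file on a DAX-mounted filesystem directly
  // into our address space; loads and stores now go straight to the device.
  char *heap = static_cast<char *>(
      pmem_map_file("/mnt/pmem/table_heap", 64UL * 1024 * 1024,
                    PMEM_FILE_CREATE, 0666, &mapped_len, &is_pmem));
  if (heap == nullptr) return 1;

  // A byte-addressable update is just a store into the mapped region...
  std::strcpy(heap, "hello, persistent world");

  // ...followed by an explicit flush (cache-line write-backs plus a fence
  // under the covers) so we know the write is durable.
  pmem_persist(heap, std::strlen(heap) + 1);

  pmem_unmap(heap, mapped_len);
  return 0;
}
```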
So this is a conjecture of mine; I still think it's true. We'll see how it plays out in the marketplace in the next couple of years. But I believe that when persistent memory becomes more widely available, which basically means Amazon will give you access to it on EC2, the in-memory database systems that we've been talking about this semester will be in a better position to accommodate byte-addressable persistent memory, because they've already built an entire architecture around the assumption that they can do random access to storage very quickly. They'll probably end up starting with the approach I showed before, where you just have the DRAM sit in front and persistent memory is a cheaper, larger-capacity memory pool, so you don't have to rewrite any of your application; you're just using more "DRAM." But then if you start wanting to take advantage of the persistence properties of persistent memory, you have to go re-architect your system to use a byte-addressable API, essentially all the things we've talked about this entire semester. All right, so I first wanna talk about some storage and recovery methods for persistent memory. This gets you thinking about what can change if you design your system to be aware that you're writing to byte-addressable persistent memory. This is the paper that my PhD student, Joy Arulraj, and I wrote a few years ago, looking at the basic designs you can have for a database system's storage architecture and what can change with persistent memory. Back in 2015 we did not actually have the device. What we were using at the time was a hardware emulator that Intel provided us in Hillsboro, Oregon, where they modified the motherboard and added some additional microcode and debug hooks to the Xeon, so that any time you did a load or a store, it was basically a sophisticated busy loop that would slow down those load and store operations to mimic the behavior of persistent memory when it finally came along. This work was all done in a prototype database system we were building called N-Store. This was one of the first systems we built at Carnegie Mellon when I started; it's actually what the Peloton project came out of. We started building N-Store for this paper, the project got bigger and bigger, we renamed it Peloton, and then we eventually killed Peloton and that became the Terrier database system that you guys are working on today. We threw away all the Peloton code and started over. So it went from N-Store to Peloton to Terrier; that's how we ended up here. But there's nothing in our current system that uses any of the code we wrote for N-Store, because a lot of it was specific to the Intel emulator, and it was also before Intel put out a bunch of libraries to do memory allocation and the other things you need to write persistent memory programs. Nowadays all that stuff exists in PMDK from Intel; they provide all the important constructs we had to roll on our own back then. All right, so let's understand how we're gonna do synchronization with persistent memory.
So again, the way we write programs now: when you allocate a bunch of memory and do a bunch of writes to it, you assume that it's volatile and that the writes are gonna land in your CPU caches. If the program crashes, you lose everything. But now, since we wanna be able to write data durably and have it backed by persistent memory, we need to know how that actually works at a high level. Say our pipeline looks like this: this is our process running on the CPU, and it does a store operation to update some location in memory. That store is always gonna land first in our CPU cache, because that's the fastest storage medium available to the CPU. Now, if I wanna make sure my change makes it out to persistent memory, I could just wait and hope that eventually it gets there, because what'll happen is that when my cache gets full, the CPU will write the change back out to memory and fetch in the next piece it needs to make space. But I need to know exactly when that occurs, and there's no callback mechanism for when it happens; that would be too slow. So I want to tell the CPU, hey, write this out for me. We can use a cache-line write-back instruction that pushes the change out to the memory controller. The memory controller sits on the motherboard and has a small capacitor, so it's essentially battery-backed: the capacitor is sized such that if I lose power, there's enough energy to make sure everything in it actually makes it out to persistent memory. At that point I can return control back to the program, because my write made it out to the memory controller and the controller is responsible for making sure it gets to the other side. The controller then does what's called asynchronous DRAM refresh, a platform mechanism that propagates the changes to the non-volatile medium of the PM device. So from our database system's perspective, we just need to be aware of what cache lines or memory locations we modified that we want to make durable, and we use a cache-line flush or cache-line write-back instruction to make sure that happens. The next thing we have to deal with is that if our database system restarts and we come back online, we have a bunch of pointers to tuples or other in-memory data structures; how do we make sure those pointers are still valid the second time around? Because I can't guarantee that when my program starts up again and allocates memory, I'm gonna get the same virtual memory addresses I had before. So the issue is this: I have an index with a bunch of pointers to tuples. And let's say I'm doing append-only multi-version concurrency control, so I have multiple versions of a tuple, and in my version chain one version can have a pointer to another version. Now if my system crashes, my address space gets blown away and my index gets blown away; all the pointers are now invalid. What I wanna be able to do is come back online and have all my pointers still be valid, and if I'm using raw virtual memory addresses, there's no guarantee of that. So what you essentially need is a memory allocator that is aware of these two issues: first, that you need to be able to synchronize your data out to persistent memory when you call the right instructions,
and second, that when I come back the second time, all my addresses still point to the correct locations in virtual memory. For the first one, you're just doing the cache-line flush, but then you have an SFENCE, a barrier in the instruction pipeline, to make sure those changes get flushed before you start executing the next instructions. It's sort of the same thing as fsync going to the OS and not returning until the disk controller confirms the flush. For naming, the idea is that you can declare specialized pointers that come through the memory allocator and are backed by persistent memory, such that any time you have a pointer to that memory location, those pointers will still be valid for your application when you restart the program and come back around the second time. And you don't want to write all this stuff yourself; you want to use PMDK from Intel, which provides the low-level primitives to do this. All right, so let's see now how we can use these primitives to build a database system. For this paper, what we did was look at the three basic architectures you can have for a storage manager in a database system, identify where the bottlenecks or issues are when you're running on persistent memory, and then see how we can redesign them to be aware that we can write changes that are durable. For this, we're going to assume for simplicity that there's no DRAM, that everything is in persistent memory, and therefore that if I do the flushes, I can guarantee my changes are durable. The way to think about this is that it's maybe 15 or 20 years in the future, where DRAM goes away and everything is just durable. The first choice is to do in-place updates: this is where you have a table heap plus a write-ahead log and snapshots. For our example, we're going to base our design on VoltDB. The next approach is copy-on-write updates, which is just like shadow copying or shadow paging: every time I'm going to update a page, I make a copy of it in a separate location, and then when my transaction commits, I just flip a pointer to say, here's the latest version of the database. For this one you're making extra copies of the table, but it doesn't require a write-ahead log for durability. Our design here would be representative of something like LMDB, which uses this approach. The last one is a log-structured system, where you don't have a table heap; the log is the database, and we just keep appending to the log to do fast writes. This would be something like LevelDB or RocksDB. We're going to go through each of these designs one by one, and the idea is to take the textbook implementation of each architecture, run it on persistent memory, identify where the bottlenecks or redundant updates are, and then redesign the architecture to account for persistent memory. So for the first one, we have an in-place update engine. The way we do writes is that we follow an index that lands on this tuple here, and then, to make sure everything is durable when our transaction commits, we write out a delta record of the change to the write-ahead log and then apply our change to the tuple. Then at some later point, when we take a snapshot, we make sure that our change gets persisted there as well.
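To make that write path concrete, here's a minimal sketch of what a durable in-place update looks like with these flush-and-fence primitives. The structures here (WalRecord, the fixed-size Tuple, the persist() helper) are invented for illustration; PMDK's pmem_persist() wraps roughly the same write-back-plus-fence sequence that persist() shows.

```cpp
#include <immintrin.h>  // _mm_clwb, _mm_sfence (compile with -mclwb)
#include <cstdint>

// Flush every cache line in [addr, addr+len) and then fence, so the stores
// are guaranteed to have reached the memory controller / ADR domain before
// we continue. PMDK's pmem_persist() does essentially this.
static void persist(const void *addr, size_t len) {
  const size_t kLine = 64;
  uintptr_t p = reinterpret_cast<uintptr_t>(addr) & ~(kLine - 1);
  for (; p < reinterpret_cast<uintptr_t>(addr) + len; p += kLine)
    _mm_clwb(reinterpret_cast<const void *>(p));
  _mm_sfence();
}

// Hypothetical layouts: a fixed-size tuple in the PM table heap and a
// delta record in the PM write-ahead log.
struct Tuple     { uint64_t key; uint64_t value; };
struct WalRecord { uint64_t txn_id; uint64_t key; uint64_t new_value; };

// Textbook in-place engine: write the delta to the log first, then update
// the tuple in place. (The third write, into a snapshot/checkpoint, would
// happen later in the background.)
void UpdateTuple(WalRecord *log_tail, Tuple *tuple,
                 uint64_t txn_id, uint64_t new_value) {
  *log_tail = {txn_id, tuple->key, new_value};
  persist(log_tail, sizeof(WalRecord));   // delta is durable -> can redo

  tuple->value = new_value;
  persist(tuple, sizeof(Tuple));          // in-place change is durable
}
```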
So for this one, we update the tuple logically once, it's one update query applied to this one tuple, but we end up writing it out three times: we write the tuple delta to the log, we write the actual tuple in the heap, and we write it out to any additional snapshots that we take. And all of this is running in persistent memory, so everything here is considered durable. So we have a bunch of duplicate data, because for the same update we write it three times. This approach is also going to have slow recovery, because I have to do traditional ARIES-style recovery in my database system, where I load the last successful snapshot or checkpoint and then replay the log with the analysis, redo, and undo phases. But again, everything is persistent, everything is already durable anyway, so we may not actually need to do all of that. So we can see how we'd want to design a system that accounts for the fact that we have persistent memory, and uses the fact that we have pointers that are guaranteed to be correct the second time we run the system, to record only what was changed rather than how it got changed. To do this, we still have to maintain a transient undo log in case our transaction aborts while the system is online and we have to roll things back. We also have to account for the fact that we can't guarantee the CPU won't flush dirty changes to tuples sitting in its cache out to memory on its own; that's beyond our control, we can't tell the CPU not to do it. But we know that once our data is durable in the tuple, we don't need to maintain a redo log for it: if the tuple is durable out in the table heap in persistent memory, there's nothing to redo. So it looks like this. We follow the index, everything's in persistent memory, and we get to this tuple here. Before we apply the update, we put an entry in our log that just says, here are the pointers to the tuples that are going to be modified; that's all you need to know. Then I actually apply my change, and once I flush it, I know that the change for this transaction is durable. The log entry is just there to say, by the way, here's what actually got changed, in case I need to follow pointers and undo things; but once this change, along with any other changes for the transaction, has been flushed, the transaction is considered durable. All right, so let's see how we do a copy-on-write engine. This is a hierarchical version of shadow paging; I'm just using a B+tree. The way to think about it is that we still have the master record that points to either the master copy or the shadow copy, and we do a compare-and-swap to flip that, but our pages are laid out in a hierarchy, in a tree. So let's say I want to update this tuple here. I would first make a copy of that page, and of its entry in the tree, off to the side, and then update my directory so that for the unchanged page it still points to the original, and for the new page it points over here to the copy. So that's another write into my structure. Then I do a third write to flip the master record over to the new, dirty directory.
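Here's a rough sketch of that copy-on-write flip, just to show where the three writes go. The single-level Directory, the page sizes, and the CowUpdate function are all invented for illustration; a real engine would have a multi-level tree, would flush each copy to persistent memory before flipping the master record, and would reclaim the old copies afterwards.

```cpp
#include <atomic>
#include <array>
#include <cstdint>

constexpr size_t kTuplesPerPage = 256;
struct Tuple { uint64_t key; uint64_t value; };
struct Page  { std::array<Tuple, kTuplesPerPage> tuples; };

// A single-level directory of page pointers; real systems use a tree.
struct Directory { std::array<Page *, 1024> pages; };

// The master record: which directory is the current, committed database.
std::atomic<Directory *> master;

// Copy-on-write update: copy the page, copy the directory, flip the master.
void CowUpdate(size_t page_no, size_t slot, uint64_t new_value) {
  Directory *current = master.load();

  Page *shadow_page = new Page(*current->pages[page_no]);  // write #1: whole-page copy
  shadow_page->tuples[slot].value = new_value;

  Directory *shadow_dir = new Directory(*current);         // write #2: directory copy
  shadow_dir->pages[page_no] = shadow_page;

  // write #3: atomically flip the master record to the new (dirty) directory.
  // (On persistent memory you would flush shadow_page and shadow_dir first.)
  master.compare_exchange_strong(current, shadow_dir);
}
```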
So the first issue with this is that these copies are expensive: even though we only update one tuple inside the page, we have to make a copy of the entire page, update the leaf information, and update the directory just to retain this one change. What we can do, because we can treat persistent memory like DRAM, is move from a page-oriented architecture to a byte-addressable architecture like we have now, where we just have pointers to tuples. Now when I do an update, I only have to copy pointers over, apply the change, and then update the dirty directory and the master record. The key difference is that the granularity of the change is much smaller, because we can read and write byte-addressable locations in memory. The last one is the log-structured architecture. The classic design here is that you have an in-memory MemTable with a write-ahead log, and a skip list or some other small data structure to keep track of what log entries are in memory. Then on disk you have a bunch of SSTables, each with a bloom filter in front of it and an index that points to locations in the log. When I want to do an update, I first apply the delta of my change to the write-ahead log, and then at some later point I flush that out to disk. And if I'm doing compaction, I'm going to have write amplification, because I keep combining log records over and over again if they have to be retained. So the issues here are that we have duplicate data and, if we're using a leveled architecture, these expensive compactions. If we switch this to persistent memory, we can get rid of the SSTables entirely, because now all of this is persistent: we can just have the write-ahead log and our in-memory data structure over it, and we don't need the rest. We still have to do compaction, I think that doesn't go away, but we don't have to have this split between a MemTable and SSTables, which are different layouts of the same data. Now, all the examples we just looked at assumed that the computer we were running on had only persistent memory and no DRAM, and that's not really gonna be realistic for a while. So let's think about how we would actually design a system today using what is available from Intel now. Let's target a way to speed up performance and take advantage of persistent memory by focusing on how to take a standard table heap plus write-ahead log implementation and speed that up. So what is the write-ahead log doing for us? Well, for either an in-memory system or a disk-oriented system, the idea is that we're trying to avoid random writes to disk by replacing them with sequential log writes. For an in-memory database, we only do sequential log writes, because the table heap writes just go to memory; for a disk-oriented system, we write sequentially to the log and then eventually flush out dirty pages in the background. So that's one advantage we get. The other is transaction capabilities: if there are changes sitting on disk from a transaction that did not commit before a crash or shutdown, we can use the log to roll them back and reverse their changes.
So this design of a write-ahead log, writing to the log first before we write to the tuple, makes sense because the log write is going to be sequential. In a persistent memory world, though, we're gonna have fast random writes: byte addressability implies that we can jump to any location in the persistent memory space and get an access speed that's almost equivalent to a sequential access. So the huge dichotomy in performance between a sequential write and a random write that we had on a spinning disk hard drive, or even an SSD, is going to be much smaller on a persistent memory system. So we want to design a logging protocol that can take advantage of this. The way we're gonna do it is to maintain a multi-version database, doing copy-on-write or otherwise making sure we don't overwrite existing versions, and then have the log contain just metadata about what was committed rather than actual copies of the changes that were made. This is the technique that my PhD student developed here at Carnegie Mellon called write-behind logging. The idea is that it's a logging protocol designed specifically for persistent memory, but in a world that still has DRAM for the table heap. With it we can get near-instant recovery of the database after a crash, with a minimal amount of redundant information stored in the log. The way we're gonna do this is that we only have one copy of the database in persistent memory. We make sure that we flush changes to that database, and our log only contains pointers to the records that got changed. After a crash, all we need to do is look in the log to figure out what transactions were running at the time of the crash or shutdown. We have pointers to the tuples they modified, and we keep track of the fact that the updates those transactions made did not commit, so we know to reverse those changes if anybody tries to access them. Another way to think about this: unlike with a write-ahead log, we don't need any redo information, because the changes our transactions made are made durable in persistent memory right away and we never have to worry about re-applying them from the log. Now, in the context of persistent memory this protocol is new, but there was one other system we're aware of that did something sort of similar, and this was brought to our attention by the great Phil Bernstein, sort of the godfather of modern concurrency control. He told us about a database for, I think it was a Puerto Rican telephone company, back in the 1970s. At the time Puerto Rico had bad power infrastructure and they would lose power several times during the day. So they had to have a database that could come back instantaneously after a power loss, because if you're having these abrupt shutdowns multiple times a day and your recovery time is super long, then by the time you've recovered the database the power might get shut off again, and you're never able to keep up.
So they had a database that did something like this. Obviously it wasn't with modern persistent memory, but it was making copies of the database and paying the penalty of random writes to disk in exchange for faster recovery times. Most modern systems don't make that choice, but in our case here, with write-behind logging, we'll be able to get good performance at runtime and good recovery time. Conceptually, our setup is like this: we have a table heap, and say we wanna run this query that's gonna update a tuple. The table heap hangs out in memory in DRAM, there's a second copy on persistent memory, and then we have our log on persistent memory as well. Now when we do an update, we first update the tuple in the table heap in memory, then we write the change out to NVM, and then we also write some metadata to the log that says, by the way, here are the pointers to the tuples in persistent memory that changed, so that if we crash before the transaction commits, we can figure out which changes should not be visible when the system restarts. The way this all works is that we rely on multi-versioning and assign transactions timestamps just as we normally would. When we go to flush out changes for transactions, we figure out the range of in-flight transactions that are running right now and record that in our write-behind log, along with pointers to the tuples they're modifying. That tells us the potential range of transactions whose changes should not be visible after a crash. Now, after a crash, when we come back, we use this failed group-commit range to identify which tuple versions are not valid. We don't have to look at individual timestamps; we just ask, does the tuple I'm looking at have a timestamp that falls in this range? If so, I know it comes from a failed transaction and I can ignore it. So what you're essentially getting is the undo operation for free, as part of the visibility checks you're normally doing for multi-version concurrency control. It'll make more sense in the next couple of slides. So when I recover, I only have an analysis phase. The analysis phase looks through the write-behind log and says, here are the timestamp ranges of the transactions that didn't commit successfully. Then I immediately start processing transactions, but now I've computed this global range that says: if you come across a version that was created or modified by a transaction in this range, ignore it and reclaim the space. So you're doing a sort of cooperative garbage collection as you go along, identifying tuples that shouldn't exist. So let's look at the example. This is our timeline going forward in time, and we're gonna keep track of when transactions start and when they commit. T1 starts here, so the current range of active transactions is between T1 and T2, and there are no failed transactions because it's the first time we've turned the system on. Now let's say that before T1 commits, we crash. Then when we come back, say the next transaction that starts is T2.
So the only thing we needed to do after the crash is scan the log, where we find an entry saying the last range we knew about from the last group commit was between T1 and T2. Now the current range of active transactions is T2 to T3. As T2 runs and accesses the database, if it comes across anything that was created by T1, it knows it should be garbage collected and cleaned up, so it goes ahead and removes those things. Say T2 commits and then T3 starts; same thing, our active range is T3 to T4, and since we know the pointers to the tuples that T1 modified, we know whether we've cleaned up everything T1 touched yet. Let's say in this case we don't have any transaction that happens to touch what T1 modified. Then we can run the background vacuum in a separate thread to scan through, find all those versions, and remove them, so that we're not stuck with versions that nobody is accessing and that are just wasting space. Then T4 starts, and at this point we know that we've gotten everything T1 modified, so we can remove it from our list of failed ranges. So now I'm gonna show you the performance of write-behind logging, but I'm gonna do it in the opposite order of how we normally discuss these things: I'm gonna show you the recovery time first, whereas normally you show runtime performance before recovery. This is replaying a write-behind log versus a write-ahead log of one million TPC-C transactions, running on that Intel emulator I mentioned where you can tune the latency of persistent memory relative to DRAM; in this case we set the emulated latency to about 2x DRAM. The latency here was also symmetric: sequential and random accesses had the same cost, and reads and writes had the same cost, whereas on real hardware the reads are gonna be faster than the writes. We're comparing the time it takes to restart the system, replay the log, and put the database back into a correct state, for a spinning disk hard drive, a solid-state flash drive, and the persistent memory emulator. What you see is that with a write-ahead log, the recovery times for the different devices are roughly the same (this is on a log scale), because it's the cost of reading from the device plus replaying the log, and the combination of the two comes out about the same. The difference, though, is that with write-behind logging the recovery time is 1000x faster, because all you do when you turn the system on is fetch the failed-transaction timestamp ranges from the write-behind log, and that's all you need to do to say the database has been fully recovered, at least at a logical level. So what this is showing you is that the recovery-time benefit of write-behind logging is about the same, roughly 1000x, for persistent memory as well as for the older storage devices. So you look at this and say, well, sure, write-behind logging recovers faster than write-ahead logging, but why do I need persistent memory for this, since I'm getting amazing recovery times on these other storage devices as well?
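Here's a rough sketch of the two sides of that protocol: what a group commit writes into the write-behind log, and how the visibility check uses the failed commit-timestamp ranges after a crash. The structures (WblEntry, the version header, the range comparison) are invented for illustration and skip the details of the actual write-behind logging design, such as how the ranges are eventually garbage collected.

```cpp
#include <cstdint>
#include <vector>

// One entry per group commit in the write-behind log: no tuple images,
// just the timestamp range of transactions that were still in flight.
struct WblEntry {
  uint64_t persisted_upto;  // all commits below this are durable in the PM heap
  uint64_t dirty_upto;      // commits in [persisted_upto, dirty_upto) may be partial
};

// Per-version header in the multi-versioned table heap (already in PM).
struct VersionHeader { uint64_t begin_ts; uint64_t end_ts; };

// Ranges recovered by the analysis phase: any version whose begin timestamp
// falls inside one of these ranges came from a failed group commit.
std::vector<WblEntry> failed_ranges;

// MVCC visibility check, extended with the write-behind "free undo":
// versions created inside a failed range are ignored as if they never existed.
bool IsVisible(const VersionHeader &v, uint64_t reader_ts) {
  for (const auto &r : failed_ranges)
    if (v.begin_ts >= r.persisted_upto && v.begin_ts < r.dirty_upto)
      return false;  // created by a transaction that never committed
  return v.begin_ts <= reader_ts && reader_ts < v.end_ts;
}
```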
So now you might look at this and say, well, sure, write-behind logging recovers faster than write-ahead logging, but why do I need persistent memory if I'm getting amazing recovery time on these other storage devices as well? Well, if we go look at the runtime performance, now you see a big difference. For the runtime performance of the write-ahead log, you see that, of course, the faster the storage device, the better the performance: persistent memory is faster than the solid-state drive or the spinning-disk drive, and the device is the bottleneck when you commit transactions, at least in this case for TPC-C. That's why persistent memory gives the best performance for write-ahead logging. But if you do write-behind logging, now you see why you need persistent memory: with the spinning-disk hard drive and the solid-state drive you actually get about a 10x reduction in runtime performance when you use write-behind logging, whereas with persistent memory you get about a 1.2x speedup. This is because write-behind logging relies on fast random writes to the table heap: that random I/O is cheap on persistent memory but terrible on these slower devices. So the combination of write-behind logging and persistent memory is the right combination to get the best performance in this scenario. Okay, so just to summarize what we talked about: if you know you have persistent memory that's byte-addressable, you can reorganize your storage architecture to take advantage of that and reduce the amount of data duplication and redundant copies of data during update transactions. And then we saw that write-behind logging is a technique that gives you much better recovery time, because you're taking advantage of the fact that you have a protocol writing to persistent memory, so you can set the system up so that upon recovery you have to do a minimal amount of work to put the database back into a correct state. As I said, I think this persistent memory hardware is available now, and once it becomes more prevalent and more commoditized, you're going to see a lot of database systems coming out to take advantage of it. In the very beginning they're all just going to use it as larger, cheaper memory that's a little bit slower, and some systems will be more sensitive to that than others, but if the prognostications about the limitations of scaling DRAM turn out to be true, then everyone could be switching over to something that looks like an Optane DIMM. Okay, so now I want to finish up by talking about computational acceleration we can do in our databases, and specifically I want to talk about GPUs. As for FPGAs, they're becoming a little more common now because you can get EC2 instances with them on Amazon, but as far as I know there are more GPU databases around today than FPGA databases. I think that's partly because of the engineering cost of writing FPGA code versus writing GPU code or just using a GPU library like the stuff NVIDIA provides: the barrier to entry for taking advantage of non-traditional hardware in your database system is lower with GPUs than with FPGAs. Okay, so if we want to use a GPU in our database, what do we need to know? Well, we need to know what GPUs are good for. GPUs contain thousands of cores, but the type of computation or the programs that those cores can execute have to be less sophisticated and less complex than what you would normally run on a full-fledged Xeon core.
This is because these cores are designed to do relatively simple operations, relative to what a Xeon can support, that are repetitive over large amounts of data or data streams. The best-case scenario is something like a sequential scan on a bunch of columns, or a single column broken up into chunks, that you can blast out across all the cores, with every core doing the same thing: no indirection, no conditional branches, just applying the same filter from beginning to end, over and over again. So the kinds of things we want to push down to the GPU are things that don't require additional input to make decisions about what to do next, and don't require the program to do branches like if clauses (there's a little sketch of this below). What's really good for it: sequential scans. What's really bad for it: B+trees, because that means looking at the contents of a B+tree node and making a decision about where to go next. Now, there are proposals in the research literature for B+trees and other tree-based data structures that you can run on GPUs, but to the best of my knowledge nobody's actually using them yet. And again, this is not my area of research, so I don't know what their limitations are, but for the most part people aren't doing transactions on GPUs; they're primarily being used for OLAP workloads. The other important thing to be mindful of is that although GPUs have a lot of memory now, it's not cache-coherent with the CPU. That means that if your database is being updated and it lives up in DRAM or even on SSD, you either need to copy the whole thing down to the GPU with all the updates, or do a merge operation to apply those changes incrementally. It's not like the GPU will magically see a change I make to data in DRAM; we have to explicitly send a message down to say, here's the new data. So again, the idea is that we want to figure out what computation we can offload to the GPU, and it's mostly going to be sequential scans. There are GPU implementations of pretty much all the relational algebra operators we'd want to execute for in-memory queries, but sequential scans are the sweet spot; there are also hashing implementations and sorting algorithms that are designed to avoid these conditionals and branches. So the high-level architecture would look like this. Say this is what we've been talking about the entire semester so far: we have a CPU, or a multi-socket system, with our database hanging out in DRAM, and we can be NUMA-aware to recognize which DIMMs are closer to a given core. Over here on the PCI Express bus are our GPUs. We can think of them as just another socket that has way more cores that look a lot different, and they have their own memory as well, which is not going to be kept in sync. For GPUs in 2020, I think you can get maybe up to 100 gigabytes of memory on the high-end ones, whereas on the CPU side we can get, I think, up to 48 terabytes of DRAM, if we have a lot of money.
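Coming back to that point about branch-free scans, here is a tiny sketch of the shape of work that maps well onto those cores. It's plain C++ just to show the idea (on a GPU each thread or thread block would own a slice of the column), and the predicate is folded into arithmetic rather than an if-branch, so every element gets identical straight-line work.

```cpp
// Branch-free selection: build a selection vector of qualifying row offsets
// without any data-dependent branch in the inner loop.
#include <cstdint>
#include <vector>

std::vector<uint32_t> filter_gt(const std::vector<int32_t>& col, int32_t threshold) {
  std::vector<uint32_t> sel(col.size());
  size_t out = 0;
  for (size_t i = 0; i < col.size(); i++) {
    bool match = col[i] > threshold;       // predication: 0 or 1, no branch
    sel[out] = static_cast<uint32_t>(i);   // always write the candidate offset
    out += match;                          // only advance when the predicate holds
  }
  sel.resize(out);
  return sel;                              // offsets of the qualifying tuples
}
```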
The other thing we need to be mindful of is the bandwidth between our compute and our storage. To go between DRAM and the CPU core over DDR4, we can do about 40 gigabytes per second. Over the PCIe bus, the best I think we can do now is about 16 gigabytes per second. So it's not that far off, it's not an order of magnitude, but it's still significantly slower than what we can do on the CPU side. That means one of the challenges we're going to face is that if, to run a query, we have to send a bunch of data down to the GPU, crunch it, and get the result back, it may just be faster to run it up on the CPU: we won't have as many cores, but this bandwidth is going to be our bottleneck. Now, NVIDIA has this thing called NVLink that gives you 25 gigabytes per second between two different GPUs for message passing between them, and you can also get NVLink to go from the GPU's memory up to the CPU's memory at 25 gigabytes per second. But as far as I know, that NVLink technology is only available on PowerPC machines; I don't think it's on x86, so you have to run on IBM POWER to take advantage of it. I think Intel has its own fabric, but I forget what it's called, and AMD might have something as well. Okay, so how would we organize our system? There are three different approaches we can take to using a GPU in our database system. The first and easiest is to take our entire database and copy it down over the PCIe bus onto the GPU, so that all queries only touch data that's already on the GPU. This is obviously limited by the amount of VRAM available on the GPU, and if your database exceeds that size, it won't work. Maybe you could daisy-chain a bunch of GPUs together and have the CPU coordinate who has what data and combine the results across the different GPUs, but to the best of my knowledge none of the major vendors do this anymore. This is actually what OmniSci used to do when it was called MapD, and they've since re-architected it to do the third approach here. The second approach is to recognize that for some queries or some databases, you may not need all the columns of a table down on the GPU, so you only copy down some of them. Now your query planner can recognize: all right, this part of the query can run on the GPU because those columns are resident there; then I'll get back some offsets, copy the values needed for those tuples from the other columns that are up in CPU memory, materialize the results, and return them to the application. Where I've seen this done, it usually requires the administrator to identify which columns should be GPU-resident and which should stay up on the CPU, and that has limitations because people may not always know how to pick correctly. There may be systems now that can figure that out for you automatically. The best approach, though, is to support streaming algorithms, where I move data from CPU memory down to GPU memory on the fly and process it incrementally while I continue to send down more: I send down the first batch of data, the GPU fires up and starts crunching on it, and in the background I start streaming down the next wave of data that I'm going to need. By the time the GPU finishes processing the first batch, the second batch is ready to go, and it just keeps going.
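Here is a rough sketch of that third, streaming approach. The helper functions (copy_to_gpu_async, run_scan_kernel, wait_for) are hypothetical stand-ins for whatever the vendor API provides (for example, CUDA streams with asynchronous copies and kernel launches); the point is just the double-buffering, shipping batch i+1 over PCIe while the GPU is still crunching batch i.

```cpp
// Sketch of double-buffered streaming from CPU memory to the GPU.
// All GPU-facing calls are stand-ins, not a real vendor API.
#include <cstddef>
#include <vector>

struct Batch    { const char* data; size_t size; };
struct GpuBuffer { /* device-resident staging buffer */ };
struct Handle    { /* async operation handle */ };

Handle copy_to_gpu_async(const Batch&, GpuBuffer&) { return {}; }  // stand-in
Handle run_scan_kernel(const GpuBuffer&)           { return {}; }  // stand-in
void   wait_for(Handle)                            {}              // stand-in

void stream_scan(const std::vector<Batch>& batches) {
  if (batches.empty()) return;
  GpuBuffer buf[2];                                     // two staging buffers
  Handle copy = copy_to_gpu_async(batches[0], buf[0]);  // prime the pipeline
  for (size_t i = 0; i < batches.size(); i++) {
    wait_for(copy);                                     // batch i is now on the GPU
    if (i + 1 < batches.size())                         // start shipping batch i+1
      copy = copy_to_gpu_async(batches[i + 1], buf[(i + 1) % 2]);
    wait_for(run_scan_kernel(buf[i % 2]));              // crunch batch i meanwhile
  }
}
```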
That way the GPU is always being fully utilized. And there are hash join and sort-merge join implementations, a bunch of different algorithms that can do that for you. So hardware-accelerated databases were something I was very interested in a few years ago, and we ended up having a seminar series at CMU on this very topic. I invited most of these vendors: OmniSci, back when they were called MapD, Kinetica, BlazingDB, SQream, Brytlyt, and AresDB from Uber, which came out after we had the seminar series. The point I'm trying to make is that if you're really interested in this topic and want to learn more, you can go to this URL here; there are, I think, seven or eight different lectures or talks from all these major vendors that are building GPU databases, and they'll tell you what makes them interesting and how they work. The last thing I want to talk about is hardware transactional memory. Every year I debate whether I want to bring this up, because it keeps getting turned off by Intel because of security leaks. The way to think about hardware transactional memory is this: you have a critical section in your code, and now you can have a hardware transaction, managed by the CPU, that keeps track of the load and store operations to memory for your transaction. If it determines that another thread, also running a transaction, read or modified the same things I touched, it'll go ahead and abort me and restart me. The way it works is that it basically operates like OCC: it maintains a read-write set in a private workspace for your transaction, and when you go to commit, it does a validation to see whether anybody else has modified the same locations that you read or wrote. And it's kind of cool, because it just piggybacks off the cache coherence protocol the CPU is using anyway to keep the cores in sync, and uses that to figure out when there are conflicts between transactions running on different cores. This was actually invented by Maurice Herlihy, the guy who invented linearizability. He used to be a professor here at CMU, but now he's a professor at Brown. He invented this, I think, in the early 1990s, and Intel eventually put it in their hardware: they announced it in 2012 and it came out in 2013. But then they found a bug in it in 2014, so they disabled it. Then around 2017 they said, all right, here are the new CPUs, the bug is fixed, go ahead and re-enable it. And then in 2019 there was another bug, I think it's called ZombieLoad or something like that, that can cause security leaks when you use this. So this is what I'm saying: I would kind of want to teach this and show it and have you guys use it, but it's unclear whether you could actually use it safely today on modern Intel CPUs. I don't know if AMD has similar issues. So, as I already said, the way it works is that the hardware keeps track of the read-write set, and that read-write set has to fit in your L1 cache. So you wouldn't be able to use this to replace the transaction machinery we talked about for concurrency control, right?
Because a lot of the time your transaction's read-write set will be larger than L1, and certainly if you have multiple threads running at the same time, you'd be thrashing L1 and you'd have problems with this. The other reason we might want to use this is that it's not just for performance: it's also useful from a software engineering standpoint, because instead of doing all the latching and crabbing stuff that we talked about before, you could use this as an alternative and maybe get the same performance as software-managed latches or software-managed transactions with lower engineering overhead. That's purely conjecture, I don't know whether it's actually true, but that's what the proponents of this technique have argued. So let's see how you would actually use this. With hardware transactional memory, there are two programming models you can use. The first is called Hardware Lock Elision. The way this works is that I start my transaction, and any time I do a write during my transaction, I don't actually do it; it's sort of a Jedi mind trick where I trick myself into thinking that I did. So if I write to a memory location, the write hangs out in my private workspace, and if other threads try to read that memory location, they won't see my write. Then when I go to commit, the hardware checks whether there's a conflict with other transactions or threads. If there wasn't, I can go ahead and apply my changes from the private workspace to global memory. If there was a conflict, the hardware rolls me back to the starting point of my transaction, almost like a stored procedure: you roll it back to its beginning and re-execute it. But when I execute it the second time, I actually take explicit locks to protect the memory regions I'm writing to, so I'm guaranteed to run without conflicts. So it's optimistic first, and if I conflict with somebody else, it gets restarted with pessimistic locking. The other, more complicated approach is RTM, or Restricted Transactional Memory. With this one, like hardware lock elision, I run the first time without taking locks, but if there's a conflict, instead of going back and running the transaction again while taking explicit locks, you provide a pointer to another location in the code to jump to that does something different from the regular code path. So you still abort the transaction and roll it back, but instead of jumping back to the starting point and running it again, you jump to some other location in the program that can do something slightly different. This requires more engineering effort on our side as the system developer, because we have to be mindful that we're jumping to another location and have an alternative implementation of the critical section we're trying to protect.
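To make the RTM model concrete, here is a minimal sketch using Intel's TSX intrinsics from <immintrin.h> (typically compiled with -mrtm), under the assumption that TSX is actually enabled on your CPU. The critical section is a stand-in for something like the latch acquisitions along a B+tree path in the next example; the key pieces are _xbegin/_xend for the hardware transaction and the explicit fallback path you end up in after an abort.

```cpp
// Minimal RTM sketch: try the critical section as a hardware transaction,
// fall back to an explicit spinlock if it aborts. Not production code.
#include <immintrin.h>
#include <atomic>

std::atomic<bool> fallback_lock{false};

void critical_section() { /* e.g., install the new key, fix sibling pointers */ }

void execute_with_rtm() {
  unsigned status = _xbegin();               // start a hardware transaction
  if (status == _XBEGIN_STARTED) {
    if (fallback_lock.load(std::memory_order_relaxed))
      _xabort(0xff);                         // a lock holder is active: bail out
    critical_section();                      // conflicts are tracked through the
    _xend();                                 // cache-coherence protocol; commit
    return;
  }
  // Abort path: jump to alternative code, here a plain spinlock version.
  while (fallback_lock.exchange(true, std::memory_order_acquire)) { /* spin */ }
  critical_section();
  fallback_lock.store(false, std::memory_order_release);
}
```

A real implementation would usually retry the transaction a few times before giving up and taking the fallback lock; reading the lock inside the transaction puts it in the read set, so anyone grabbing the lock will abort the optimistic path.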
So let's look at an example of how we could use it. Say we have a B+tree and we want to insert key 25. If we're doing the latch crabbing we talked about before, the optimistic latch crabbing, I would take read latches as I traverse down until I recognize the node I want to modify, and then I take a write latch on it, the X for exclusive. Once I have that, I can go ahead and apply my change. But if I'm doing this with hardware transactional memory, my program would look like this: I'd have the boundary from where my transaction starts to where I commit, and the critical section is just the crabbing portion, where I traverse down the tree until I get to this node F and take the write latch on F. From the outside it looks like I magically warped down to F: the hardware automatically detects whether somebody else took a latch at the same time I did, so it's as if I took the correct latches all the way down, and I'm guaranteeing the integrity of the data structure in terms of what the pointers are pointing to, but I didn't actually have to apply those latch writes to memory. All of that got elided, and I can jump down and get exactly what I want. Okay, so to finish up. As I said, we spent most of today talking about persistent memory, because it's my opinion that it's out now, and that when it becomes more widespread it's going to be a major change in how we build software, and especially database systems. I could foresee that if persistent memory takes off as much as I think it should and will, then it may be the case that in the introductory database class we don't spend time talking about things like buffer pools and how to do write-ahead logging while maximizing sequential I/O and things like that. Memory is just persistent and I can write to it; I make sure I flush it, and I have to order my writes a little bit, but I don't have to do all that page latching and all the crap we did in the introductory class. It's also my conjecture, as I said earlier, that in-memory databases are in a better position to take advantage of persistent memory, because they're already written assuming they can talk to byte-addressable memory and don't have to deal with pages. So I think the cost of converting from a DRAM in-memory system to a persistent memory system will be lower for in-memory databases. I also think that GPUs have been around for a while and FPGAs have been around for a while, but like I said, most database systems are not being written assuming you have that hardware; in my opinion it's still sort of a niche market. What could happen, though, is that beyond GPUs and FPGAs, you may start seeing additional computational devices, like configurable spatial accelerators, something that looks like a TPU, or some kind of custom ASIC, that could make a big difference in the performance of databases. The one catch is that they would have to overlap pretty heavily with machine learning or data science applications, which need more than just sequential scans (though they do a lot of those too); if there's a way for a database system to take advantage of them, I think that would be interesting. On the other side, I also think matrix databases could become more important in the next decade, and in that case, the accelerators out there that are doing machine learning computations on matrices are things those databases could easily take advantage of.
But the important thing is that the core ideas we talked about this semester, and certainly many of the algorithms, stay the same: doing a sequential scan, evaluating a predicate, codegen-ing that, or traversing a tree, all of that is pretty much still going to be the same. So we just have to think about taking the knowledge we have about how to build a database system the way we talked about in this class and applying it to this new hardware, and I think you have the background to do that now. Okay, so this is the last lecture. On Wednesday we're having the guest speaker from Amazon again, but that will be closed off to CMU students only. So if you're watching this from outside of CMU, I hope you're safe, I hope everything's okay, and I hope you're watching this well after we've passed the pandemic. If you made it through the entire semester watching the YouTube videos, congrats. So what should you be able to do now? Well, after going through 25 or 26 lectures, you should now be able to understand, comprehend, and reason about the major topics we talked about and how to build a modern single-node database system. I'm qualifying that with single-node because going distributed brings up a whole other bunch of issues with networking and distributed transactions that we haven't talked about, but at a high level there are many of the same issues we covered here: we care about placement, we care about partitioning, we care about join algorithms, we care about transactions, all in that environment; it's just that the communication is slower and less reliable, so we have to account for that. I also hope that you now have the foundation to reason about the claims that people make about their database management systems, and that you'll be able to figure out whether those claims are real, or implausible, or just a bunch of marketing hype. The database market has a lot of money, a lot of startups, and a lot of big companies who want to make money, and sometimes they'll say things that, unless you're paying attention, make you go, oh, that seems kind of cool. But using the knowledge you gained by going through this course, you should be able to look at a claim and ask, is this real or not, right? So I don't mean to pick on these guys, but let me give one example that came out last week. There's this new startup called TerminusDB, and they posted on Hacker News: hey, we've come out, we're around, this is what we do. It's a graph database, and they're comparing themselves against a bunch of other systems in the same space, but they had this little blurb in their feature list saying they do AI codegen, and rightfully you see this and go, well, what does that mean? Sure enough, somebody posted on Hacker News saying, hey, look, AI codegen, what the f*** is that? What are you actually claiming? And some of their developers responded, oh yeah, this is incorrect, this is marketing gone wrong. So if you're just a casual person looking at this, you go, AI codegen, well, that sounds kind of cool, what could that be?
But for you guys who have gone through this course, you know what code generation is, and you know there's no real AI aspect to it: you're taking a query plan that the optimizer spits out, and there's only going to be one way to execute it in terms of how you generate the code for it, so there's nothing AI about it. So again, this is just something I'm hoping you can now do on your own: see the things people are saying and use the background you've gotten from this course to decide whether they're plausible. Okay, so again, next class we'll have the guest speaker from Amazon, and then for everyone else, take care.