This is the last lecture for the class, so this lecture is a bit more — I don't want to say abstract, but it's sort of me just dumping a lot of information on you guys about new hardware. And the idea here is that if you're building a system from scratch today, you may not actually want to use anything I'm talking about here, but going forward — a year from now, depending on how some of these things develop — this is something you may want to explore, right? So, administrative things real quick. As a reminder, on Wednesday there are three things happening. We have our guests from Snowflake, who'll be giving a guest lecture about their database system in class. I'll also be handing out the final exam. The final exam will be due on the day of the final presentations, May 14th, so you have basically twelve days to do it. And then also at midnight you'll have to do your second code review. So the way you should do this is you should just do an update to your pull request on GitHub. You can close the other one if you want, but ideally I want to have all the comments in there. Actually, it might make sense to do a new one — yeah, post a new one, and that way I can see the old comments and the new ones. Yes? [Student: Didn't you say you were going to push it back to the fourth?] Yeah, correct, yes — ignore that, it's the fourth. Thank you. Whatever it says on the website — the website is what's real. And then the final presentations will be on May 14th. Another thing: I looked up the room — it's in the other building. As far as I can tell, no one is using that room at 8:30 in the morning on that Monday, so let's just use that. Because in Doherty Hall it's the huge room we used to teach the intro database class in, and this one is a bit more intimate, so I'd rather do this. Is that a problem for anyone? I don't think so.
So same time — and I'll send a reminder out to everyone when we get closer to it. And then the other major thing is that you all should have gotten emails from the university about faculty course evaluations. I actually do read these carefully, and I try to fix things in the course from year to year. So please, please, please — I'll remind you multiple times — please actually fill this out. I don't care if you badmouth me; you can say whatever you want, and I actually will take it into consideration and try to make the course better. So last year — I won't say names — there was somebody who was dating someone who's taking the course now. Apparently they had a bad breakup near the end of the semester, and they knew the other person was taking this class. So they said Andy should give midterms in the class, and that's why you all had a midterm plus the final, right? Another one that was actually super useful was the person telling me how terrible I smelled. And frankly, he was right. I had this weird bad body odor problem, and as far as I can tell, since I started showering every day, this is no longer an issue. So again, if you guys provide me useful feedback in the faculty course evaluations, I will actually try to rectify things and fix things, okay? Any questions about that? What's that? Because I'm teaching the class, don't write your evaluation about Joy — he's not here. Write about me. Apparently the department also reads these too. All right, so today's class, as I said in the beginning, is really about exploring what other types of hardware out there can be applied to our database system. We've been focusing on the classic von Neumann architecture: you have a CPU, the CPU has caches, then there's main memory, and then your program that you load into the address space.
So we want to look to see whether there's new hardware available to us that may not fit this exact model, that we could use to accelerate and speed up our database system. This is not a new idea by any means, although the hardware is new — people have been thinking about this for a long, long time. Going back to the late 1970s and early 1980s, there was a class of database systems called database machines. Think of these as specialized hardware that was custom-built for running a database system. So this could be things like computational accelerators to do hash joins very efficiently, or custom hardware embedded in the storage devices so you could do predicate pushdown: instead of streaming all the data off the disk and bringing it into the CPU, you could apply your filter right then and there to reduce the amount of data you had to look at. But these fell by the wayside and never became very prominent, just because commodity hardware was getting better so quickly. By the time you actually fabbed out a database machine, Intel had a new release or IBM had a new release, and whatever benefits you were getting from your custom hardware were overcome by improvements in the commodity hardware. So nobody really makes database machines anymore. Although you could argue that, in some ways, some of Oracle's offerings like Exadata or RAC are custom hardware appliances — you could argue those are database machines, but nobody really uses the term anymore. In the 1990s, as far as I can tell, nobody was really using custom hardware. The 2000s is when we saw two movements. There was the adoption of FPGAs to, again, do custom filtering or custom database functionality in more programmable hardware, pushing things down as close to the storage as possible.
And then you also saw the movement towards what are called database appliances. Think of these as commodity hardware, but with the database system pre-loaded and tuned specifically to that hardware. So you could buy, say, a single one-rack unit that would have Oracle installed, and you wouldn't have to tune any of the OS or kernel parameters — everything would be set up for you. These also fell by the wayside, since with Amazon, everybody now has to be able to run on EC2. And then in the 2010s, I think the two major movements we're seeing are these: FPGAs are still interesting — people are still applying them to speed things up — but now we're entering the era where GPU-based databases are becoming more prominent. And again, it's the same idea: you want to offload computation-expensive operations that would normally execute on the CPU and run them down on the GPU. So I'll talk a little about that today. We're going to spend most of our time talking about non-volatile memory, because this is something I have done a lot of research on in the last couple of years, and I could talk exhaustively about it. But I'll spend a little bit of time at the end talking about how to use GPUs to speed up database systems, and then just a little bit about how hardware transactional memory can be applied in the context of transactional workloads to speed things up. So non-volatile memory is not a new idea — people have been thinking about this since the late 1980s. The basic way to think about it is that, to the database system, it's going to look like DRAM. It's going to have roughly the same read and write latency as DRAM, but it's going to have the ability to persist all its writes and maintain durability for all the writes even if you lose power, like an SSD. So the big distinction we saw with SSDs versus DRAM: DRAM is super fast, and it's byte-addressable.
You do loads and stores on cache lines. But again, you pull the power, you lose everything. SSDs have much larger capacities, but they're slower and they're block-oriented devices. So NVM is going to be the best of both worlds in some ways: it's going to look and smell like DRAM, but if you pull the power, everything is going to be persistent. Now, the first devices that are actually available today — we actually have one or two of them here at CMU — are block-addressable. They use this protocol called NVMe, which I realize is sort of an overloaded and maybe confusing term. NVMe basically allows you to do kernel bypass, to do reads and writes of blocks to NVM devices sitting on the PCI Express bus. The later devices that are coming out — and I'll be careful here, because I know a lot of things I can't tell you since I'm under NDA, so I'll be very, very careful not to say anything I shouldn't so I don't get sued — at some point later this year or next year, devices will come out that are byte-addressable. So from your application's standpoint, it's going to look like DRAM, but there will be some special stuff you can do to make sure that if you write something to it, you know that if you come back after a restart, everything is still there. So the history of how we got to this point I think is quite fascinating. Again, this idea of non-volatile memory is old — people have dreamed about it, or have used battery-backed DRAM, since the 1980s — but it's only now that we're getting storage devices based on new storage mediums that don't require batteries, that are truly passive circuits.
So, I'm not an electrical engineer, but if you've ever taken an electrical engineering course, you start off with the fundamental circuit elements — the basic kinds of circuits you can have. The first one is the capacitor, identified back in the 1700s. A capacitor is basically just a battery: you can put a charge in there and then offload it later on. Then in the 1800s they came up with the resistor. The idea here is that it resists the flow of current: for a given voltage going in, the resistance determines the current coming out. And the last one is the inductor, also found in the 1800s — think of it as a coil that stores energy in a magnetic field when you put a current through it, and can offload that energy later. So again, if you took a basic electrical engineering course up until recently, they would describe circuits in terms of these three elements. Now, what happened was, in the 1970s there was this guy named Leon Chua, who was a brand-new professor of electrical engineering at Berkeley. He worked out the math and figured out that there had to be a fourth element — the equations wouldn't be perfectly symmetric if you took just the three elements, but if you added a fourth fundamental circuit element, everything worked out correctly. What he hypothesized was that this new passive circuit element would be a two-terminal device whose resistance changes based on the voltage you apply to it — but when you stop applying that voltage, whatever you set its resistance to would persist, even if you didn't keep giving it power. These are passive circuits that don't require continuous power. So the idea is that you would have this special resistor that can remember whatever its last resistive state was. And so he ended up calling it the memristor.
And so it sort of fits in like this. The reason he would argue this had to be the fourth missing fundamental circuit element is that you couldn't build it from the other three primitives — it had to be its own atomic thing. So he wrote this paper in the 1970s. It was very mathematical. No one read it, right? And it sort of fell by the wayside. I'm sure he kept working on it, but no one really recognized what this thing actually was. Now flash forward to the 2000s. There was a team at HP Labs in California led by the researcher Stanley Williams, and they were trying to build these self-assembling nanodevices in the lab. What they found was that when they did experiments on these devices, they would see these weird measurements — devices with strange properties they couldn't understand. And by pure luck, they basically tried to find as much literature as they could on different devices that exhibited this property, and they stumbled upon Leon Chua's paper from 1971. That's when they realized that what they had actually developed was the missing memristor. They got in touch with him, and they had this big kumbaya moment: okay, we actually figured this out. So there's a great paper called "How We Found the Missing Memristor" from 2008. And there's a subsequent paper published in Nature, I think right around the same time, where they went back into the annals of the scientific literature and found all these publications where other scientists from decades ago had reported weird findings when they applied electrical charges to devices — and they had the same hysteresis loop graphs exhibiting the same behavior.
And what they basically found was that everybody else had been inventing memristors going back to the 1920s and 1930s, but nobody knew what it actually was. Everyone just reported: hey, we found this interesting phenomenon, we don't understand it, here's the graph. But if you actually chart what these devices look like when you apply voltage to them, you'll see they have the resistive properties you'd expect in a memristor. Another confusing thing is that while the fourth fundamental circuit element is called a memristor, HP is also marketing their device as "the memristor." And there are some other things we'll see in a second — like what Intel is putting out — that are technically memristors as well, but where the actual storage medium is something different from what HP uses. So I'll try to clarify that: everything we're talking about here for non-volatile memory is technically considered a memristor; HP just calls their particular device "the memristor." All right, so there are basically three core technologies that people have been pushing that look the most promising for achieving commodity non-volatile memory that can be "cheaply" made, widely used, and have all the properties we're going to want. The first one is called phase-change memory, or PRAM. This is probably one of the oldest ones, and I will say this is actually what Intel has in their non-volatile memory device, Optane memory. And although that's under NDA, I can tell you this because some dude in Korea took their device, popped it open, put it under an electron microscope, and figured out that it was phase-change memory. So I won't get sued if I tell you. The basic way to think about how phase-change memory works is that you have this chalcogenide storage medium here — think of it as a material that can be either crystalline or amorphous.
And then what happens is, depending on what kind of charge you put into it, you change whether it's in its crystalline state or its amorphous state, and that tells you whether you have a 1 or a 0. The way to think about this is you have an access line going into it, and you put either a long pulse or a short pulse down it, and that changes the composition of the material at the nanoscale. They draw a little heater here, but obviously it's not someone with a big lighter — it's just a small nanowire going into it. And based on that, you read whether you have a 0 or a 1. Now, this has actually been out for a while. Prior to Intel's announcement about their Optane device one or two years ago, you could buy phase-change memory, but not at the really large capacities you would need for it to be a replacement for DRAM. So you could buy small, like 120-megabyte, modules of PCM and put them in your cell phone, but obviously that's not going to be a replacement for DRAM. One issue with phase-change memory is that applying this voltage generates heat, and that limits how densely you can pack cells together on the die. And you can also only write to a single cell so many times before it burns out. The burnout rate for this is undisclosed; the preliminary numbers basically say you can write to a cell one order of magnitude more times than a NAND flash cell, but it's still not like DRAM or SRAM, where you can basically write to a cell indefinitely before it burns out. The next one is called resistive RAM, and this is what HP invented when they found their memristor. The way it works is that you have two layers of platinum, and in between you have two layers of titanium dioxide. Titanium dioxide is the same stuff that's in white paint or in sunscreen.
So it's super common — platinum, obviously, less so. And basically how it works is that if you run current in one direction through the titanium dioxide layers, that moves charge carriers up or down between them, and that ends up changing the resistance of the device. So if they're at the top, you read a one; if they're at the bottom, you read a zero. Again, these are solid-state devices. Now, what's actually really cool about the memristor — and this is why I drank the Kool-Aid when they announced it in 2008; I was like, man, this sounds awesome, this totally sounds like the future of computing — is that they talked about how the storage fabric of a memristor could be configured either to actually be used for storage, storing ones and zeros, or to be converted into executable logic gates, like an FPGA. So the idea is that you could have your memristor sitting in your DIMM slot, and half of it could be storing your database while the other half could be whatever you want — evaluating predicates or building an index inside of it. You could actually have code that executes on the memristor, and they were talking about building a neural processor or something on a memristor — but at this point that's a decade ago, and I don't know what's happened with it. What's even cooler about all this is that the way you program the storage fabric is not through the standard NAND logic gates that we use in the CMOS of our existing computers. You use a different type of logic called material implication, which goes back to Bertrand Russell's work in the early 1900s. It's just a different way of expressing logic — and obviously Bertrand Russell did this before computers existed, but it shows that mathematical principles from over a century ago can be applied to modern hardware.
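To make material implication concrete: IMP(p, q) is just "if p then q," and together with the constant false it is functionally complete — you can build every ordinary gate out of it, which is why a fabric that natively computes IMP is enough to do general logic. A minimal sketch of that claim (this is just the Boolean algebra, not how a real memristor crossbar is programmed):

```python
# Material implication: IMP(p, q) is false only when p is true and q is false.
def IMP(p, q):
    return (not p) or q

# With IMP plus the constant False, every other gate can be derived.
def NOT(p):
    return IMP(p, False)

def OR(p, q):
    return IMP(NOT(p), q)          # (not not p) or q == p or q

def AND(p, q):
    return NOT(IMP(p, NOT(q)))     # De Morgan via implication

# Check the derived gates against Python's built-in boolean operators.
for p in (False, True):
    assert NOT(p) == (not p)
    for q in (False, True):
        assert OR(p, q) == (p or q)
        assert AND(p, q) == (p and q)
```

Since NOT, OR, and AND fall out of IMP and false, any combinational circuit can in principle be expressed this way.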
So as far as I can tell — I looked last night — I'm not sure whether HP has actually released anything. They seem to have a website claiming there's something there, but it's not like you can buy it yet. HP has always been very disappointing on this: when they announced it in 2008, they claimed everyone would have memristors in two years; two years later, it was two more years; and it's been perpetually two years away ever since. So as of 2018, it still looks like it's two years away before everyone can have it. Memristors also have the same problem as phase-change memory: you can only write to a cell so many times. And the latency is a little better than phase-change memory, but it's still not as fast as DRAM. The last one is much farther away, but when it happens, it will be a big, big deal, because it could actually be a true replacement for DRAM. Phase-change memory and HP's resistive RAM will probably end up as an additional layer of storage between DRAM and your SSD, whereas magnetoresistive RAM — MRAM, or spintronics — is reported to scale super, super small and have speeds almost like SRAM. SRAM is what your L1, L2, L3 caches are made of. So it would be faster than DRAM, and you could have capacities the size of an SSD. This would be fantastic if it actually happens. What makes it different is that it uses magnetism to determine whether there's a 1 or a 0, rather than flipping the resistance of the device itself. Samsung is probably one of the biggest players in this area, but this is, if I had to guess, ten years away — whereas phase-change memory is out now, and supposedly HP's resistive RAM is out soon as well.
Now, if people have been thinking about non-volatile memory for a while, why am I so optimistic that this is actually happening for real — other than that my student has been doing research in this area, he's graduating, and we're kicking him out the door? Why do I think this is something that we as database developers need to start taking into consideration? The answer is that there have been three major changes in the last two years — three things you need in place in order to support true non-volatile memory in your database system. The first is that the major players in the industry have all gotten together and standardized the form factor for these devices and how they talk to the motherboard. So now they've codified how you'd actually build a non-volatile memory device, what the DIMM slot is going to look like, and how it communicates with the rest of the hardware. Second, Linux and Microsoft have added support for NVM in their kernels — this happened within the last one or two years. This is sometimes called DAX, for direct access. It adds the ability for the operating system to recognize, "oh, I have this new hardware that is actually persistent memory, and I need to treat it differently," and then expose it to the application — which for us is the database system — so the application knows it's dealing with non-volatile, persistent memory, not just something that looks like DRAM but has this extra ability. And the last piece, which is super important, is that in early 2017 Intel added new instructions to the x86 instruction set that allow you to do cache-line flushes to NVM.
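To make the problem those flush instructions solve concrete, here is a toy simulation in plain Python: stores land in a volatile cache, and only cache lines that were explicitly written back survive a power failure. The class and method names are made up for illustration (`clwb`/`sfence` are named after the x86 instructions, but this is a model, not real hardware behavior):

```python
# Toy model: why a durable write needs an explicit cache-line write-back.
class ToyMachine:
    def __init__(self):
        self.cache = {}   # volatile CPU cache: gone on power failure
        self.nvm = {}     # non-volatile memory: survives power failure

    def store(self, addr, value):
        self.cache[addr] = value            # a store lands in cache only

    def clwb(self, addr):
        self.nvm[addr] = self.cache[addr]   # write the cache line back out

    def sfence(self):
        pass  # ordering point; in this toy model clwb is already synchronous

    def power_failure(self):
        self.cache.clear()                  # all volatile state is lost

m = ToyMachine()
m.store(0x10, "commit-record")
m.clwb(0x10)                   # flushed, so it is durable
m.sfence()
m.store(0x20, "dirty-tuple")   # never flushed
m.power_failure()

assert m.nvm.get(0x10) == "commit-record"  # survived
assert m.nvm.get(0x20) is None             # lost: we never wrote it back
```

The point of the exercise: without an instruction like CLWB, the database system has no way to force that second store out of the cache, so it can never promise durability.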
So we'll see this in a second, but the big issue is that from the programmer's standpoint — the application's standpoint — we do loads and stores into memory, and they land in CPU caches that we don't have any control over, right? So we could do a write to memory that we want to be persistent, it lands in our CPU cache, and we have no way of flushing it and knowing that it's actually made it out to NVM, so we can be guaranteed that our transaction, or whatever change we just made, is actually durable. So Intel added the CLFLUSH (cache-line flush) and CLWB (cache-line write-back) instructions to give us full control in our application — in our database system — to do writes and have them written out to NVM, and we can essentially issue an SFENCE and block so that we don't return control to our thread until we know the write has made it out. The combination of these three things is what you need for NVM to actually be useful. So, a little background on what NVDIMMs are going to look like. As I said, the first devices available now, like Intel Optane, are basically PCI Express cards carrying the non-volatile storage; but next they're going to take that storage and put it into a DIMM form factor, and because of the changes to the kernel and the new instructions, we can do reads and writes to locations in memory and know that we're actually writing to persistent memory. The first few form factors have been around for a while, but again, you need the kernel paired up to actually support this. The first form factor is called NVDIMM-F, and this is basically a stick that fits into a DRAM slot but where everything is flash — it has to be paired up with DRAM, which buffers your writes, and the writes later get flushed into the flash.
The next iteration is called NVDIMM-N, where you have the flash and DRAM together on a single DIMM stick, and it just appears to the operating system as volatile memory — it looks like you have an expanded address space, and the hardware itself is responsible for flushing writes out to the flash. The one that matters most to us is called NVDIMM-P. This is true persistent memory using either phase-change memory, resistive RAM, or spintronics: the operating system knows it's non-volatile, knows it's persistent, and there's no DRAM or flash on the DIMM stick itself. So given this, how can we actually use it in our system? There are currently going to be three ways you can configure your application — your database system — to use NVDIMMs. The first is to use DRAM as a hardware-managed cache. In this case here, we have the address space of our database system, and when we do a write to a block or a page, it goes through the virtual memory subsystem and gets written out to a physical location in DRAM — but the DRAM is essentially acting as a buffer for the NVM. The size of the NVM storage is what gets exposed to the operating system as the amount of memory you have, and underneath the covers the hardware is responsible for moving things between the two. Yes? [Student: Why would we still need DRAM if NVM really comes true and we have special instructions for it?] So the question is: why would you still need DRAM in this case, if NVM is persistent and we have special instructions for it? What I'm showing here is that these are different ways you could use NVM in your database system — and in this configuration, you're treating the NVM as volatile memory, right?
It's just a larger capacity. So again, say the DRAM is one gig and the NVM is 10 gigs: the operating system sees 10 gigs of memory even though it really only has one gig of DRAM backing it — and because DRAM is faster than NVM, it can absorb your reads and writes much faster. [Student: Is NVM still going to be slower than DRAM?] For phase-change memory and resistive RAM, yes — I can't say numbers, but yes. Spintronics will supposedly be faster than DRAM. All right, the next way to do this is to have the NVM sitting next to the DRAM. So now when I do a write, I go directly from the address space of my database system, and I can write either to a page in DRAM or a page in NVM. For this to work, the operating system exposes information about the regions of memory, so I can know: here's a block of memory I'm writing to that's going to DRAM, and I know it's not going to be persistent, so I could use it for, say, a temporary buffer if I'm doing a sort, or my hash table if I'm doing a join; and here's NVM, where I could write out my log, because I know it's durable. So again, the application — the database system — has complete control over which region of memory it writes to, and it's guaranteed whatever properties the hardware provides. The last one goes back to a more traditional style of using NVM: the database system uses a buffer pool in the traditional, disk-oriented sense, and reads and writes either to a disk-based file system, or to an NVM-based file system — bypassing the buffer pool, and in some cases the kernel, and writing everything straight to NVM. In the first case, the NVM is the only persistent area and the DRAM is volatile. In this last case, we have the buffer pool, which is volatile, and then two regions of persistent storage. And another thing that could happen too — I'm not showing the line here — but you could have
them move data back and forth between the two layers. Like, if you have a buffer pool miss and you need to fetch a page, you could fetch it in, maybe move it to the non-volatile memory file system first, and then fetch it from there into memory. So again, this is just showing you that NVM isn't just magically persistent memory — there are a bunch of different ways you can configure it through the application; the operating system will expose these knobs to you, and we can design our database systems to use any one of these configurations. I suspect that what will happen is this one will probably be the most common when NVM first comes out, because it's going to be a major software change to use either of the other configurations. So the main takeaway, from a database perspective: if it's a PCI Express, block-oriented device, it's not that interesting, because it just looks like a faster SSD. But when we actually have byte-addressable NVM, when we can go grab single tuples with loads and stores, that is going to be a game-changer for us, because we have to rethink how we build our entire system. Now — I can't prove this; it's conjecture of mine — I suspect that the in-memory database systems — not just our own, but other systems out there like MemSQL and HyPer — are better positioned to use true non-volatile memory when it comes out, because their architecture is already inherently based on accessing single tuples through pointers, right?
It doesn't have all the block management or page management in the buffer pool — all of that can get thrown away if you can access single tuples from NVM, which is essentially what in-memory databases do. So I know — I can't say who, but if you think about it, it's not surprising — that pretty much every major database vendor that's still actively working on their system is in the process of building a specialized engine designed for NVM. Rather than taking their existing block-based storage engine and trying to refactor it to use NVM, they're starting from scratch and writing a brand-new engine for this, right? And you can think of it like what we saw with Microsoft Hekaton, or MySQL, or MongoDB, where you can drop in different execution engines but keep the same front-end layer — that's essentially what they're trying to do for NVM. All right. So the paper I had you guys read was a specific implementation of a storage engine that leverages NVM in an interesting way, but the precursor to that paper was another one that my student and I wrote, evaluating all the different types of database system architectures you could have and how you'd refactor or reinvent them to use NVM correctly, in the way I just talked about. So we'll go through an example of how to do this for all the different types of storage architectures you can have in a database system, and one of them will be equivalent to the write-behind logging stuff you read about in that paper. So this paper here, from SIGMOD 2015, was based on a system we were calling N-Store — because we already had H-Store, E-Store, S-Store, so we did N-Store. This ended up being the initial prototype of what eventually became Peloton: I decided to stop calling everything "store," and we sort of rewrote the system and merged it into Postgres. So Peloton came out of this N-Store
project So there are some things we need to have to make sure we can use NVM correctly in our storage engine So I've already talked about this one in the beginning but if we just do all those in storage into memory traditionally this would always land in our CPU caches and then we don't know when the operating system actually would write them out so we can't assume that anything we write to NVM would actually be truly durable so my intel added the new cache line flush instructions to allow you to say take this cache line and block me until you know that it's actually made it truly as NVM So the way this basically works at a high level is that you have your database system here and it wants to do some stores, say it wants to do a flush for a transaction so you do your store operation and it's always going to land in your CPU caches first and then you use the cache line write back instruction and that will then flush it out to the memory controller and so all the new intel motherboards in order to support non-volta memory basically have a little capacitor here on the memory controller that can make sure that if the power gets cut you have enough energy to write everything out to NVM so once you do the cache line write back and it lands on the memory controller that's enough for it to be really durable and so in this case here you would return control back to the application thread but then what will happen is it's going to use this process called asynchronous data refresh and the memory control at some point will write your data actually out to NVM here and again the capacitor has enough juice to say if I lose power I can just make sure I flush everything out and we're talking about the nanosecond scale for doing this so again this is enough to make sure that we can do writes and everything will be durable even if we lose power the next thing we're going to need in our system is to make sure that all our pointers internally to our own data structures will be 
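To make that flush protocol concrete, here is a toy Python simulation of the three domains a store passes through (volatile CPU cache, the ADR-protected memory controller, and the NVM media). All class and method names here are made up for illustration; real code would use compiler intrinsics such as `_mm_clwb` plus a fence, not Python:

```python
# Conceptual simulation of the store -> CLWB -> ADR durability chain.
# A store lands in the cache; clwb() pushes it into the ADR domain,
# which the capacitor guarantees will drain to NVM even on power loss.

class SimulatedNVM:
    def __init__(self):
        self.cache = {}        # volatile CPU cache lines
        self.controller = {}   # ADR domain: flushed, guaranteed durable
        self.nvm = {}          # the persistent media itself

    def store(self, addr, value):
        self.cache[addr] = value          # stores always hit cache first

    def clwb(self, addr):
        # cache-line write back: line reaches the memory controller,
        # which is enough for it to be considered durable
        if addr in self.cache:
            self.controller[addr] = self.cache[addr]

    def power_loss(self):
        self.cache.clear()                # unflushed cache lines are lost
        self.nvm.update(self.controller)  # capacitor drains ADR to NVM
        self.controller.clear()

mem = SimulatedNVM()
mem.store("A", 1)    # never flushed
mem.store("B", 2)
mem.clwb("B")        # B is now in the durability domain
mem.power_loss()
assert "A" not in mem.nvm and mem.nvm["B"] == 2
```

The point of the sketch is the asymmetry at the end: only the write that went through the flush survives the crash, which is why the database cannot treat a plain store to NVM-mapped memory as durable.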
The next thing we need in our system is for all the pointers internal to our own data structures to stay consistent if we restart. The way to think about this is: if I run my database system and I have internal pointers from the index to the locations of tuples, I want to be able to have the system crash, restart, and still have all those pointers be valid. I want to take what's out in NVM, bring it back into my process's address space, and have those virtual pointers still be correct. And if I update my version chain and add a new tuple, I want to be sure that if everything gets blown away, I come back and all my pointers still point to the right things, because traditionally there's no guarantee of that. To make this work, we ended up writing our own variant of malloc that provides these flushes and this naming convention, and on top of it we built what we call non-volatile memory pointers, which guarantee exactly these properties: if we flush everything we lose nothing, and if we restart our process and map our chunks of memory back in, our pointers still point to the correct locations. Back in 2014, when we did this, there weren't any libraries out there that could do this for us, so we had to roll our own. Now there actually is a great library from Intel (I should have put a link here) called PMDK, the Persistent Memory Development Kit; it used to be called pmem.io, or libpmem. Think of it as the STL for non-volatile memory: it has containers, it has memory allocation, and you can use it in your application to get all of this for free. So now, building on top of this memory allocator, we're going to look at the three canonical database engine architectures you can have and how to adapt them to use non-volatile memory. The first one is in-place updates with MVCC; the write-behind logging paper you read also relies on MVCC.
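Before getting to the engines, here is a minimal sketch of the usual trick behind restart-safe pointers (the approach PMDK-style persistent pointers take): store an offset relative to the base of the mapped region rather than a raw virtual address, since the region may be mapped at a different base after a restart. The `PersistentRegion` class is invented for illustration and is not PMDK's actual API:

```python
# A "persistent pointer" is a region-relative offset, not a virtual
# address, so it survives the region being remapped at a new base.

class PersistentRegion:
    def __init__(self, base):
        self.base = base       # virtual base address of the mapping
        self.data = {}         # offset -> stored object

    def alloc(self, offset, value):
        self.data[offset] = value
        return offset          # hand back the offset, not base + offset

    def deref(self, offset):
        return self.data[offset]

region = PersistentRegion(base=0x10000)
p = region.alloc(64, "tuple-42")
# ...process restarts; the same data is mapped at a different base...
region.base = 0x7F3200000000
assert region.deref(p) == "tuple-42"   # the offset is still valid
```

A raw virtual address stored in NVM would have dangled after the base changed; the offset does not, which is the whole guarantee the lecture's non-volatile memory pointers provide.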
The way this one works is that you have a table heap plus a log plus snapshots; this is the standard in-memory database architecture for transactional systems that we've been talking about so far, and the example systems we're basing it on are VoltDB and H-Store. The next one is a copy-on-write architecture: you have a tree hierarchy, and any time a transaction updates a page you make a copy of it and apply the changes to the copy; when that transaction commits, you flip a master record pointer to point to the new version of the tree. This is equivalent to LMDB, the Lightning Memory-Mapped Database. The last one is a log-structured architecture, where you have no table heap and only a log, and this is based on LevelDB or RocksDB. So the way to think about it: the copy-on-write one only has a heap, and the log-structured one only has the log.

I'll go through these quickly. Let's focus first on the in-place-updates engine. This is how the in-memory systems we've discussed so far normally work: you have an index that points to tuples in the table heap, when you make changes you write them out to the write-ahead log, and occasionally you take snapshots. The issue is that if I update a single tuple, we end up making three writes to memory, or to non-volatile memory, for that one update. Assume everything fits in NVM: when I update a tuple, I put a log record in the write-ahead log, I make my change in the table heap, and then eventually I take a snapshot and write the checkpoint out to a file. If there's no DRAM and all of this sits in NVM, we are essentially making three copies of the same change. As I said, this duplicated data is problematic because you can't write to these cells an infinite number of times; three writes per update could burn out the device very quickly. It's also problematic for recovery latency, because now when I want to recover my database system I have to load the last checkpoint and then replay the log to reinstall the changes. In actuality, as long as I'm careful about how I do my transactions and updates, I could just come back and recognize that my heap is persistent and already contains all the data I need to recover the database. Maybe I look at the log to figure out which changes should be rolled back because their transactions hadn't committed before the crash, but I don't need to load the last snapshot, and I certainly don't need to replay the entire log. So the idea is that we can come up with an optimized architecture that leverages non-volatile memory pointers pointing to the record that changed, rather than recording how it changed; when we crash and come back, we just see which transactions were active at the moment of the crash and undo their changes. This is essentially the main technique applied in the write-behind logging paper you read: we only log pointers and metadata about what was changed, not how it changed. That is different from write-ahead logging, where you record the delta, or the modified tuple, and reinstall it on recovery. The log here is just a way for us to find the things that should be rectified upon recovery. So we go back to the architecture and make it NVM-aware: we get rid of the snapshots entirely, because the table heap already has everything we need, and when we make a change we just write to the log a pointer that says "here's what was modified," and then go apply the change. Now, in the case of write-behind logging, there is some extra bookkeeping that records which transactions were active at each step.
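Here is a deliberately simplified sketch of that recovery idea. The real write-behind logging protocol in the paper uses MVCC commit-timestamp gaps rather than explicit before-images, but the core point is the same: the log holds only metadata about which slots uncommitted transactions touched, and recovery fixes those up instead of replaying redo records. Every name below is invented for the illustration:

```python
# Write-behind-style recovery sketch: updates go straight to the
# persistent table heap; the log stores only (txn, slot, before-image)
# metadata, never a redo delta. Recovery undoes slots owned by
# transactions that never committed -- no checkpoint load, no replay.

heap = {"k1": ("v0", None)}          # slot -> (value, owning txn)
log, committed = [], set()

def update(txn, slot, new_value):
    old = heap[slot][0]
    log.append((txn, slot, old))     # pointer/undo metadata only
    heap[slot] = (new_value, txn)    # in-place write to persistent heap

def commit(txn):
    committed.add(txn)

update("T1", "k1", "v1"); commit("T1")
update("T2", "k1", "v2")             # T2 is alive when we "crash"

def recover():
    for txn, slot, old in reversed(log):
        if txn not in committed:
            heap[slot] = (old, None) # roll back the loser

recover()
assert heap["k1"][0] == "v1"         # T1's committed value survives
```

Contrast this with write-ahead logging, where recovery would have loaded the last snapshot and re-applied every logged delta; here the persistent heap is the database, and recovery touches only what the in-flight transactions dirtied.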
That bookkeeping tells us the timestamp range of what should be visible when we come back, because it's a multi-version environment; the idea is that you use those timestamps to do cooperative garbage collection and prune out things that shouldn't be there. In this in-place case, we just need to know that this is actually the correct version we should have.

Next is the copy-on-write engine, and again this is where we only have a table heap and no log. The issue is that in a page-oriented architecture like LMDB, any time you update a single tuple you have to copy the entire page, apply your change to the copy, update the dirty directory to describe the new version of the tree, and then do a pointer swap at the top so the master record points to it. So to update a single tuple you're copying a whole page, because otherwise the dirty directory tracking your changes would be massive; you take a block-oriented approach to amortize the cost of keeping track of those pointers. The problem is that these copies get really expensive: one change costs one full page copy. The way to speed this up in a non-volatile memory environment is that, instead of pages of tuples, you just have pointers, since you're going to have a tree structure anyway. Now we can basically build a B+tree where the tuples are stored on the leaf pages and the dirty directory is just another node in the tree; it's more fine-grained, so we don't pay for an expensive full-page copy on every update. We can do this because NVM is byte-addressable, so we can always get to a single tuple.
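The fine-grained version of copy-on-write is just path copying. A minimal sketch, using a two-level toy tree rather than a real B+tree: committing an update builds new nodes only along the root-to-leaf path and then atomically swings the master pointer, while old readers keep a consistent view through the old root. The node layout here is invented for the example:

```python
# Path-copying sketch: updating one leaf copies only the nodes on the
# root-to-leaf path; untouched subtrees are shared, and the old root
# remains a consistent snapshot until the master pointer is swung.

def cow_insert(node, key, value):
    if node["leaf"]:
        return {"leaf": True, "kv": {**node["kv"], key: value}}
    # copy only the child on the search path; share the other subtree
    if key < node["split"]:
        return {**node, "left": cow_insert(node["left"], key, value)}
    return {**node, "right": cow_insert(node["right"], key, value)}

old_root = {"leaf": False, "split": 10,
            "left":  {"leaf": True, "kv": {5: "a"}},
            "right": {"leaf": True, "kv": {15: "b"}}}
new_root = cow_insert(old_root, 7, "c")       # commit = master -> new_root

assert new_root["left"]["kv"] == {5: "a", 7: "c"}
assert old_root["left"]["kv"] == {5: "a"}     # old snapshot intact
assert new_root["right"] is old_root["right"] # untouched subtree shared
```

The last assertion is the whole win over page-granularity copy-on-write: the right subtree is shared by reference, so a single-tuple update copies only the path that actually changed.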
The last one is the log-structured engine; at a high level, this is how LevelDB or RocksDB work. You have an in-memory data structure, the memtable, with some index over it, plus a write-ahead log where you record your changes; over time you take the contents of the memtable and write them out to a sorted string table (SSTable). So when I want to do an update, I find my entry in the index and apply my tuple delta, and at some later point the memtable gets copied out to an SSTable with all my changes applied. Just like with the in-place engine, you're paying a penalty for all these extra copies of the data. When everything fits in NVM, you don't need the second copy, because the memtable is already persistent. RocksDB assumes the memtable is volatile, which is why it writes everything out to the SSTable; but if the memtable is non-volatile, you don't need that second copy. So you can get rid of the duplicate data, and you can also get rid of compaction entirely: you just set the full memtable aside, and that's enough to still find the data, without worrying about generating these long runs. Essentially, you drop the SSTables, keep only the memtables, and that's enough to make everything persistent while getting the same performance. Yes? [Student asks about running the in-place engine without MVCC.] So your question is, if we're not using MVCC... I'm not sure what you're asking — if you're using MVCC, or if you're not using MVCC, what do you mean? So you're talking about the in-place engine? Alright, let me go back to that. So you're saying that if you're doing in-place updates, and we don't expect MVCC — just a single basic version — yeah, that's what this is. What are you saying?
[Student suggests] that you don't have to do anything different, that you don't have to do any logging. Yeah, in this example, if you make sure a transaction's changes are applied and committed atomically, then when you come back this thing is guaranteed to be consistent. The issue, though, is that you do need some undo information somewhere: if you overwrite data directly and the transaction hasn't committed yet, you need to be able to undo it. With MVCC that's easy, because you always have the older versions around. Let's take it offline, because we're short on time.

Okay, so the main takeaway I want you to get out of this is that if we know we're operating on non-volatile memory, we can devise storage architectures that reduce the amount of data we have to write, because we know the memory we're writing to is actually persistent. You don't have to make a separate copy, put things into a block, and flush it out; you can design the system to say, "I know I can write to this region of memory and it will be persistent," and that's enough to guarantee that if I crash and come back, everything will be there. This is also the reason I don't think DRAM is going to go away: there will be things where you actually want a volatile buffer, because you don't care — if I'm doing a hash join, who cares if I crash in the middle of the query and my hash table gets wiped out? Using DRAM for that also reduces the number of writes, and the wear, on the storage device. So I think the future is DRAM and NVM together: NVM can be the primary storage location of the entire database, while DRAM holds hot data and temporary data structures, and that's probably the best way to
get the best performance out of this hardware. And as we said, for the recovery optimizations, we can come back up much faster if we're just dealing with pointers and making things consistent rather than replaying the log; that was the big result of the write-behind logging paper.

The next area I want to talk about is GPUs. GPUs have obviously been around for a while, but until fairly recently people only applied them to graphics. Now, with the advent of Bitcoin mining and deep learning, people are looking at how to apply these highly parallel GPU architectures to other types of computation, and within roughly the last decade people have looked at using GPUs to accelerate the execution of analytical queries. GPUs have a lot of cores — thousands of them, compared to maybe a couple dozen on a single Intel Xeon socket — but those cores are much simpler than a Xeon's, so you can't have them run really complicated logic. If you have them do repetitive operations that are relatively simple compared to what the CPU supports, you may be able to boost your system's performance. The kinds of things we want to target for GPUs are operations that don't require blocking and don't require branches or conditionals during execution. The best possible scenario is a sequential scan with a filter, because every single core — a thousand of them — can operate on a different stream of data, each applying the same filter; they don't need to communicate with each other or take an if-branch down one code path versus another. You just stream the data in and run at almost bare-metal speed. What would obviously be bad for this is an off-the-shelf B+tree doing index probes, because you're grabbing a single tuple, and it's hard to parallelize that across a thousand cores. (I've seen a couple of papers that devise B+trees that run on GPUs, but again, you wouldn't want to do transactions on those.) The key thing to understand about how GPUs fit into our system architecture is that a GPU has its own memory that is super fast but usually not cache-coherent with the CPU's memory. I say "usually" because AMD has this thing called the APU, and I think Intel recently announced a GPU-like accelerator on the same socket; I don't know whether those are cache-coherent. So you basically have a big pool of memory for your CPU, which is where your database lives, and a large but not-as-big pool of memory on your GPUs, and the big cost is moving data back and forth between the two. The way to think about it is like this: you have your CPU with DRAM, which is the primary storage location of your database, and somewhere else on the host machine you have a bunch of GPUs, which are again just collections of smaller programmable cores with their own memory. How you can use this is in many ways limited by the speed at which you can get data from the CPU down to the GPUs. Roughly speaking, with DDR4, going between the CPU and DRAM runs at around 40 gigabytes per second, but going over the PCI Express bus from the CPU to the GPU is around 16 gigabytes per second. So this is a significant bottleneck: we can process the data much more quickly down on the GPU, but the benefit of all those extra cores might be negated by the fact that we can't get our data down there fast enough.
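To see why the bus dominates, here is a back-of-the-envelope cost model using the rough bandwidth figures from the lecture (40 GB/s to DRAM, 16 GB/s over PCIe); the 400 GB/s GPU scan rate is an assumed, generously fast number for illustration, not a measurement:

```python
# Toy cost model: even if the GPU scans data 10x faster than the CPU,
# cold data must first cross the ~16 GB/s PCIe bus, so the end-to-end
# time can be worse than just scanning in place from DRAM.

def scan_time_cpu(gb, cpu_scan_gbps=40):
    return gb / cpu_scan_gbps                      # seconds

def scan_time_gpu(gb, pcie_gbps=16, gpu_scan_gbps=400):
    return gb / pcie_gbps + gb / gpu_scan_gbps     # transfer + scan

# Scanning 100 GB of cold data:
#   CPU: 100/40            = 2.5 s
#   GPU: 100/16 + 100/400  = 6.5 s  -- transfer-bound, GPU loses
assert scan_time_gpu(100) > scan_time_cpu(100)
```

This is exactly why the architectures that follow either keep the data resident on the GPU or carefully stream it: the win only materializes when you amortize or hide the transfer.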
Now, we'll see on the next slide that some systems have proposed basically dumping the entire database onto your GPUs, with every query running down there. Everything ends up being a sequential scan, because you can't really build an index down there; every core just scans some segment of the table in your query and produces the answer that way. Of course, this limits you to the amount of memory you have on your GPUs. You can also chain these together: NVIDIA has something called NVLink that gives you a 25 GB/s connection between two GPUs, and on some architectures you can get the same connection to the CPU. As far as I know you can't use NVLink with Intel CPUs, but IBM's POWER8 and POWER9 provide NVLink so the GPUs can access memory up on the CPU; I don't know whether that's cache-coherent.

So there are three basic ways to architect a database system to use GPUs. I'm showing a bunch of different systems here that all do different things, but they're mostly moving toward the last model. As I said, the easiest way to use GPUs in your database system is to take the entire database and stick it on the GPUs; every query runs entirely on the GPU and spits the answer back to the CPU, which sends it to the client. Of course, this means that any time you update the database, you either have to reload the whole thing or do an expensive merge operation to apply your changes to the GPU-resident data. In this model all queries do massively parallel sequential scans, crunching through the data as fast as possible, and some of the numbers they get are actually quite ridiculous. There are algorithms for basically every relational operator you could want — hash joins, scans obviously, aggregations — all of those can be implemented on a GPU. Another approach is a hybrid storage approach: you put just the most important columns of your database down on the GPU and retain the rest in CPU memory. Whenever a query comes along, you first figure out what portion of it you can push down to the GPU; hopefully that cuts out most of the data you don't actually need to look at. The GPU spits back offsets for the tuples that satisfied the predicates run on the GPU-resident data, and then the CPU code does the in-memory lookups to find the remaining matching records and performs whatever additional filtering or processing is needed. Again, GPUs have at most maybe 100 GB of memory, whereas the CPU can have terabytes; the idea is to use the GPU's memory more intelligently, so you don't try to put everything there but still get some of the benefit of GPU processing. The last model, which is probably becoming the most common, is that the database always resides in CPU memory, or even out on disk, and you have streaming algorithms that stream data down to the GPUs over the PCI Express bus or NVLink, do the processing there, and send results back to the CPU. The tricky part is orchestrating it so the GPUs are never starved waiting for data: you're always streaming data from the CPU to the GPUs as fast as possible, they're crunching and producing results, and by the time they finish the first batch, the next batch is ready to go. This takes careful engineering to make work, but as far as I know all the relational operator algorithms can be implemented in this streaming fashion.
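The streaming model is essentially double buffering: overlap the transfer of batch i+1 with the processing of batch i so the processor is never starved. Here is a small sketch where a thread and a bounded queue stand in for an asynchronous host-to-device copy and a GPU kernel; in real systems this would be CUDA streams and pinned-memory copies, which are well beyond this toy:

```python
# Double-buffering sketch: a bounded queue (capacity 2) lets the
# "transfer" thread stage the next batch while the "GPU" thread is
# still crunching the current one, so neither side waits long.

import queue
import threading

def transfer(batches, q):
    for b in batches:
        q.put(b)              # stands in for an async host->device copy
    q.put(None)               # sentinel: no more batches

def process(q, results):
    while (b := q.get()) is not None:
        # stands in for the GPU kernel: filter (x > 2) + aggregate
        results.append(sum(x for x in b if x > 2))

q, results = queue.Queue(maxsize=2), []
batches = [[1, 2, 3], [4, 5, 6]]
t1 = threading.Thread(target=transfer, args=(batches, q))
t2 = threading.Thread(target=process, args=(q, results))
t1.start(); t2.start(); t1.join(); t2.join()
assert results == [3, 15]     # 3 from the first batch, 4+5+6 = 15
```

The bounded queue is the important design choice: it caps how many batches are "in flight," which models the limited staging buffers you actually get on the device side.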
A quick pitch again: for those of you coming back in the fall, our seminar series on hardware-accelerated databases starts, I think, the first week of September. The companies listed here are coming — actually, I'm still waiting on these guys, who don't respond to my email, so I have yet to confirm with them, but all the other ones are confirmed, and a few others are coming as well. These are all startups; Brytlyt is actually based on Postgres, but as far as I know all the others are brand-new architectures. If this turns out to be a promising addition to database systems, what will probably happen is that the major database vendors buy some of these startups and the technology gets merged into the full systems; that's usually how it goes. So Oracle will have this in ten years.

Alright, the last thing I want to talk about is hardware transactional memory. It was invented by Maurice Herlihy — he used to be a professor here at CMU — in the early 1990s; he's the same guy who invented linearizability, and he should probably get the Turing Award in the next five years, hopefully. The way to think about it is that the hardware itself is going to manage transactions for us. Now, when I say transactions, I don't mean the larger logical transactions we deal with in our database systems; think of these more like mini-batches protecting critical sections in our code. The idea is that in your code you tell the CPU, "hey, I'm starting a transaction," have your code do whatever it normally does, and then commit that transaction; the hardware figures out whether there was a conflict with any other thread that read or wrote the same memory regions you did, and if so, you abort and roll back. At a high level, these operate the same way optimistic concurrency control works: you maintain your reads and writes in a private workspace, and when you go to commit, you validate against other threads that may have touched the same regions. Intel added this to the instruction set, calling it TSX, Transactional Synchronization Extensions; I don't know whether it exists in other vendors' hardware. It was announced in 2012, you could buy chips with it in 2013, and then in 2014 there was apparently a big bug where it wasn't actually transactional, so they turned it off; since 2015 or 2016 it has been back on, and any Xeon you buy now will have this capability. Now, one key limitation — and the reason this is not going to replace all the concurrency control stuff we covered before for database transactions — is that the read/write set of any transaction in this environment has to fit in your L1 cache, which I think is 32 kilobytes. The reason is that TSX piggybacks on the cache coherency protocol normally used in a multi-core, multi-socket environment to detect whether two transactions conflicted. So we can't use this for general-purpose transactions, but we can use it in smaller scenarios to help us build latch-free indexes — latch-free in the sense that we still define latches, but the hardware won't actually write to them. There's a paper from 2015 — actually from the Bw-Tree people — showing how to use this in the context of indexing, and there's a paper from the HyPer guys, I think it's on the reading list for this class, where they show how to use hardware transactional memory for general-purpose transactions: you basically have to break everything up into mini-transactions so that each one fits in the L1 cache, and as far as I know, nobody has ever actually done that in a real system.
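Since HTM behaves like optimistic concurrency control in hardware, here is a conceptual Python simulation of the commit-time mechanics: buffer writes privately, track what you read, validate at "commit," and retry on conflict (roughly what RTM's abort-handler path exists for). This models the idea, not the TSX instructions themselves, and all names are invented:

```python
# OCC-style simulation of an HTM transaction: writes are buffered and
# invisible until commit; validation checks that nothing in the read
# set changed underneath us, otherwise the attempt aborts and retries.

memory = {"x": 0, "y": 0}
version = {"x": 0, "y": 0}     # bumped on every committed write

def htm_transaction(body, retries=3):
    for _ in range(retries):
        read_versions, writes = {}, {}

        def load(k):
            read_versions.setdefault(k, version[k])  # record read set
            return writes.get(k, memory[k])          # see own writes

        def store(k, v):
            writes[k] = v                            # buffered privately

        body(load, store)
        if all(version[k] == v for k, v in read_versions.items()):
            for k, v in writes.items():              # commit: publish
                memory[k] = v
                version[k] += 1
            return True
    return False   # give up -> fall back, e.g. actually take latches

ok = htm_transaction(lambda load, store: store("x", load("x") + 1))
assert ok and memory["x"] == 1
```

The `return False` fallback mirrors what real TSX code must do too: a hardware transaction can always abort (conflict, capacity, interrupt), so there has to be a non-transactional path, typically one that takes the real latches.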
Okay. So the programming model for HTM, at least for x86 on Intel, has basically two modes. The first is called hardware lock elision (HLE). The idea is that a thread declares it's running a hardware transaction, and any time it writes to a lock or latch protecting a critical section, the CPU doesn't actually apply the write; it's sort of a Jedi mind trick — the hardware says, "sure, you wrote to it, no problem," but anybody else reading that location won't see your write. If there's a conflict — say, two threads try to write to the same location — then your mini-transaction in the critical section gets aborted, and the CPU automatically rolls you back to the beginning of the critical section and starts it over, except this time it actually takes the latches on the memory locations you write to. So on the first pass you don't apply any writes and nobody sees them; on a conflict, you roll back, run it all again, and this time the writes to the locks really happen. The more advanced version is called restricted transactional memory (RTM). It's the same idea as HLE, but on an abort you can provide a callback — a code path the hardware jumps to — that lets you do something different. With HLE the retry is automatic: on a conflict you're rolled back and re-execute from the beginning in the same spot, now acquiring your latches; with RTM you can jump out and execute some other code instead.

As I said, the main thing I think this is actually useful for, at least in its current incarnation, is latch elision in trees, for indexes. We saw this before, when we talked about index latching at the very beginning of the semester: if we want to do an insert into our index — say, add key 25 here — the speculative latching approach takes read latches all the way down, releasing them once we reach a node in the tree that's considered safe (meaning, for our insert, that it won't split), and then we take our exclusive latch at the leaf and do the insert. That's how you do it without hardware transactional memory. With hardware transactional memory, the program looks like this: you have a section where you declare to the CPU, "here's my transaction; the scope of everything I'm acquiring is in here." I still say latch and unlatch, but the CPU isn't actually going to apply those updates; anyone else who comes along and looks at the same locations will see everything as unlatched. We get to the bottom, commit the transaction, and only then could it apply the updates from acquiring the latches. So it's as if we magically got to the bottom without conflicting with anyone else acquiring latches at the same time; someone else could have come down the other way, updated something on another branch, and never collided with us. From the outside, it looks like you started at the root and magically jumped to the bottom holding the correct latches, and everything is still safe because, under the covers, the CPU never actually applied your updates. Is this clear? So, as far as I know, nobody actually uses hardware transactions in this manner in production, or for any other part of the system; they might do it and not disclose it, but I haven't seen any papers saying an in-memory database or a commercial disk-based database
that does stuff like this; there have only been research publications, even though hardware transactional memory has been out for a while. [Student asks why.] I think, again, my guess would be because everything has to fit in L1, and under the covers it's basically doing OCC anyway. I also think about it like this: the existing systems have years and years of B+tree code that uses latches and crabbing and things like that; rewriting everything to use this would be a major undertaking, and that's probably why.

Alright, so to finish up: this was mostly about non-volatile memory, and non-volatile memory really is interesting — when it comes out, it could require us to rethink entirely how we build database systems. I suspect that in five years, if non-volatile memory becomes more common, then for the introductory database class we will throw out all the stuff we teach about buffer pools and low-level disk-based storage; NVM could be a complete game changer for how we design our database systems. That's not to say disk-oriented systems won't still be around, but nobody will build a new system targeting SSDs if NVM delivers what it claims: large capacities that are super fast and durable. In the case of GPUs, again, this will be covered in the seminar in the fall; right now none of the major systems use them, as far as I know, but there are definitely a lot of startups in this area. And hardware transactional memory is, for now, really only useful for doing latch elision in indexes.

Okay, so that's pretty much the end of lecturing from me for the semester. Again, I'll hand out the final exam on Wednesday — it won't be too bad — and we'll have the guest speaker from Snowflake come talk about their system. I think it's always useful, because they'll cover a lot of the same things I've talked about, but put them in the context of trying to run a real system. Okay, yes? [Student: When is the final exam due?] I'll make it due the same day as the final presentations, May 14th; I'll post it to the website. Yes? [Student: Last lecture we talked about DRAM and its non-trivial energy consumption — where does NVM stand relative to that?] The question is, what is the energy consumption of NVM? One thing is that you don't have to keep refreshing it... well, that's not entirely true, and I can't say why on camera, but it's purported to be less. Certainly, one terabyte of DRAM is going to suck a lot more power than one terabyte of NVM. Yes? [Student: First, how expensive do we expect these things to be? And second, doesn't that change not only the way we think about databases but also the way we think about computer architecture?] So the question is, doesn't non-volatile memory change not only how we think about databases but also how we think about computer architecture? Absolutely — but the most important application is databases. That's my answer. Alright, any other questions? Thanks, everyone.