All right, so today's lecture is another introduction lecture, where we'll be discussing in-memory databases. The idea is to motivate the rest of the semester, where we take things you may have learned in an introductory class and talk about why a bunch of that machinery is no longer necessary in an in-memory environment. So we'll start with the background you need to know about disk-oriented database systems, then we'll talk about in-memory database system architectures and why they're different, then I want to spend a little time on some notable early in-memory databases that came out in the 1990s, and then we'll finish with a high-level overview of the Peloton system and project number one. Of course, stop me as we go along if you have any questions. All right, so last class I gave the history of database management systems, and the truth is it was really a history of data models in database management systems, right? If you remember, I opened up talking about IDS and the CODASYL/network data model, then the hierarchical data model in IMS, and then the relational data model in Ingres and System R. It was really about how these different systems came along and tried different data models over time, and how the relational data model from the early 1970s ended up winning out; it is still the dominant data model today. But the overarching theme across all the systems I talked about over the decades was that they were inherently disk-oriented database systems. Up until you get to the late 2000s and early 2010s, when we talked about the NoSQL and NewSQL systems, pretty much all the systems I covered were disk-based, and they were designed at a high level around the same architecture that was used in the 1970s when System R and Ingres were developed, right? They assumed the hardware that was available at the time and made certain design decisions predicated on those assumptions, and essentially the history of database management systems from an implementation standpoint is a history of how to deal with the limitations of hardware. So back in the 1970s, the hardware they had to deal with when building those first database systems was a lot different than the hardware we have today, right? Foremost, they were dealing with single-socket uniprocessors: a single-core CPU, a single program counter. The amount of RAM available to these systems was quite limited; you were lucky to have kilobytes. And because RAM was small, the database had to be stored on disk, and disk was really, really slow back then. It's slow now; it was excruciatingly slow back then. So this is the environment they dealt with, and they made certain design decisions based on what they had. And a lot of those design decisions permeate up to today, right? MySQL, Postgres, Oracle, you name it, there are a lot of things they do that made sense back then which may not actually make sense now.
So now that DRAM capacities are large enough that almost all but the largest databases can be stored entirely in memory, this changes how we want to design our database management systems. I'll say as an aside, the kind of databases we're going to focus on in this class are structured databases, right? These are things where there's a schema that the database system enforces and some application that's reading and writing data through it. These are typically going to be much smaller, and therefore it's feasible that we can store everything in DRAM. When people talk about the Googles, Amazons, and Facebooks of the world, they have really, really large databases, but those are either semi-structured or unstructured data, right? A database full of photos at Facebook is unstructured data; you can't really run a SELECT statement on the contents of an image. You have to extract features out of it and then put those into a structured database. If you have log files generated from your web services, you're going to write those out to HDFS and then run something in Spark or Hadoop later on to process them. Those are unstructured or semi-structured. So we're really focusing on structured databases. In this case, most databases are a couple of gigabytes, maybe a couple hundred gigabytes. The largest OLTP database I knew about for a while that could be stored entirely in memory was maybe a couple hundred terabytes, right? We're not talking about petabytes or exabytes here. So it's now possible to rent a machine on Amazon for a few bucks an hour and have a lot of DRAM. The machines you'll be using for your projects in this class, the ones we provide here at CMU, each have 120 gigabytes of RAM. That's a lot of structured data. Right, so now the question is: if we have enough DRAM to store all but the largest databases entirely in memory, why not just take one of these existing, "traditional" (in quotes) disk-oriented database systems, Oracle, MySQL, Postgres, SQL Server, those guys, and just give it a really large cache, a really large buffer pool? Is that going to be the same? To understand this, we want to see how a disk-oriented system works, and then we'll understand why; the spoiler is that the answer is no. And this is essentially the paper that you read for today's lecture. So first we need to understand what a disk-oriented database system looks like. My definition of a disk-oriented database system is one where the system architecture is predicated on the assumption that the primary storage location of the database is on non-volatile storage, where non-volatile storage could be either a spinning-disk hard drive or a NAND flash solid-state drive. And what's going to happen in this system is that it's going to organize the database in files on disk, these heap files, and within those files it's going to organize things further into pages, and that's where you store your tuples. The organization most database systems use is slotted pages, right? So you can pack multiple tuples into a single page.
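To make this concrete, here is a minimal C++ sketch of a slotted page of the kind just described; the struct names, field sizes, and 4 KB page size are illustrative choices for this example, not taken from any particular system.

    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    static constexpr size_t PAGE_SIZE = 4096;

    // A slotted page: the header and slot array grow from the front of the page,
    // tuple data grows backward from the end, and they meet in the middle.
    struct SlottedPage {
      struct Header {
        uint16_t num_slots;       // number of slots currently in use
        uint16_t free_space_end;  // offset where the tuple data region begins
      };
      struct Slot {
        uint16_t offset;          // where this tuple starts within the page
        uint16_t length;          // tuple size in bytes
      };

      char data[PAGE_SIZE];

      Header *header() { return reinterpret_cast<Header *>(data); }
      Slot *slots() { return reinterpret_cast<Slot *>(data + sizeof(Header)); }

      void Init() {
        header()->num_slots = 0;
        header()->free_space_end = PAGE_SIZE;
      }

      // Copies a tuple into the page; returns its slot number, or -1 if full.
      int Insert(const void *tuple, uint16_t length) {
        Header *h = header();
        size_t slot_array_end = sizeof(Header) + (h->num_slots + 1) * sizeof(Slot);
        if (slot_array_end + length > h->free_space_end) return -1;
        h->free_space_end -= length;
        std::memcpy(data + h->free_space_end, tuple, length);
        slots()[h->num_slots] = Slot{h->free_space_end, length};
        return h->num_slots++;
      }
    };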
And so you can't operate directly on data on disk; you always have to copy it into memory. The database system uses its buffer pool as a cache for the slotted pages: it retrieves them from disk, brings them into memory, and then it can manipulate them however it wants, or use them in a query to answer a request. The database system is in charge of deciding how and when to move those blocks of data, the slotted pages, back and forth between disk and memory. So if a query needs a tuple and that tuple is not in memory, the system knows how to go out to disk and get it. And the key thing that makes all this work is the buffer pool. Again, the database system starts running a query, it accesses a page, and it has to check whether that page is in memory, whether it's in a frame in the buffer pool. If it's not there, the system has to go out to disk, retrieve it, and copy it into a free frame in the buffer pool. And just to be clear, because the terminology becomes confusing: the database organizes pages on disk, and within those pages there are slots that tell you where you can store tuples; a location in the buffer pool in memory where you can store one of those pages is called a frame. You would think it would be called a slot, but slot was already used for the disk page, so they use different terms. So if we want to bring a page in from disk, we need to find a free frame in the buffer pool. If there are no free frames, then we have to evict one of them to free up space and put the thing we need in there. The database system runs some kind of eviction policy, whether it's LRU or CLOCK, the things we talked about last semester, or something more sophisticated like IBM's ARC, or other policies that keep track of how likely it is that a candidate victim page will be needed in the future. You have to run some policy to decide which page to evict to make space for the new one. And this is a good example of what differentiates the open-source systems from the commercial systems. The open-source systems would be Postgres, MySQL, SQLite, versus the commercial ones like Oracle, IBM DB2, or Microsoft SQL Server. This eviction policy is part of the secret sauce for a lot of the commercial systems, and they do way more sophisticated things than the open-source guys. They'll look at statistics and history and make decisions about which page to evict: is it an index page or a data page, what kind of data is it, what is the query? There's a whole bunch of extra stuff you can consider when you decide how to evict a page. And the idea, obviously, is that you want to evict the page that will have the least impact on your performance, the page you're least likely to need in the future. All right, so now we run a policy and we find a page that we want to evict. If that victim page is dirty, then the database has to write it out to disk. And then you have to do a bunch of other checks, which we'll talk about in a second for logging: you want to make sure that the log record for the transaction that modified the page is written out to disk first, before you write the page. There's all this extra overhead we're doing to move data back and forth, just to get one page in.
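Here is a rough C++ sketch of the page-table lookup path just described. The helper routines (PickVictim, FlushLogUpTo, the disk I/O calls) are stand-ins for what a real disk-oriented system would provide; the point is just the sequence of steps: check the page table, evict a victim, respect the write-ahead-logging rule if it is dirty, read the new page in, and update the table.

    #include <cstddef>
    #include <cstdint>
    #include <mutex>
    #include <unordered_map>
    #include <vector>

    struct Frame {
      std::vector<char> data;
      uint64_t page_id = UINT64_MAX;  // UINT64_MAX means the frame is empty
      bool dirty = false;
      uint64_t page_lsn = 0;          // LSN of the last log record that touched this page
      int pin_count = 0;
    };

    class BufferPool {
     public:
      explicit BufferPool(size_t num_frames) : frames_(num_frames) {}

      // Returns a pinned frame holding page_id; the caller must unpin it later.
      Frame *FetchPage(uint64_t page_id) {
        std::lock_guard<std::mutex> guard(table_latch_);  // a latch, not a lock
        auto it = page_table_.find(page_id);
        if (it != page_table_.end()) {                    // hit: page already in memory
          frames_[it->second].pin_count++;
          return &frames_[it->second];
        }
        size_t victim = PickVictim();                     // eviction policy (LRU, CLOCK, ARC, ...)
        Frame &f = frames_[victim];
        if (f.page_id != UINT64_MAX) {
          if (f.dirty) {
            FlushLogUpTo(f.page_lsn);                     // write-ahead-logging rule: log first
            WritePageToDisk(f.page_id, f.data);
          }
          page_table_.erase(f.page_id);
        }
        ReadPageFromDisk(page_id, &f.data);
        f.page_id = page_id;
        f.dirty = false;
        f.pin_count = 1;
        page_table_[page_id] = victim;                    // update the indirection layer
        return &f;
      }

     private:
      // Trivial stubs; a real system would skip pinned frames and do actual I/O.
      size_t PickVictim() { return next_victim_++ % frames_.size(); }
      void FlushLogUpTo(uint64_t /*lsn*/) { /* force the WAL up to this LSN */ }
      void WritePageToDisk(uint64_t /*pid*/, const std::vector<char> & /*d*/) {}
      void ReadPageFromDisk(uint64_t /*pid*/, std::vector<char> *d) { d->assign(4096, 0); }

      std::mutex table_latch_;
      std::vector<Frame> frames_;
      std::unordered_map<uint64_t, size_t> page_table_;
      size_t next_victim_ = 0;
    };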
So then, once the page is in memory, the system translates the on-disk address to the in-memory address, where it exists in a frame in the buffer pool. So let's look at an overview here, okay? We have our buffer pool, this is in memory, with three frames, and then we have our heap files for the database, and these are out on disk, organized within each page using a slotted layout. Then we have our index, and for simplicity I'm going to say this index exists entirely in main memory, so we don't have to worry about fetching its pages or writing them out, but in a real system the pages for the index would also be stored in the buffer pool. Any page that's not an internal data structure, anything that's part of the actual data itself, will be stored in the buffer pool. And then we also have this page table here, and the page table is the indirection layer we use to find the page that we need. So say our query comes along, it does a traversal down the index, and we end up with a record ID that consists of the page ID and the slot number. We can then go to the page table and check whether the page holding this tuple is in memory or not. So let's say this query wants to access page one. The page table says, no, it's out on disk, here's the location where it is. So now we're going to set a latch on this entry, because we know we need to go fetch that page and bring it into memory, and we don't want anybody else to try to do the same thing at the same time. Now we need to run our eviction policy, because we only have three frames, and we need to bring a fourth one in, so we've got to evict one of them. Say our algorithm runs, and for whatever reason it decides that page number two is the one we want to evict. We set another latch on it in the buffer pool, because we want to make sure that when we evict it nobody tries to start reading it or writing to it, since this thing is destined to be removed. So then we write it out to disk, where it existed before, and now we can copy in the page we needed and bring it into our buffer pool. Are we done? We should have said no. Why? What's that? Yeah, we've got a different page there now. Right, so now you've got to go back and update the page table and say: if you need page one, here's where it is in the buffer pool. Are we done yet? Sort of obvious, right? We've got to release all our latches, so that the next person can come along and do the same thing. So every single tuple access has to go through this process, right? Because when you execute the query, you don't know whether the tuple you're trying to read is in memory or not, and you use the page table as an indirection layer to find the thing you're looking for. So you always have to translate the tuple's record ID into its memory location, and the worker thread, the thing that's actually doing all these checks, has to pin pages to make sure they're not swapped out to disk while it actually needs them. So say in that last example the query actually wanted to update the contents of page one: it would then pin it to make sure that nobody else could swap it out. All right, so now let's go to concurrency control.
So I said before that in the 1970s, RAM was really limited, disk was really slow, and they were dealing with a single-core CPU. The database system would allow multiple transactions to run at the same time, but they didn't do this to take advantage of multiple cores, because they only had one. It was really to allow the system to still make forward progress even though one transaction might stall because the data it needed was not in memory, right? So your transaction is running along; in my example, I need to access page one, page one is not in the buffer pool, so I have to stall while the system goes and fetches the thing that I need. What the system would actually do is stall the logical thread, the logical query execution, but then use the core to execute other transactions that were blocked while the other one was running. So the idea here is, again, when one transaction stalls because the data it needs is not in memory, you allow other ones to run and still make forward progress, and this improves the utilization and the responsiveness of the database system. Because otherwise you'd have all these queries running, and every time they tried to touch something that wasn't in memory they would stall, and the system would look like it was locking up all the time. So the concurrency control scheme for the database system is responsible for deciding how to interleave the operations of concurrent transactions in such a way that it appears they're running one after another, when in actuality we're running them simultaneously. And that means we're going to have to set locks and latches in our system to provide the ACID guarantees we need for our transactions. The other key thing about a disk-based system is that all the locks, the logical locks in our database, have to be stored in a separate data structure, separate from the actual data. It may not be obvious why you want to do this. So you have a lock table where you have the locks for tuples, and then you have the data table where you actually have the data for the tuples. Why would you want to do this? Let's say I need to access a tuple. I'm running two-phase locking, so I need to know whether I'm allowed to read it; I have to acquire a shared lock. If the lock were actually stored with the data and that tuple got swapped out to disk, then in order for me to figure out whether I can read the tuple, I'd have to stall and wait for the tuple to get fetched. In these systems, the lock table is always stored in memory and never gets swapped out. So I can go check the lock table and try to acquire the shared lock for the tuple I need to read; that's always going to be in memory. It may be the case that once I acquire the shared lock, I try to go read that tuple and I get blocked because it's not in memory, but at least I can decide whether I can acquire that lock without having to go to disk. So that means I'm doing two separate lookups before I can read a tuple: I go to the lock manager and grab the lock, and then I do another lookup in the data itself to get to the tuple. So, quick show of hands: who here knows the distinction between a lock and a latch? All right, not all of you, good. Okay, so someone give me a definition of a lock. You took my class, so yes, you. Me? You, yes.
You need to acquire the lock to access the data you want to operate on, and only one transaction can hold it at a time. Okay, so is that a lock or a latch? Did you just give a definition of a latch? Okay, let's talk about it. So if you're coming from an operating-systems background, it can be confusing, because what we're going to refer to in this class and in database systems as a latch is what operating systems refer to as a lock, right? But there's a clear distinction here, at least from my point of view. A lock is something you acquire on the logical contents of the database, and it protects one transaction from other transactions. That transaction may be running across different threads, it doesn't matter; it's at a logical level. So I can acquire a lock on the database, I can acquire a lock on a table, I can acquire a lock on a tuple, I can acquire a lock on an index. And you're going to hold these locks, if you're doing two-phase locking, for the duration of the transaction: in strict two-phase locking you hold them until the very end, otherwise in regular two-phase locking you can give them up earlier, in the shrinking phase. The other key thing is that any changes we make to anything we acquired a lock for, we have to be able to roll back. So if I modify a tuple that I hold a lock on, that transaction may end up having to abort, and I have to be able to reverse my changes. A latch is the low-level primitive that we're going to use to protect the internal data structures of the database system. So think of it like a mutex; they're pretty similar concepts. We use mutex-like primitives to protect these internal data structures. So when I acquired a latch on the page table for that one entry, that's a latch, not a lock, because it's not a logical part of the database, it's an internal part of the system. The latches protect the critical sections of the database system's internal data structures from other threads running at the same time. Even if two threads are operating on behalf of the same transaction, we still need to protect them from each other, and we use latches for this. And we only hold the latches for the duration of the operation. If I have to go in and update my entry in the page table, I acquire the latch, do the update, and then release the latch. Another key thing is that we don't need to roll back any changes we make to things we modify while holding a latch, because again, this is the internal state of the system, not the state of the database; they're separate things. So for example, in my earlier graphical example, my transaction comes along, it wants to modify data in page one, and I acquired the lock on that tuple to do that; but then, when I update the page table, I acquire the latch in order to make that change. If my transaction aborts and needs to roll back, I would reverse my modification on the tuple, because that's protected by the lock, but I wouldn't reverse my change on the page table, which was protected by the latch, because that doesn't make sense. I fetched that page into memory; the page table is still correct.
Even though the transaction that did the modification ended up aborting, that state is still correct. So this is from a paper that you'll be reading later on, the survey of B-tree locking techniques; I just want to bring this up now because it's going to permeate a bunch of the stuff we talk about today and going forward. This chart is from that paper, and it was developed by Goetz Graefe. His name is going to come up a bunch of times throughout the semester. He invented the Volcano and Cascades optimizer frameworks, and the Volcano iterator model. He's a very well-known database implementer. And so he has in this table a way to think about what the difference between locks and latches is. Locks separate user transactions from each other; they protect the database contents; they're held for the entire duration of the transaction. We have different modes for the different locks we acquire — this is standard two-phase locking stuff: shared locks, intention locks, exclusive locks, update locks. And because we can have deadlocks across transactions, we have to have additional mechanisms to make sure we can break those deadlocks or avoid them, and the methods you can use are timeouts, aborts, or waits-for graphs. And we keep all this information in our lock manager, at least for a disk-based system; in an in-memory system, we'll see how we change this. Whereas a latch protects threads from each other, for the in-memory data structures and their critical sections. We're only going to have read/write latches. And instead of having a separate deadlock-detection mechanism to break our deadlocks, as we would with regular locks and two-phase locking, we actually have to avoid deadlocks by being careful programmers. So it's on us, the people writing this code, to make sure that we don't have deadlocks between latches; the database system is not going to detect that for us. Now, the confusing thing about locks versus latches is that the way we're going to implement latches, as we'll see in a few lectures, the primitive we're going to use for our latches, is called a spinlock. So it is what it is. And if you take Dave Anderson's class, don't come harass me in the hallway saying a lock is a latch or whatever; I don't need that. They're not the same. All right. So now let's talk about logging and recovery. Most disk-based database systems implement a buffer pool policy called steal + no-force. Steal means that the database system is allowed to evict dirty pages from the buffer pool out to disk for transactions that haven't committed yet. And no-force means that the system isn't required to flush all the dirty pages of a transaction when it commits. To make this work, you have to use a write-ahead log, where you make sure that the log record corresponding to the change made to a dirty page is written out to disk before you write the dirty page itself out. And again, most disk-oriented database systems use steal + no-force with a write-ahead log. So there are two things that cause the logging and recovery mechanism in a disk-oriented system to be slow. The first is that every log entry has to record the before and after image of the record that was modified. You need the before image so that you can undo and roll back any changes.
And you need the after image so that you can redo them. Then, upon recovery, depending on whether your transaction committed or not, the system decides which of those two to apply to a record. The other key thing, and this came up in the paper you read, is that this idea of log sequence numbers, or LSNs, permeates all the different components of the system. The system has to keep track, for each log record, of what its LSN is and which page it modified, and it has to record LSNs in all these different places. You have to know the last LSN of the transaction that modified a record in a page, so you can make sure that log record is flushed before you flush the page. So this notion of log records and sequence numbers is written all throughout the code in the system, because it has to deal with the possibility that at any time data could be evicted and written out to disk. So this one graph here is from the paper you were assigned to read. To me, this really summarizes what we're trying to do this entire semester. And what I'll say is, the paper focused on transaction processing workloads, but a lot of the concepts and bottlenecks it identified in a disk-oriented system still hold true even if you're doing analytical workloads. Maybe not so much logging, because you're not making changes, but everything else still matters. So what they did was take this disk-based database system from the University of Wisconsin called Shore and instrument it so they could measure how much time, or rather how many CPU instructions, were being spent in different parts of the code. If you've never heard of Shore, think of it like InnoDB or RocksDB: a low-level storage manager that you can build a more complete database system on top of. And although Shore was developed in the 1990s, the reason they picked it for this paper is that it had all of the standard components you'd expect in the storage manager of a state-of-the-art database management system: it had logging, it had recovery, it had checkpoints, it had two-phase locking, it had B-trees. So it was emblematic of the existing disk-oriented systems at the time. What they did was break down where the system was spending, or executing, the most instructions. The largest chunk of instructions was spent in the buffer pool: that thing I showed at the beginning where you're checking the page table, pinning things, checking whether you have the thing you need, and updating the page table to record where it is. I should also say, for this experiment they loaded the entire database into main memory, so you're not stalling to go get things from disk; everything is in main memory. And this answers the question I posed in the beginning: if I have a disk-oriented system and I give it a lot of DRAM, is that going to be enough? Do we not need this course? Can we just use that architecture? So, 34% of the instructions are spent in the buffer pool, 14% of the instructions are spent latching, 16% are spent locking, going to the lock manager and updating that, and 12% of the instructions are spent in logging.
And for that, they're not counting the time spent actually writing the data out to disk — they wrote to a RAM disk, so they're ignoring all that and simply counting instructions. And then about 16% of the instructions are spent comparing keys while traversing the B-tree. I didn't talk about this here; it will come up more next class when we talk about predicate evaluation. This is still a big issue in an in-memory system. Basically, if I have an integer and I want to check whether it's in my tree, I have to do comparisons at every single node as I traverse down to the leaf node to see whether my entry is there. Doing those comparisons in an inefficient way can account for a lot of time. So what does this leave us? This leaves us with a paltry 7% of the instructions actually spent doing useful work. Useful work would be actually executing the logic of the transactions they were trying to invoke; I think they were running New Order and Payment from the TPC-C workload. So what this is telling you is that for a database that fits in memory, there's all this extra stuff we're doing because the system doesn't know that it's dealing with an in-memory database. The architecture is predicated on the assumption that it's going to have to go out to disk and fetch things, so it does all this extra work we talked about before, because at any time a transaction could have a page fault and the thing it needs might not be there. Think about it another way: say you take MySQL, you have a 100-gigabyte database, and you give it a 100-gigabyte buffer pool. Everything at that point is entirely in main memory. But MySQL doesn't know that you're not going to come back a week from now and load another 100 gigabytes and exceed the amount of DRAM that you have. So the system is set up based on the assumption that it's going to have to spill to disk at some point. And that's why we're taking this course: we have to figure out how to remove all these bottlenecks so we can expand the useful-work slice of this pie chart. As I said, this is focusing on OLTP, but you have the same issues in OLAP. Yes? You still have the latching and locking, right, even in in-memory databases? What did I say? Sorry, say it again. You still have the latching and locking, right? Yes, so the question is whether in an in-memory database we still have latching and locking. You're absolutely right. So this part, the buffer pool, you can mostly get rid of. For locking, as we'll see in a second, the way they implement it, because they need to separate the locks from the data, you're doing another lookup in a lock table and updating that. As we'll see throughout the semester, there are actually much more optimized implementations where you can store the locks directly inline in the tuples, and then you can use compare-and-swap to make fast, low-level changes to those locks without having to take a mutex. Logging you can't get rid of, but I'll say — and they're not measuring the time to flush out to disk; I'm jumping ahead here — we actually don't need to store the undo information, only the redo, and we don't need LSNs, because we don't have dirty pages. So yes, some of these things we can't get rid of, but because we're in memory, we can come up with more optimized implementations. And that's essentially this course.
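To illustrate the "store the lock inline with the tuple" idea mentioned above, here is a small C++ sketch of a lock word embedded in a tuple header and manipulated with compare-and-swap, so acquiring it does not require a mutex or a separate lock-table lookup. The encoding (0 = free, max = write-locked, otherwise a reader count) is just one possible scheme chosen for this example.

    #include <atomic>
    #include <cstdint>

    struct TupleHeader {
      // 0 = unlocked, UINT64_MAX = exclusively locked, any other value = reader count.
      std::atomic<uint64_t> lock_word{0};

      bool TryLockShared() {
        uint64_t cur = lock_word.load(std::memory_order_relaxed);
        while (cur != UINT64_MAX) {                       // no writer holds it
          if (lock_word.compare_exchange_weak(cur, cur + 1,
                                              std::memory_order_acquire,
                                              std::memory_order_relaxed)) {
            return true;                                  // registered one more reader
          }
          // cur was refreshed by the failed CAS; loop and try again
        }
        return false;                                     // a writer holds it; caller backs off or waits
      }
      void UnlockShared() { lock_word.fetch_sub(1, std::memory_order_release); }

      bool TryLockExclusive() {
        uint64_t expected = 0;                            // only succeeds if no readers and no writer
        return lock_word.compare_exchange_strong(expected, UINT64_MAX,
                                                 std::memory_order_acquire,
                                                 std::memory_order_relaxed);
      }
      void UnlockExclusive() { lock_word.store(0, std::memory_order_release); }
    };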
Okay, so now we can talk about in-memory databases. My definition of an in-memory database is one where the architecture assumes that the primary storage location of the database is permanently in main memory. So that means that when I want to do a lookup for my tuple, I don't need to check a page table to find out where it actually is; I know how to go directly to the memory location of the thing that I want. I will say we'll relax this later in the semester when we talk about bringing disk back, but that will be nearly the last lecture of the semester; for now we're going to assume that everything is in main memory. Although in-memory databases are in vogue now, it's certainly not a new idea. There was a lot of groundbreaking work done by Dave DeWitt and others at the University of Wisconsin in the 1980s, but all that work was essentially prototypes and simulation, because back then DRAM prices were really high and the amount of capacity you could get in a single machine was quite low. Now that has totally changed, and it's certainly feasible to have enough memory to store large databases. Yes? About the word "permanently" — is someone using an in-memory database assuming the plug never gets pulled, that nothing catastrophic ever happens? Okay, so the question is, by my word "permanently" here, what do I actually mean? If someone pulls the plug, is the data magically still going to be there? No, right? We're still dealing with volatile DRAM: you pull the plug, everything is gone. But we're still going to have protection mechanisms to make sure that we don't lose any data. So we're still going to provide all the ACID guarantees you would have in a disk-oriented system, but the architecture is going to assume that when I do a lookup for a tuple, it's going to be in memory. And this allows me to avoid a bunch of the extra work we had to do in a disk-oriented system. Now there's a new class of hardware called non-volatile memory, where "permanently in memory" might actually be true, but we'll ignore that for now. Okay, so if disk is gone — disk was the main bottleneck we had to overcome in a disk-oriented system — if the disk is now gone, all these other bottlenecks show up that we didn't have to deal with before, and this is all going to be in software. Before, in a disk-oriented system, because going to disk was so slow, you didn't care as much if your software was not as efficient as it could be. But now that everything's in memory, this exposes a bunch of other warts and issues you have to resolve. So we're going to see big problems with locking and latching; cache-line misses now become a big deal — having to go to memory, as we'll see in a second, is three or four times slower than going to your L3 cache. Pointer chasing is a big deal: doing a lookup in a virtual function table, if you're using C++, starts to get expensive, especially if you're doing a large scan over a large dataset. Predicate evaluation — this is related to the B-tree key-comparison thing I mentioned before — trying to compare whether something equals something, or evaluating more complex predicates, starts to become expensive depending on how you implement it, because now you can start hitting pointer chasing and other problems.
Data movement and copying will be a big problem, especially if you have a multi-socket machine. If I'm trying to scan two tables, or scan a single table that's split across memory banks on one socket and banks on another socket, I don't want the threads on one socket trying to retrieve data from the other socket's DIMMs, because then I've got to go over the QPI interconnect, and that's going to be slower than accessing the DIMMs that are close to me. Making extra copies and calling malloc all the time becomes a big problem; we're going to see a lot about how to deal with this. And then lastly, the network also becomes a big bottleneck. Now, this class is not going to be about distributed databases, so we're not focusing on the communication between two nodes in the same cluster for a single system. It's really the overhead of the application code communicating with the database system. Say I'm running a transaction and I have my application server running PHP or Java or Python, whatever; it starts a transaction, executes a query, the database system executes it and sends back a response, and then the application does some more processing before it sends the next query. Because we have an open transaction, any locks that we acquired for that first query have to be held by the system while all that network traffic goes back and forth. So this becomes a big bottleneck as well. The typical way to solve this is to use stored procedures, where you basically take the application code and embed it inside the database system, and now invoking a transaction becomes a single RPC request. A lot of the academic papers we're going to read this semester make the assumption that everybody uses stored procedures. In reality, we've done a survey of DBAs, and they're not as common as academics might think, because from a software engineering standpoint it's difficult: now you have some of your application logic in the application server and some of it stored in the database server, and how do you make sure those stay in sync? So stored procedures aren't the solution to everything. They're one way to solve this, but in general we can't avoid this bottleneck. So, some numbers you should be mindful of throughout the semester as we talk about things: the latencies of different storage devices. For SRAM, your last-level L3 cache, you can do reads and writes in roughly 20 nanoseconds. Going to DRAM, depending on whether it's a DIMM that's local or on another socket, is roughly around 60 nanoseconds, so about three to four times slower. So if you have a cache miss, it's not great, but it's not the end of the world. When you start going to the non-volatile devices is when things get crazy. For an SSD, a read is now about 25,000 nanoseconds; a write is about 300,000. The reason the writes are slower than the reads is that there's an indirection layer in the hardware itself that decides which block or page to write your data into, and then there's wear leveling and garbage collection; it's a complicated thing running directly in the hardware. For an old spinning-disk hard drive, it's roughly 10 million nanoseconds, or 10 milliseconds.
Newer drives can maybe do five milliseconds, but it's still orders of magnitude slower than the devices on the other side of this chart. Because an in-memory database doesn't have to deal with that side of the graph, we can do more sophisticated things in software to take advantage of in-memory storage. And all the things I showed on the previous slide, all the bottlenecks we have to deal with, become more prevalent because we're now dealing with storage latencies at the nanosecond level. It can be hard to wrap your head around what 10 million nanoseconds means, so Jim Gray had a nice analogy that I like to use for the difference between L3/SRAM and a spinning-disk hard drive: reading something from L3 is like reading a book that's on the table in front of you; reading something from a spinning-disk hard drive is like flying to Pluto to read that same book. Again, it's hard to fathom because we're dealing with really short timescales, but disk is orders of magnitude slower, and we want to avoid it as much as possible; if everything's in memory, we can do things much faster. And that's what this course is really about. All right, so let's talk about how an in-memory database is going to organize data. We no longer need slotted pages, because we're not worried about packing data into a single page that we can then write out to disk. But we're still going to organize our tuples into blocks or pages, because that's easier for bookkeeping. You don't want to call malloc for every single tuple you insert and then write your data into it. You're going to do one malloc to acquire a large block; in our system it's 10 megabytes. We picked that number somewhat arbitrarily; we haven't done an investigation to see how much it matters, and I don't know what other systems do, but again, it's not one kilobyte. So we allocate a large block and then we start writing tuples into that block, and that's how we organize everything. Because everything's in memory, in some cases we can use direct memory pointers instead of the indirection of record IDs, the page ID plus slot number. If you're using multi-versioning, you can't use direct memory pointers, and we'll see why as we go along. We also now have a distinction between fixed-length and variable-length data, and we're going to have separate memory pools for each of these. In a disk-oriented system, if you have a variable-length field, they'll try to pack it into the same page where the rest of the data is; if it's a really large value, like a text field, they may split it into separate pages — Postgres calls this TOAST storage. For our purposes with in-memory data, we're always going to have separate pools. We're also going to maintain checksums for our pages in memory, and this protects us from software errors, like one thread writing to some memory address it shouldn't and trashing our database. Because everything's in memory, any thread can write to memory that's in the same process, so we need to make sure we catch those problems.
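As a small illustration of the fixed-length versus variable-length split just described, here is a C++ sketch where fixed-length attributes live inline in the tuple and variable-length values live in a separate pool that the tuple points into. The pool, the example table schema, and the 10 MB block size are made up for this example; a real system would use a proper slab allocator and worry about concurrency and reclamation.

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <string>
    #include <vector>

    // A toy variable-length pool: one big buffer that hands out length-prefixed
    // regions. It is reserved once up front so the returned pointers stay valid.
    class VarlenPool {
     public:
      explicit VarlenPool(size_t capacity) { buffer_.reserve(capacity); }

      const char *Allocate(const std::string &value) {
        size_t offset = buffer_.size();
        uint32_t len = static_cast<uint32_t>(value.size());
        buffer_.resize(offset + sizeof(len) + len);
        std::memcpy(buffer_.data() + offset, &len, sizeof(len));
        std::memcpy(buffer_.data() + offset + sizeof(len), value.data(), len);
        return buffer_.data() + offset;
      }
     private:
      std::vector<char> buffer_;
    };

    // Every tuple of this hypothetical table is exactly sizeof(Tuple) bytes, so the
    // i-th tuple in a block is just pointer arithmetic, no matter how long the name is.
    struct Tuple {
      int32_t id;          // fixed-length, stored inline
      float balance;       // fixed-length, stored inline
      const char *name;    // 8-byte pointer into the VarlenPool
    };

    int main() {
      VarlenPool pool(10 * 1024 * 1024);   // e.g. one 10 MB block
      std::vector<Tuple> block;
      block.push_back(Tuple{1, 99.5f, pool.Allocate("a much longer string value")});
      block.push_back(Tuple{2, 10.0f, pool.Allocate("x")});
      // block[i] always sits at offset i * sizeof(Tuple) within the block.
      return 0;
    }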
Now, the operating system also organizes memory in pages. This is something we'll discuss as we go along; it's actually something you can tune in the kernel for your process — you can use huge pages, which has benefits and disadvantages — and we'll discuss this more throughout the semester. So let's go back to that diagram we had before for our database organization. We no longer have the page table, because we don't need it; everything is organized in these blocks. Now when I do my lookup in my index, depending on the concurrency control scheme I'm using, I could get back a memory address, or I could get a logical number telling me which block to go to. But basically this is going to point directly to some memory location that is the starting point for my tuple. And the first place I land is the fixed-length data block. Fixed-length attributes are things like a 32-bit integer or a 64-bit float: no matter what the value is, the length of the attribute is always the same. If I have a 32-bit integer, whether the value is zero or one million, it's always 32 bits. So all the fixed-length data is stored in this block here. I'm showing a row store; we could also have a column store, it doesn't matter. But for anything that is variable-length, instead of storing the value inline, I'm going to store a 64-bit pointer that points to some location in the variable-length pool that contains the actual value of that attribute. We do this because it makes it easier for us to jump to different offsets in the fixed-length block, and we don't have to worry about packing tuples in as we would in a slotted page: if we have 10 slots, we can always write 10 tuples, no matter what the lengths of the values are. We do this for bookkeeping reasons and efficiency reasons. And for the variable-length pool, we can use a slab allocator, something like jemalloc, to allocate these. Typically what you do is have different arenas, so you'd say: here's a bunch of memory I can use for anything less than one megabyte, here's one for anything less than 10 megabytes, and so forth. Various systems do different things; we'll cover that later. All right. So now at this point, the question I always get — typically not in this class, but from people in industry who maybe don't know databases as well — when we talk about in-memory databases is: why not just use mmap? Why spend all this time managing memory ourselves when we could use mmap and it would solve all our problems? So, who here knows what mmap is? Anybody who took my intro class should know what mmap is, because I spent, like, ten minutes complaining about how terrible it is. Memory-mapped files are a feature provided by the operating system that essentially allows you to take a file that's on disk, open it with mmap, and say that the contents of that file are now mapped into the address space of your process.
What happens is, anytime a thread tries to touch one of those memory locations and it's not in memory, you get a page fault, and the operating system goes and fetches that page from disk and puts it into memory for you. So essentially the operating system is now in charge of figuring out what's in memory versus what's on disk. So the question is, why not just use that instead of having an explicit buffer pool, and avoid the page table and all that extra overhead? The issue, as we'll see on the next slide, is that we're essentially giving up control to the operating system over things that we want to manage ourselves. You can use hints like madvise to tell the operating system which pages to keep pinned so they don't get swapped out, and you can use msync, which essentially says: I've written to this page, make sure it actually gets flushed. This seems like exactly what we want, but there are very, very few database systems that actually do this. The most famous one is probably MongoDB: when MongoDB first came out, they used mmap as their storage manager, and then they ended up ditching it later on when they bought WiredTiger. WiredTiger is like LevelDB or RocksDB, a multi-threaded storage engine that MongoDB now wraps around; it was started by the people who built BerkeleyDB, who are excellent database people. Another famous mmap database is MonetDB. MonetDB is a column store, a database system built at CWI in Europe. It's a good example of an academic system that actually made it out into the real world and that people actually use. And then there's LMDB, the Lightning Memory-Mapped Database, which is an embedded database — again, like WiredTiger or RocksDB, something you can embed inside your system. So that's not very many. Now, there are some systems, like InfluxDB, that I found out use mmap, but they only use it for read-only data; they don't actually write any data through it, whereas MonetDB and LMDB definitely use mmap to write data. So, MongoDB raised a huge amount of money — I think they raised about a quarter of a billion dollars — and then last fall they went IPO. So they have gobs and gobs of money, and they have excellent engineering people there; I know a lot of those folks and they're solid, very talented. With all that money and all that talent, they couldn't make mmap work well, and they ended up going to WiredTiger and ditching mmap entirely. So now when you use MongoDB, by default you get WiredTiger. I think the mmap engine is still there for legacy reasons, but it's not really under active development. With MongoDB's nearly limitless resources, they still ended up ditching mmap. There are other cases, too, where database companies have told me, yes, we started with mmap — the HyPer guys in Germany actually did this; HyPer is a system we're going to come across a lot in this course. They originally started with mmap, and then they ditched it. And the reason is that you're essentially giving up all fine-grained control over what's in memory versus what's on disk to the operating system. So what happens when I try to access a tuple that's in a memory-mapped page and that page is not in memory? I get a page fault. The operating system blocks my thread while it goes and fetches the data that I need and brings it in, and I can't do anything with that thread while that happens. I'm basically giving up complete control to the operating system.
But in the buffer pool case, with a buffer-pool-based system where the database manages its own memory, when I check my buffer pool to see whether the page I need is there and it's not, I can tell another thread: hey, go fetch this thing and tell me when it's actually in memory. And then my thread can go off and do other work for other transactions or other queries. Right? Yes? Even in the buffer pool case, under memory pressure, the physical frame can be pushed out to disk, right? Because the DBMS runs as a process, and the process has a page table that maps virtual addresses to physical locations in main memory. And under memory pressure, say a new process starts and there isn't enough main memory left, the OS can swap pages from main memory back out to disk. So even with the buffer pool, we could end up in that case. All right, so the statement is that the operating system can decide to swap out the pages of a process when it starts running out of memory. So even if the database system has its own buffer pool, its memory could still get swapped out. That is correct. But honestly, if you've actually run out of memory, you're out of luck anyway, right? mmap versus your own buffer pool manager isn't going to save you there. Right. So why not use mmap and read the whole database into main memory up front, so that you don't take a disk access the first time you access each page? Okay, so the statement is: what if I take mmap for my in-memory database, mmap the entire file, scan across everything to fault it all in, and use madvise to pin it all? Isn't that the same thing? You're absolutely right — because that's just mallocing a really large space; it's essentially the same thing. But then you're not getting any benefit from mmap, because you're just pinning everything in memory; it's equivalent to malloc. The other difference is that with mmap, the data in memory has to match exactly the contents on disk. Maybe you want things to be compressed on disk and uncompressed in memory; you can't do that with mmap very easily — you'd have to make an extra copy to make that work. But you're right: if you mmap the entire thing and pin the entire thing, it's the same as mallocing it. But then are you really using mmap for anything? The other thing I'll say — and the HyPer guys have told me this — is that there are various syscalls around mmap that are not portable, so things that work in one version of Linux don't work exactly the same in another. Supposedly — I should find out what this actually is — there's some patch that IBM wrote for the Linux kernel that solves a bunch of the problems with mmap and would actually make it something we could use for a database system, but it was rejected by the Linux kernel maintainers. So I would say this is something I'm extremely interested in; it comes up all the time. Non-database people ask me, why can't I just use mmap? Or a new database startup says, yeah, we're thinking about using mmap, and then all the old database guys say: no, no, don't do that, you're going to have a hard time.
And then people learn the hard way. This keeps happening over and over: people keep trying mmap. Splunk is another example — they started with an mmap-based database and ended up abandoning it once they found out how painful it was. There's no good paper that says, here's exactly why this is a bad idea; I want to write that one at some point. The bottom line is that a well-written database system is always going to know best — it's always going to know better than the operating system, because it knows what the queries are, knows what the workload is, and can make better decisions. Yes? So why did MongoDB insist on using mmap, given they have tons of money and talent? So the question is, why does MongoDB insist on using mmap given all their resources? They don't; they ditched it. The default is WiredTiger, which doesn't use mmap. They originally did it as a shortcut: instead of having to write a buffer pool manager, which is hard, they used mmap. But it still remains as an option? For legacy reasons. They don't advise you to use it. If you have paying customers who want it, you keep it around for them, but they certainly don't recommend it. Okay. So if I die, put it on my tombstone: never use mmap for your database. Okay, let's keep going. So, concurrency control: the big observation now is that the cost of acquiring a lock is no longer negligible compared to the cost of actually reading the tuple in memory. In a disk-based system, I had to go check the lock manager, go to the lock table and update that, and then go actually get the tuple, get the data. But if everything's in memory, then what I really want to do is just go access the tuple. And there are two ways this can come about. One is that, if I'm doing fine-grained locking, I can embed the lock in the tuple itself, so now I potentially have only one cache miss to acquire the lock and read the tuple. Or, if I'm using coarse-grained locking, which I think the paper discusses, then I can have no locks at all except for a single lock on a partition or shard, and then a single-threaded execution engine can run without worrying about simultaneous transactions and run very quickly. I'll also say about that paper — and I actually really like it; in some ways it was the motivation for the systems we built, H-Store and VoltDB — that it's a bit of a paper designed to justify certain decisions that were made in the H-Store and VoltDB projects. So I don't entirely agree with their argument that you want to use coarse-grained locking everywhere. For some cases you actually do, but not for everything. For this class, we're going to focus on fine-grained locking, because most of the papers we read this semester do fine-grained locking and solve the problems that come with it. So again, as we talked about before, we can store the locks directly with the tuple, and that helps with cache locality, so we're not taking as many misses. And we're going to see this throughout the semester: mutexes are slow, don't add mutexes. If you think you need to add a mutex, ask me, because the answer is probably no. Instead we'll use spin locks with compare-and-swap, because that's much more efficient.
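Here is a minimal C++ sketch of the kind of spin latch being referred to: a test-and-set loop on an atomic flag instead of an OS mutex. This is deliberately simplistic; a production latch would add backoff or an MCS-style queue, which comes up later in the semester.

    #include <atomic>

    class SpinLatch {
     public:
      void Lock() {
        // Spin until we flip the flag from false to true; we expect whoever
        // holds the latch to release it very quickly.
        while (locked_.exchange(true, std::memory_order_acquire)) {
          // busy-wait
        }
      }
      void Unlock() { locked_.store(false, std::memory_order_release); }

     private:
      std::atomic<bool> locked_{false};
    };

    // Typical usage: guard a short critical section on an internal data structure,
    // e.g. updating a hash table bucket or a block directory entry, then release.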
So again, unlike before: in a disk-based system we had simultaneous transactions so that when one transaction stalled on disk, another one could make progress. In this modern environment we have multi-core and many-core CPUs, and now we're not worried about stalling on disk; we have all these different transactions on different threads running at the exact same time, potentially accessing the same data. So a lot of the problems with isolation and atomicity still matter in our database; it's just that the reason we have to deal with them is different. The next thing to talk about is indexes. In the 1980s at Wisconsin, they came up with specialized indexes designed for in-memory databases. These were designed at a time when the speed of CPU caches was pretty much the same as main memory, so they made certain decisions aimed at reducing the amount of memory being used, and they did much better than a disk-oriented index because of this. In particular, the example we'll talk about later is called the T-tree, like the letter T; when you see the data structure, it'll be obvious why it's called that. But then, over the 1990s, CPU caches got much faster than main memory, and some of the design decisions of the in-memory indexes from the 1980s no longer made sense. So most of the indexes we'll talk about look a lot like the disk-based indexes, but tweaked in ways that improve cache performance. What we'll see in one paper is that the B+tree, originally designed for disk-based systems, actually still performs really well in an in-memory system. So those specialized in-memory indexes are no longer used; everyone pretty much uses something that looks like a B+tree. Another important difference for in-memory databases with respect to indexes is that we're not going to log any changes we make to the index. Remember I said in the beginning, in that buffer pool example, that the index pages are stored in the buffer pool as well. So in a disk-oriented system, anytime I modify the index, I have to write a log record that says here's the change I made to the index, and then I have to flush that index page out at some point in the future. In an in-memory database, we're not going to do that. The reason is that we want to avoid all the runtime logging overhead of writing out changes to the index. Because what's going to happen is, when we restart the system, we're going to load a checkpoint to populate the database in memory, and that essentially requires a sequential scan across the entire table anyway. Since reading from disk is the slowest thing, it's not a big deal for us to build the index on the fly as we populate the table. So as far as I know, no in-memory database actually logs changes to its indexes to disk; we just rebuild them when we crash and restart, and that's good enough. Yes? Are mutexes not preferred because they block the thread? The question is, are mutexes not preferred because they block the thread? Exactly, yes. Anytime you make a syscall, you're in the operating system, and the operating system is our enemy, right? We want to avoid it as much as possible. Or, it's not our enemy, but it's like our parent, right?
Yes? Are mutexes not preferred because they block the thread? The question is, are mutexes not preferred because they block the thread? Exactly, yes. Anytime you make a syscall, you're in the operating system. The operating system is our enemy, right? We want to avoid it as much as possible. Or it's not our enemy, but it's like our parent, right? If we're drinking with our buddies in high school, or dealing with drugs or whatever, we don't want our parents to find out. We want to do as much as possible without telling the operating system what's going on. Because if you block on a mutex — so a futex, the fast user-space mutex in Linux, is a spin lock combined with a mutex. You try to acquire the spin lock; if you can't acquire it, you drop down to a regular mutex, and now you're telling the operating system, hey, my thread can't do anything, don't schedule me. That's going into the kernel, and we want to avoid that as much as possible. So a spin lock is preferred.
So why not a futex, since the futex does the spinning initially and only drops down if it can't get the spin lock after some time? Right, so you try to spin, you can't get it, you drop down. We don't want to drop down at all, right? We just don't want to do that. Okay. This will come up, I think, next week. Yes? So we prefer spin locks because we're assuming there's another thread that's going to make progress and free up the lock? Yes. These are for latches, right? We use spin locks for latches. We're assuming the operation we want to do is short, so whoever holds that latch will do the operation, pop out, and be done. You don't want to grab a latch and then go over the network and do something, right? We're not talking about that. It's: I'm going to update my hash table, then release my latch right away. So at some point you'll get the latch and make progress. So using a spin lock in an environment with transactional data makes sense because we're assuming the duration for which the lock is held is very short. Right. We're using latches for the data structures; we're not using them for the high-level locks. For those, you use deadlock prevention or deadlock detection — the standard two-phase locking techniques. For the latches, you just spin until you get it, and then you do your work. Good point. And we're going to end up implementing MCS latches, the ones where you can reduce the cache-line bouncing and things like that. There are more specialized variants — I think the spin lock in Linux actually uses MCS as well.
Okay, so keep moving along: query processing. The way you execute a query in a disk-based system changes when everything's in memory, right? On a spinning-disk hard drive from the 1970s, or even now, doing sequential access is much, much faster than doing random access. Even on an SSD, it's still not exactly the same. But in memory, although sequential access is still preferable because the prefetcher can help you, it's no longer a major performance difference. So certain strategies for how we execute queries, do scans, and so forth will change because we're actually in memory. Another important concept is how we actually process the query — how we execute the operators. The classic tuple-at-a-time or Volcano model that we would use in a disk-based system, where you're calling next() from one operator to the next, is actually going to be really slow because of all these function calls. And so we'll see next class how to compile queries to avoid all these function calls and speed things up. This is going to be a much bigger issue for OLAP queries, because we're reading possibly millions or billions of tuples; for OLTP it's not as big a deal.
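Here's a minimal sketch of the contrast — hypothetical interfaces, not any system's real API. In the Volcano model every tuple costs a virtual next() call per operator; a query compiler would instead emit one tight loop:

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Minimal sketch (hypothetical interfaces): the Volcano model pays one
// virtual Next() call per tuple per operator in the plan.
struct Operator {
  virtual bool Next(int& out) = 0;   // pull one tuple at a time
  virtual ~Operator() = default;
};

struct Scan : Operator {
  const std::vector<int>& data;
  size_t pos = 0;
  explicit Scan(const std::vector<int>& d) : data(d) {}
  bool Next(int& out) override {
    if (pos == data.size()) return false;
    out = data[pos++];
    return true;
  }
};

// What a query compiler would emit instead: one tight loop, no virtual
// calls, so the prefetcher and branch predictor can do their jobs.
long CompiledSum(const std::vector<int>& data) {
  long sum = 0;
  for (int v : data) sum += v;
  return sum;
}

int main() {
  std::vector<int> table = {1, 2, 3, 4};
  Scan scan(table);
  long volcano_sum = 0;
  for (int v; scan.Next(v);) volcano_sum += v;  // one call per tuple
  std::printf("%ld %ld\n", volcano_sum, CompiledSum(table));
}
```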
All right, so logging and recovery. As the earlier question asked — are we assuming that the database is permanently, physically stored in memory? Of course not. Obviously, if you lose power, all the contents of memory are blown away. So we still need to maintain a write-ahead log, or some other logging data structure, to make sure that if we crash and come back, all our changes are durable. If we tell the outside world your transaction committed, we'd better make sure that data is safe and we can always bring it back. Although this is not specific to in-memory databases — you do it in a disk-based system too — we're going to use group commit to batch up our log records and amortize the cost of calling fsync to flush things out. We'll see later in the semester, when we talk about logging schemes, that there may be more lightweight methods we can use than the standard ARIES approach we discussed last semester. The key thing is that we only need to store redo information. We never need to undo, because we don't have any dirty pages that get written out. So when a transaction commits, that's it, we're done. We just need to be able to redo it; we never have to undo it. And because there are no dirty pages, we don't need to maintain LSNs throughout the system. It's always easier to implement these things.
We're still going to use checkpoints, like in a disk-based system. We do this because we don't want to replay the entire log after a crash, but there are a bunch of ways we can now play around with checkpoints that we couldn't possibly do in a disk-based system. Here's one example from the HyPer guys — they actually don't do this anymore. The idea is that because we're in memory, we can just fork our process, and the operating system will do copy-on-write: anytime the parent modifies a page, the OS makes a new copy of that page with the change. So the forked process has a consistent snapshot of the database in memory that it can then write out to disk. You can't do this in a disk-based system, because some things are in the buffer pool and some things are on disk — the database is bifurcated across two different storage mediums.
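Here's a minimal sketch of that fork-based checkpoint idea — my simplified illustration, not HyPer's actual implementation:

```cpp
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

// Minimal sketch (hypothetical details): the child sees a copy-on-write
// snapshot of the parent's memory at the moment of fork(), so it can
// write a consistent checkpoint while the parent keeps running.
static int database[4] = {1, 2, 3, 4};  // stand-in for the in-memory database

int main() {
  pid_t pid = fork();
  if (pid == 0) {
    // Child: its view of `database` is frozen at the fork.
    FILE* f = std::fopen("checkpoint.bin", "wb");
    std::fwrite(database, sizeof(database), 1, f);
    std::fclose(f);
    std::exit(0);
  }
  // Parent: keeps mutating; the OS copies pages lazily as they change,
  // so the child's snapshot is unaffected.
  database[0] = 99;
  waitpid(pid, nullptr, 0);
}
```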
Moving along — I'll cover this more later in the semester, but everything I've said so far assumes that everything fits in memory. At the end of the semester we'll see techniques for bringing the disk back in some cases, to support databases that exceed the memory available on a single machine. The idea is that we want to do this without bringing back all the slow stuff we got rid of in the first place — the stuff that was slowing things down in the paper you guys read. So for hot data, you keep all of that in memory, and for cold data that's not going to be updated, it's okay to write that out to disk. The difference is that the database system assumes any data you need is actually in memory, and only when it's not does it hit the special case of going out to disk. In a disk-oriented system it's the opposite: it assumes the thing you need is not in memory. The way to think about this: say you post on Reddit or wherever. The articles that just came out today, the ones everyone's commenting on — that's all hot. That needs to be in memory, because that's what you're updating as new comments come in. But nobody's going to go back and comment on an article that's 100 days old, because no one will see it. That data, you can flush out to disk. And if you later need to run an OLAP query — count the number of comments per article over some period — that's read-only, and you can bring that data back into memory as needed.
I'll mostly skip this next part, though I've alluded to it before: there's a new type of memory coming out, called non-volatile memory, or storage-class memory, or persistent memory. Think of it as something that looks like DRAM — it sits in a DRAM slot — but is a different storage medium that remains persistent after you lose power, at almost the speed of DRAM. This is something we've been researching a lot here at CMU. My PhD student Joy has basically written the definitive guide on how to build a non-volatile memory database, which I'm pretty excited about. The joke has always been that non-volatile memory is always two years away. I first got interested in this stuff around 2008, when HP announced that they had found the memristor and said everyone would have it in two years. Two years later, it was another two years, and another two years. Now, in 2018, my friends at Intel tell me it's actually coming out for real this year. We'll see. We're excited. Actually, I know it exists; I know companies have it — we just don't get it yet.
We'll come back to this later in the semester, but the reason I bring it up now is that if you see this in the news during the semester, you might think: am I wasting my time taking an in-memory database course when I should really be taking a non-volatile memory course? The answer is that all the techniques we're talking about this semester are still applicable in a non-volatile memory database. You don't want a buffer pool manager; you want to implement your queries a certain way — all of that is still valid. And I'd say that when NVM comes out and becomes prevalent, a bunch of the stuff we'll talk about this semester is going to get pushed down into the intro class in the fall, because all the material from last semester about sequential scans and buffer pool management is, I think, going to get scrapped, and everything will follow the things we're talking about today. So we're talking about current in-memory databases now, but a lot of this architecture will roll over to non-volatile memory databases when they arrive. One more thing: if you have non-volatile memory and an in-memory database, you can make it work with minor changes. If you have a disk-based database system and you get non-volatile memory, it's going to be a major rewrite — a major software engineering effort — to switch it over. You saw this in the paper you read: they talked about how modifying the system to strip out all the different components was really, really difficult. All right, so we only have a few minutes left.
I'll skip the high-level review of these systems here, but I just want to say these are some of the in-memory databases that will come up throughout the semester. These up here — Oracle TimesTen, Dali, and Altibase — were the original in-memory databases from the 1990s. TimesTen and Altibase are still available. Then, since maybe 2008 or so, there have been a ton of other in-memory databases, like SAP HANA — there are a ton of them. And then there are a bunch of systems we're going to read about throughout the semester: Hekaton, Silo, HyPer, MemSQL, DB2 BLU, and Apache Geode, which I think is the old GemFire system that Pivotal dumped. I don't know how good it is; I've never used it. But these systems will come up throughout the entire semester. They all implement the features we're talking about in this class in different ways, so it'll be good to compare and contrast them. Again, I'm going to skip this for now. I'll just say that Altibase and P*TIME are very influential — P*TIME is really awesome. Both of these come from Korea; the Korean guys were awesome at in-memory databases. SAP bought P*TIME for HANA, and now there's an awesome in-memory database lab in Seoul that works on HANA, staffed by all the ex-P*TIME guys, which is cool. Okay, these will be on the website. I'll skip this — I don't think you really care about it.
So, for the last few years, the bulk of my energy has been spent building a brand-new open-source in-memory hybrid relational database management system called Peloton. Hybrid means we want to support the transaction processing workloads you read about in the paper, as well as analytical workloads where you're reading a lot of data at once. So there's a bunch of features here that may not make sense now, but will make sense as we go through the semester. These are the kinds of things we've actually implemented in the system. We have latch-free multi-version concurrency control. We have a latch-free Bw-Tree index. The Bw-Tree data structure was originally proposed by Microsoft for their Hekaton system, but it wasn't open source, so we had a student who took this class build this index over the summer, which is not easy. And we actually just got a paper accepted at SIGMOD this weekend that discusses it. I was all excited — I thought the Bw-Tree was this amazing data structure, clearly what you want for an in-memory system — and then it ends up getting crushed by a B+ tree, right? But that's okay. We'll discuss more of this in the recitation tomorrow, and then in class on Wednesday, but we also have a new LLVM-based execution engine. We do code generation and query compilation: instead of interpreting a query plan, we generate machine code with LLVM that is built to run exactly that one query, and then we execute that. Again, I'll cover that in a second. We have a tile-based storage manager, which means we can do row stores and column stores — there's a little sketch of that idea below. We're multi-threaded, meaning we can have multiple transactions running on different threads at the same time. We don't support intra-query parallelism yet, so we can't take a single query and break it across multiple threads; we're working on that, and I hope to have it in a week or two.
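Since the row-store versus column-store distinction matters all semester, here's a minimal sketch of the idea — a hypothetical layout, not Peloton's actual tile format:

```cpp
#include <array>
#include <cstdint>

// Minimal sketch (hypothetical layout): a "tile" holds a group of tuples,
// and each tile can independently choose a row or column layout.
constexpr int kTupleCount = 1024;

struct RowTile {  // NSM: whole tuples together (good for OLTP point access)
  struct Row { uint32_t id; float price; };
  std::array<Row, kTupleCount> rows;
};

struct ColumnTile {  // DSM: one array per attribute (good for OLAP scans)
  std::array<uint32_t, kTupleCount> ids;
  std::array<float, kTupleCount> prices;
};

// An OLAP-style scan over the column tile touches only the `prices` array,
// so it pulls far fewer cache lines than scanning whole rows.
float SumPrices(const ColumnTile& t) {
  float sum = 0;
  for (float p : t.prices) sum += p;
  return sum;
}

int main() {
  static ColumnTile tile{};  // static: keep the big arrays off the stack
  tile.prices[0] = 9.99f;
  return SumPrices(tile) > 0 ? 0 : 1;
}
```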
We used to have write-ahead logging and checkpoints. The logging has been rewritten three times now, and all three times it ended up broken and not working, and we had to remove it. This is something I'm very interested in having someone work on this semester, so we can finally have write-ahead logging for real. We have a Cascades-style query optimizer. This may not mean anything to you yet, but it's a big, big deal. This comes from the Goetz Graefe work I mentioned before, from the Volcano stuff — there's a paper he wrote in the 1990s that you'll be reading this semester. It's a framework for generating and optimizing query plans. In my opinion, it's the best way to do it, and it's actually the way SQL Server does it; we'll end up reading a paper arguing that SQL Server has the best query optimizer. I think our implementation is already better than Postgres and MySQL. We don't support everything yet, but we're set up to do some great things. We support zone maps on blocks, and just last semester we added preliminary support for PL/pgSQL UDFs — user-defined functions — which we can compile in the LLVM engine, which is pretty cool. The system currently supports some of SQL-92, not all of it, but we're slowly adding features as we go along. Again, I'll probably spend some time in the recitation tomorrow on the background history of Peloton. Basically what happened is we took Postgres, ripped out the bottom half and rewrote it, and kept the top half — so we got all of SQL, the networking layer, the catalogs, and everything. It turned out the top half was really slow too, so we ripped that all out as well, and now it's basically written from scratch. That's why we've had to redo all these things.
All right, project one. For project one, you're required to implement three SQL string functions: upper, lower, and concat. You might think these are really simple — why are we having you do this for the first project? Obviously we all know how to write upper. It's really for you to understand how this new LLVM engine all works and how to stitch everything together. For concat, you only need to support two arguments, even though the real concat takes multiple arguments; lower and upper each take one argument. Again, the idea is for you to get your feet wet in the system — do something small that's well laid out and guided — and then build out your knowledge of the system so you can do more sophisticated things. The project essentially has three steps; it's all laid out in the project specification. In the first step, you implement the string functions: string comes in, make it lowercase, send it back out — really simple (there's a sketch of this below). The next step is to register those functions in the catalog. What happens is, say you use upper in a SQL query: the binder does a lookup in the catalog and asks, do I have a function called upper, and if so, where do I find it? That way it resolves the function for you automatically — otherwise you'd get a "function not found" error. The last step is probably the trickiest one: you add the function proxies that allow the execution engine, the LLVM execution engine, to invoke your C++ code.
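For step one, here's roughly what the functions amount to — a minimal sketch with hypothetical signatures, since Peloton's real string functions operate on its internal value types rather than std::string:

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Minimal sketch of step one (hypothetical signatures, not Peloton's
// actual function interface).
std::string Lower(const std::string& input) {
  std::string out = input;
  std::transform(out.begin(), out.end(), out.begin(),
                 [](unsigned char c) { return std::tolower(c); });
  return out;
}

// The project's concat only needs to handle two arguments.
std::string Concat(const std::string& a, const std::string& b) {
  return a + b;
}

int main() {
  return (Lower("ABC") == "abc" && Concat("foo", "bar") == "foobar") ? 0 : 1;
}
```

The real work in the project isn't this logic; it's wiring the function into the catalog and the codegen proxies so a compiled query can reach it.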
So what's going on here with the proxies? The code you see in the engine's source is not actually what executes when you run your query. Executing that code generates new code, which then gets executed to run your query. Right? So the proxy is a way for the LLVM-generated code to actually invoke a C++ function. I'm seeing a lot of blank faces — does that make sense? We generate the code we want to execute for our query, we compile that into binary machine code, we invoke that, and it in turn invokes your function. Okay, so that's the idea; that's why we're having you do this. The lecture on Wednesday will be all about what query compilation and code generation actually look like, so this will fit into a larger frame of mind.
All right, so we're providing you with a basic C++ unit test case that you can extend with whatever testing you want to do. We also have a simple SQL batch script that you can pipe into psql on the command line to see whether you're getting the correct output. One thing I'll say about the current LLVM engine: it does not support index scans. So you can't have a WHERE clause like WHERE id = something when you have an index on that id. If you don't have a WHERE clause, you don't have to worry about it — it'll always do a sequential scan, and you'll hit your LLVM code correctly. So again, we're providing you with this basic stuff, but you should definitely extend the C++ unit test case or the batch script to do more complex things. When we grade your submission on Autolab — yes, I know it's broken, we're working on it — we're going to run additional tests that go beyond what we've given you, and that's what we'll use for grading. We'll try to cover all the corner cases — although there aren't many for string functions — and make sure you do things correctly. We're also going to run the GCC sanitizer to make sure you're not leaking any memory. And we'll run your code through the clang-format script to make sure you follow our formatting guide. There's one thing I hate more than mmap — or at least as much as I hate mmap — and that's tabs. Do not use tabs. You will lose points; I'm going to throw an error if you use tabs.
So, this was in the write-up all along: Peloton only builds on 64-bit Linux and OS X. Someone posted about having problems with High Sierra on OS X. I don't have a Mac, so I can't test this for you. For OS X, make sure you use Xcode version 8.3. Maybe we can fix that — it has to do with the version of LLVM we're using, so hopefully we can bump up the LLVM version and then you can use newer versions of Xcode. If you don't have a machine with these environments, we provide you with a Vagrant file that you can use to spin up a virtual machine on your local development box, and it will set up the entire development environment for you. There's one command — you run vagrant up in the directory — and it downloads the image and configures everything automatically. It's pretty good. This is Carnegie Mellon, so I assume everyone here has access to a machine they can use for development. If you don't, please email me and we can figure something out. But we don't have lab machines already set up to run everything, like you do in some of the intro classes, okay?
All right, so the project will be due a week from now, on January 29th at midnight, and once we get Autolab up and running you'll be able to submit and test your implementation. The recitation is tomorrow afternoon at 5 p.m., and Sean and I will go into more detail about how the different parts of the code fit together. If any of this doesn't make sense, please come to that, okay? Any questions about project one? Yes? You mentioned the C++ unit test case — is that the string functions test file, or is there a separate file? For the grading, or for what we give you? It's the string function one; it just shows you how to do it. Okay — when we submit on Autolab, I know you said you're going to run extra tests, but will we see our final score? Yeah, you'll see your final score, and it'll break things down — you won't see what the extra test was, but you'll see whatever output you print out, and it'll tell you what you failed and your score, right? Yes, Aaron? How long will the recitation last? The question is, how long will the recitation last. As long as you need it. We're thinking, you know, until maybe 9 p.m. Okay.
All right, so my parting thoughts: disk-based databases are rather in the past. I mean, MySQL and Oracle are probably deployed way more times than in-memory databases — disk-based databases aren't going away — but if you're building a new system from scratch, I think in-memory is the way to go. There are always the outliers — there's always the Amazons and the Facebooks with these really large datasets where they have to go to disk — but for our purposes here, we're going to assume everything is in memory. And certainly the time has come; people are now comfortable with the idea of an in-memory database. When I first started in grad school, we went on a sales call to a large Silicon Valley company with Mike Stonebraker for VoltDB, and they were aghast at this idea of storing your database in memory. They said, we don't trust it, we don't want to do this. But now people are more comfortable with it, so it's no longer an issue. And again, when I die, put it on my tombstone: never use mmap.
Okay, so the next class is sort of an anomaly for the semester, because we're going to be mostly talking about transaction processing at the very beginning — all the OLTP stuff, concurrency control, and indexes. But I've moved this lecture that I normally teach at the end of the semester — the query compilation stuff — to be the third lecture, because I think it's so important, and it's going to permeate everything we talk about this entire semester. It's certainly the core thing in our system that differentiates it from other database systems you may be familiar with. So I'm going to front-load that topic now and then switch over to transaction processing and concurrency control. But we're going to see this throughout the semester; it's going to come up over and over again. Not very many systems actually do this — maybe a small handful; Spark actually does this now, right? So it's important enough that I want to talk about it early, okay? Any questions? All right guys, see you on Wednesday.