It's going to be a video and it might look kind of weird. You think so? Yeah. Hold on, that doesn't, like, it just says MemSQL. Who cares? All right, let's do this.

All right, guys, last class, last semester. As I said, today's going to be broken up into two parts. So we'll quickly go through some of the things we have to deal with for the end of the semester, and then we'll talk about what topics are going to be covered on the final exam on Wednesday. And then Ankur Goyal is here from MemSQL, and he's going to give a tech talk about some of the stuff that they've been working on. Hopefully you'll see that it's an example of a lot of the ideas we talk about in class actually in practice.

All right, so the major things we've got to deal with for the rest of the semester: the final exam will be this Wednesday in class. It'll be an hour and a half, the normal class time, and it's just going to be three short answer questions. And actually, I'll pass this around. There'll be two copies of a sample exam, with two questions, to give you an idea of what the real exam will look like. One copy will be just the straight questions, and the other copy will be with the solutions, if you want to practice with them. And as you'll see, we'll talk about it in a sec, but I'm not looking to do, like, "does MVCC do this, true or false?", sort of simplistic things. It's really meant for you to synthesize the ideas and topics and try to apply them to solve other kinds of problems and questions. And I'll talk more about that in a sec.

And then we have the final presentations in our allotted final exam slot on May 6th. It's a Friday, and it'll be from 1 PM to 4 PM. And the basic idea is, again, that you're going to present what the final outcome of your project was. Maybe do a demo like we did last time, show some benchmarking numbers using the MemSQL machines. It's just a way to wrap up everything you've done in the semester so far.

And then, as we also covered last class or last week, the code reviews have sort of two due dates. The first one will be the pull request due date, where you have to provide the pull request to your partnering group to look at and do the review. And then they have three days, essentially, to do their review and provide you feedback. And then you will have two days after that to make any changes that they requested, and then send us the pull request for the final code drop.

So for the final code drop, as I said, I'll provide a post on Piazza with more documentation of what's expected, and we'll come up with the order in which everyone has to do the code drop. Some of you have already been merging stuff back into the master, so that'll make our life a little bit easier. But the basic idea is that you have to provide us a pull request that can cleanly merge in and pass all the tests that you provide as well as the existing tests. And then you have to make sure that any feedback you're getting from the code review has been applied. You have to actually make those changes, as long as they're within reason. If someone writes some kind of bullshit that says, oh, you should have written this in Scala, we can ignore that. But if you have a bunch of printf statements and they say, get rid of them, you should be getting rid of those things. You also need to include all the test cases you have, and these need to be included in the test suite. So they can't be these one-off scripts that you have to run on the side. When we call make check,
it definitely should be running your new test cases, no excuses.

And then I'll also provide another write-up about this: for any new feature or thing you implement in the system, we want to be able to post on the wiki page and say, here's this new thing that so-and-so wrote, here's what they've done, here's how to actually use it. Like if there's a config knob you need to enable, or a compiler flag you need to enable. We want some basic documentation that says what was actually done. Because then what I'll do is, on the course website, I'll write a summary of all the cool things that you guys have done and provide it as a showcase for the projects. So again, I'll tell you more about what I'm going to need: I'll need a description of what you've done, any cool screenshots of what you've done. I'll include the PowerPoint slides from your final presentation. And then, very important, I'm going to include all your names and whatever LinkedIn page or whatever address you want it to go to. And that way, when people come along later on and say, well, here's all the cool projects they did in this class at CMU, here's all the kids that wrote it, they'll know who you are. Sort of helping you guys advertise yourselves. So again, the final code drop will be May 11th, because I think this is the date I have to provide final grades to the registrar. But hopefully we can get people to start merging stuff in sooner, so we don't have all this bottleneck at the last moment.

For the extra credit, the website is unfortunately still not ready. So instead, what I'm going to do is provide you with a JSON schema of the taxonomy for your assigned database system, or the database system you chose, and you should be able to write your article within the JSON. It's not ideal, but then once the website's ready, we can just take that and suck it right in. And again, I'll send another post about how to do this. This extra credit assignment for the encyclopedia should really only take you a short evening. It's not something I want you guys to really stress about. But you do have to provide citations and everything for all the information that you find, so you're not just making things up.

And then lastly, before we get to the final exam, it's really important for you guys to help fill out the course evaluation. You guys are essentially the guinea pigs for me, because it's the first time this course has been taught at CMU in 10 years. It's also the first course that I've designed from scratch. So I'm really just trying to figure out what works and what doesn't work. I know, because when I was an undergrad and when I was in grad school, I didn't take the course evaluations seriously. I just said, this gave me diarrhea, or something weird like that. But I implore you to please actually provide me useful feedback, because that's going to help me make the course better. There are things that I already know were problems with the course that you guys don't have to bother with. I know that Project 2 was really hard. I know that Autolab was a big pain in the ass to actually use; it never worked correctly, it never updated the scores. These things you don't need to tell me. I also know that you guys wanted better feedback, or wanted the grades in a more timely manner, so I plan to do that better in the future course. And I also realize, and this is especially important for people in the front row here, I know I have bad hygiene.
I've been seeing doctors about it, and I'm on a special diet now, so hopefully this will get better. So you don't need to write that, that I smell terrible, in the course evaluation. So again, the things I'm looking for: if you thought the lectures were too deep or too broad, if you thought the material that was covered was not helpful, if you thought that the way we organized Project 3 was not helpful, if you wanted more guidance about certain aspects of the course, that's the kind of thing that I'm looking for. So please also do this, not tomorrow; do this after the final exam and after you've done the final presentation, so that way you sort of have the full view of what the course is all about. Because there's still a lot of things we have to do from now until the end of the semester. OK? All right, so any questions about these things?

All right, the final exam. As I said, everyone should have two copies of the sample, one with solutions, one without solutions. The final exam on Wednesday will be three questions, short answers or short essays, in the same way that I've written the sample questions provided to you. It's closed notes, and I would say it's more of an emphasis on the bigger ideas that we covered in the course, rather than knowing exactly what an individual paper we covered actually did. So the course, as I said before, is sort of broken up into two sections, right? The first part was, I'd say, the core fundamentals of in-memory database systems. So we covered things like concurrency control, the storage models, the query processing models, join algorithms, logging schemes, and then OLTP and OLAP indexes. So these are also the core topics that are going to be covered in every single question. And then we can go a little bit deeper and talk about all the things that came after the halfway point in the course, where we started talking about the more advanced things. So we talked about different compression schemes. We talked about different ways to optimize query execution. We talked about query compilation and code generation. There was the C++ code generation approach that MemSQL and some other guys were doing, and then there's a more modern version using LLVM. There's the scan sharing techniques, where you can have multiple queries combine their query plans and be able to reuse results. And then there was vectorization, or SIMD stuff. And then, we never really had a single lecture on this topic, but these concepts kept recurring throughout the semester: the different hardware optimizations that we can take advantage of in the system. So we spent a lot of time talking about cache locality and NUMA regions. We spent a lot of time talking about vectorization and SIMD instructions. So these are the things that, if you have a general understanding of what they do and what their trade-offs are, then you should be OK. That's all I have for that. Any questions?

So there are some things I know I'm not going to ask about. So for example, query optimization and query planning, because it's kind of hard to give you a question about that. And it's not like when you take an undergrad course, where I say, your table has this many tuples and this other table has this many tuples; how many blocks do you have to read, or how many tuples are you going to evaluate? They're higher-level things that are trying to see whether you understand the deeper concepts.
Because you can always look up in a textbook what the big O cost is for these algorithms. It's more important to understand the bigger principles. And you had a question? Yeah, you have a question or no? Are you going to post the PowerPoint? I'll post the PowerPoint, yes, later today. I'm not going to post the sample exam. OK, that's it. And remember, it's closed notes, and we're doing a blue book, so bring a pencil or a pen.

So, ready? Yes. I think there's a VGA adapter for that. Oh, you need this, too. Make a joke about being from San Francisco. I can't answer that, but that's OK. So I'm super happy to have Ankur Goyal here today. He is part of the reason, actually he is the reason, we have the machines you guys are able to use for the class project, donated from MemSQL. So Ankur is actually a CMU alum, and he was also something like employee number three or five at MemSQL. But he only graduated from CMU last fall. He left CMU after his third year, went to San Francisco to figure out what was going on in his life, and then he was able to convince the registrar that he had enough credits to actually graduate, and he's done, right? Yeah. Props to him also for being very, very helpful. So he's technically only been out of school for six months, but he's been at MemSQL for several years now. So he's here to tell us about the cool stuff that they've been working on.

Cool, yeah. Nice to meet everyone. A lot of familiar faces in the room as well. Is the audio all good? OK, cool. So I like to always give this caveat at the beginning: please do feel free to raise your hand or shout out or whatever and ask questions during the talk. I definitely prefer an interactive style of doing a talk rather than saving all the questions to the end. And the other thing is there's a lot of content in here. So we could easily go on for like eight hours and talk about all of this stuff. But feel free, if there's an area that you're particularly interested in, Andy, or you students are interested in, we can dig into it in a lot more depth.

So today we're going to try to answer basically some questions about in-memory databases. And I like to think of this talk as kind of guided by how we do things in MemSQL, but not necessarily pinned to how MemSQL is built. So some of the questions we're going to answer: what is an in-memory database? Hopefully my understanding is consistent with Andy's, but it might be a little bit off. And also, all right, there are in-memory databases; why do they matter? Why do people care? And then the fun questions: how do you actually build one? And if we have time at the end, how do people actually use MemSQL, and why would they pay us for something like MemSQL? And the order in which we're going to cover this stuff: we're going to look at what in-memory databases are first, and then a number of architectural topics around in-memory databases and how MemSQL is architected. And then, like I said, if we have time, we can talk about MemSQL in the wild, or go over that in Q&A.

So really briefly about myself, thanks Andy for the nice introduction. I was here at CMU from 2008 to 2011. I graduated, technically, in 2015. I did a little bit of research at PDL while I was here. I also did an internship at Microsoft, and I have been at MemSQL since 2011. Currently, I am the VP of Engineering. And I really do love databases.
The really cool thing about databases is that they kind of encapsulate all of the really fun stuff in computer science. There's everything from AI in the optimizers, where nowadays people are looking at machine learning and cooler and newer forms of AI even in databases, down to systems stuff like operating systems and networks. So if you like computer science and you like building really hard stuff, you will never get bored working on databases.

So let's start. What's the first question? What is an in-memory database? So how many people in this room think an in-memory database uses memory instead of disk? That's not quite right. How about: in-memory databases do not need to save data on disk? I think Andy probably killed this one. Yeah, so that's definitely not right. How about this one: put the whole data set in memory? That's probably the most common misconception. Some in-memory databases do. Until in-memory databases became very, very commercially viable, this was the case. But we'll talk about why this isn't always the case and why it actually doesn't always make sense. So if you look at Wikipedia, kind of the authoritative source of truth, what is an in-memory database? An in-memory database is a database that primarily relies on main memory for storage.

So let's think about what that means. First of all, I think no one in this room was confused about that: in-memory databases are durable to disk and respect ACID. The model that we like to use in MemSQL for thinking about in-memory storage is that you kind of have two options. You can spill data to disk. This is also called anti-caching, but it's kind of the same thing as caching. The difference between spilling data to disk and caching is that when you spill data to disk, you try to keep as much of it in memory as you can, and when you sort of run out of space, that's when you start spilling it to disk. And the other option is to pin data in memory. And sometimes it makes sense to spill, sometimes it makes sense to pin. Each comes with an associated set of trade-offs, and it also depends on what underlying data structures you're using.

In general, in-memory databases are databases with trade-offs suited to systems with a lot of memory. So one thought exercise that we like: if you have a database running on this MacBook Pro, versus a database running on the machines that we sent over, which have lots of memory, an in-memory database and kind of an old school database could run on both systems. But an in-memory database might not win on the MacBook Pro, whereas an in-memory database should win on a system with a lot of memory. So it's more about trade-offs than it is about what kind of system you can run it on.

In-memory databases tend to be distributed systems, for kind of two reasons. The first is that modern databases are distributed systems, because commodity hardware and the cloud and dot, dot, dot; there are a lot of reasons. The second is that memory is still pretty expensive, and so you can string together memory on a lot of machines very cost effectively with a distributed system. It becomes kind of exponentially more expensive to do so if you vertically scale a single machine to give it more and more memory. And the last thing: in-memory databases tend to have a different set of bottlenecks. And that's kind of where all the fun is in building an in-memory database. You're building the same stuff:
storage, execution, optimization, all of this stuff. But the bottlenecks in the system are different, so you have to think about trade-offs and problems a little bit differently.

So this background leads me to kind of a bold claim. The bold claim is that all database workloads will be running on in-memory databases. Now, I really, really believe this to be the case, and I'm also really, really biased, but hear me out. So first, memory is getting cheaper, by about 40% every year. What we see in the market is that every year when we go to customers and sell them MemSQL, the type of machine that they want to run MemSQL on, which is kind of the economically optimal machine, has more and more memory. So nowadays, it's very, very easy and cheap to get a machine with 512 gigabytes of RAM, and it's almost impossible to get a machine in the enterprise which has less than 64 gigabytes of RAM. And there's this blog post that we wrote, and it's kind of proving to be true: cache is the new RAM, RAM is the new disk, disk is the new tape, et cetera. What's really funny is that people are actually starting to build cache-only databases and benchmarks on cache-only databases, kind of like people were thinking about in-memory databases maybe 20 years ago.

Another thing is that in-memory databases, and again, we'll look at why, have some unique ways in which they can leverage SSDs and get spectacular performance out of SSDs. And the thing that SSDs are good at is basically everything except random writes. And with an in-memory database, you can actually architect it so that you avoid random writes altogether and really, really maximize what you can get out of an SSD. The next thing is NVRAM is coming, and could be cheaper than even SSD. NVRAM is projected to come sometime in 2017, and this is a big deal, because in-memory databases are spectacularly well positioned to take advantage of NVRAM, which is non-volatile RAM, for those who are not familiar. Traditionally architected databases, the way that they think about buffers on disk, don't really align with the trade-offs that you get with non-volatile RAM, so they're not really well positioned to use it.

And really, what does this mean? In-memory databases are nothing more than standard databases tuned to modern hardware and modern workloads. So not only are in-memory databases designed around the hardware that's available today, but those of us who are building in-memory databases commercially are tuning and thinking about how we build them around workloads that are popping up today and becoming more and more dominant every day in the market. The big one that you could think about is real-time analytics and thinking about real-time data. This is a new trend today, but it's becoming more and more prevalent every day, and in-memory databases tend to be the databases that can actually handle these kinds of workloads. Any questions about in-memory databases and sort of the general stage-setting?

All right, so let's talk about how you actually build one of these things. We're going to look at a few topics. I know Andy really wants to talk about code generation; we'll make sure that we spend plenty of time there. But we'll start with just talking about in-memory storage, transactions and concurrency control, and crash recovery and replication, which we can go over very quickly; it's pretty simple stuff.
And then code generation is really interesting, and distributed execution is also really, really interesting.

All right, so in-memory storage. Let's first think about what in-memory gives you. The first thing with memory is that you have insanely fast random reads and writes. That is kind of the defining thing about memory that makes it very, very different than a disk. The other thing that you have is very fast atomic writes. So you can compare-and-swap with the granularity of a single byte or even two bytes, and you can do this very, very performantly. Doing this on a disk, or even an SSD, is prohibitively slow. Another thing is that the working space is precious. So like I said, RAM is still pretty expensive. Nowadays, you don't need to be so careful about how much data you store on disk, because for a company to double the amount of disk that they have available to a machine is an extremely straightforward proposition. You can cheaply add disks in a RAID configuration. You can cheaply add volumes on something like Amazon EBS, Elastic Block Storage. And you even have object stores that run across the network, like Amazon S3 or HDFS, which are approaching the performance that you have with a locally attached disk. So getting more storage on disk is not that expensive, and it can actually even be done online. So it's really easy to expand disk storage, but with RAM, that's not the case. It's pretty expensive. The last thing, which we'll talk about in a reasonable amount of depth here, is that in-memory storage gives you a very different set of advantages for row stores and column stores, and that's kind of because of the workloads that you use for row stores versus the workloads that you use for column stores. And so we believe you should think about these a little bit differently.

So if we look at in-memory row stores: well, row stores have a lot of random reads and random writes. In fact, row stores are basically built to be very fast at random reads and random writes. Furthermore, datasets for row stores are usually pretty small. Row stores tend to track one row per item, and these things tend to be less than 10 terabytes. Obviously there are exceptions, but for the most part, for most Fortune 500 companies, their row store workloads are less than 10 terabytes large. So the solution is to keep the whole dataset in memory. And if we keep the whole dataset in memory, then we can actually use a data structure that's optimized very, very well for memory, called a skip list. And there are a number of advantages that we get by using this kind of data structure in MemSQL. So how many people here have heard of a skip list? I think, did you cover it in class? Awesome, so I'll skip through this. The big thing here is that there are no pages. That particular detail is what enables you to implement a skip list lock-free, and that's very, very significant. Here's a picture of a skip list, which I'm sure everyone here is familiar with.

So what are the common concerns with a skip list? One of them is memory overhead. So you have to store all of these towers. Now here's kind of what it looks like in a B-tree. So this is a clustered index and a non-clustered index. In a B-tree with a non-clustered index, you actually need to store all of the columns that you use in the non-clustered index, as well as some reference to the primary key, whether it's some sort of ID or all of the columns of the primary key.
And so you actually incur a lot of overhead in storing the values of those columns. For example, if you index a relatively large string, you need to store an extra copy of that string. That's expensive. In MemSQL, because we know that it's all in memory, what we can actually do is compress all of this information into the same struct that actually stores the row data. So we know at table compilation time, or table creation time, or even table alter time, how many columns are in the table and how many indexes there are. And we can design the in-memory layout of a row to include not only the columns, but pointers to all of the relevant skip list towers. So if you think about the picture of a skip list, the base or bottom tower of the skip list is shared by every index, and the towers are different for each index. What that means is we don't have to pay the cost of storing columns again, and we don't have to pay the cost of storing those base-level pointers again. And that saves actually a very significant amount of memory. The other advantage is that the cost of going from a secondary index to the data in a row is zero, because once we find a row in a secondary index, we can simply cast to get to the primary key data.

Doesn't this mean you need to re-figure your table row layout every time you add another index? Yes, but we can do it online. I know you do it online, but it's not cheap. It's not cheap, but it's not that expensive either. While it's happening online, you can continue to read and write the table, and you can even read and write the table with the new columns. So it's a surprisingly minimal impact. So is this an abstract representation of a table row, or how do you add a new index? If this is the definition of the structure, how do you add a new index? So if you run MemSQL and you create a table with two columns and two indexes, it will not literally look like this, but it's pretty close.

Now, the way that online alter works is that every row has a little bit of information. We kind of steal bits from different places. And when I say bits, I mean literally bits, from various places in the row that we know aren't necessary for correctness. And we use those bits to store a little bit of information about which version of the row it is. So when you run an alter, we know which rows have been altered and which ones haven't been altered. And on the fly, when we operate on rows which haven't been altered, we can semantically fill in the things that the alter would have changed. So for example, if you add a column, in most systems when you do something like ALTER TABLE ADD COLUMN, you have to specify a default value for the column, so that existing rows which don't have that column can automatically have that default value. And so what we can do is say, OK, if there's an old version of the row, we can just dynamically fill in that column value until the row is properly moved into the new format. So if you never touch a row, you don't have to rewrite it right away? Yeah, that's kind of the point, yeah.

So this makes a lot of assumptions, especially the fact that all rows are represented exactly this way and rows are glued together with pointers. Why can't you do this? Like, I'm not sure why you can't do this with disk. Like, this is basically, I don't know, database-level default value storage, right? So it's not whether I can do it with disk, it's whether I can do it online with disk.
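To make the row layout idea concrete before going on, here is a rough sketch in C++ of a row struct with embedded skip-list towers and a version flag for online alter. This is a hypothetical illustration, not MemSQL's actual layout; the field names, the fixed tower height, and the hard-coded two-column table are assumptions.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch: a row for a table with two columns (id, name)
// and two indexes (primary key on id, secondary on name). A real system
// would generate this layout per table at creation/alter time.

constexpr int kMaxTowerHeight = 8;   // assumed cap on skip-list height

struct Row;  // forward declaration

// One skip-list "tower": the forward pointers for one index.
struct Tower {
    std::atomic<Row*> next[kMaxTowerHeight];
    uint8_t height;
};

struct Row {
    // Version marker used by online ALTER to tell whether this row has
    // been rewritten into the newest schema yet.
    uint8_t schema_version;

    // Column data lives inline in the same struct, so finding the row
    // through any index immediately gives you the column values.
    int64_t id;
    char    name[32];

    // One tower per index. The secondary index does not store a second
    // copy of `name` or a reference back to the primary key; reaching
    // this struct via either tower *is* reaching the row.
    Tower primary_tower;    // ordered by id
    Tower secondary_tower;  // ordered by name
};

// Given a pointer obtained while traversing the secondary index, getting
// back to the full row is just pointer arithmetic: no extra lookup.
inline Row* RowFromSecondaryTower(Tower* t) {
    return reinterpret_cast<Row*>(
        reinterpret_cast<char*>(t) - offsetof(Row, secondary_tower));
}
```

The point of the sketch is only that the column values and the towers for every index live in one allocation, which is what makes the secondary-index-to-row cost zero and lets an alter rewrite rows lazily, keyed off the version marker.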
And so because this is re-stitching the indexes together as it goes, and operating on the lock-free skip list rather than taking page locks as it goes and manipulates rows, there are a few sort of weird cases that it covers that are not impossible to do on disk, but very, very hard, yeah. And in particular, it's very hard to get good performance if you're doing it on disk. So if you have an online alter that's running, and it's acquiring locks over several rows for any row that it needs to touch, it dramatically impacts performance for concurrent reads and concurrent writes going through the table. In this case, the fact that we don't have locks per page basically enables this to be a very lightweight operation while a normal workload runs.

Another quick question. How often do you see, it's hard to quantify, but do you see a lot of alters happening? That's a great question. So a lot of people, especially database implementers, people who have worked on databases for a while, say, you know, alter table is not that important, because from my experience, people never run alters. And it turns out that the reason they never run alters is that alters are so taxing and expensive to run on a database system. And if you ask users, kind of blank sheet of paper, what features really matter to you, they will write alter. And actually, if you give them online alter, they'll start running it all the time. So it's not uncommon for people to run alter table on a production cluster every day. Which, with a system that doesn't have online alter, that's unheard of. It's usually, let me take this thing down once a year, have some downtime, and run alters. Set aside a day for that. Yeah.

All right, so the next thing is scan performance. With a skip list, you're chasing pointers, whereas with a B-tree, you have this kind of nice locality. Now, this locality in a B-tree is there because of disk. So you need to write some page size and read some page size to disk so you avoid random reads and random writes. But it also means that when you bring this data into memory, rows are consecutive in memory, and so you get good locality and good scan performance. If you think about a skip list, and this is kind of the bottom tower of a skip list, this is the logical layout of rows. So this is one row pointing to another row pointing to another row. This might be what you expect the physical layout of these rows to be: they're kind of spread all over memory. And if only this much fits in a cache line, when you go from here to here, you have to invalidate the cache and load this into cache. So you kind of take a cache miss. When you go from here to here, you go back here and kind of repeat this process over and over again. And this can definitely happen to you if you poorly implement a skip list. It turns out that you can actually fix the skip list in the background, and apply a bunch of heuristics as the skip list is recovered or inserted into or written to at all, to make the skip list look more like this. And if you do this, you get basically the same performance that you get with a B-tree, because you're chasing pointers but without cache misses. So the scan performance is another very common concern that people have with skip lists. It turns out that you can actually solve this problem as well. So you do that on a background thread? Is it like calling OPTIMIZE in MySQL? Yeah, yeah, yeah. It turns out that actually the heuristics take you most of the way there.
So with the heuristics and with that background correction, you can actually end up, for the most part, like this. And furthermore, the heuristics tend to align pretty well with the heuristics that you need to get a good-looking B-tree as well. So in practice, it turns out to be less of a problem than one might expect.

So the next one is reverse iteration. As you saw from the picture of the skip list, at least the one that I drew, it's singly linked. Doubly linked skip lists, no one knows how to implement them lock-free, because you kind of have to fix two pointers instead of one. And reverse iteration then becomes a big problem, because in database workloads users expect it, because a B-tree gives you the ability to iterate both forwards and backwards on an index. So one of the most common queries is select star from events order by ID descending limit 10. That gives you the most recent 10 events. And if you have a singly linked skip list, which doesn't support reverse iteration, it's very hard to do that. And so we invented this reverse iterator, which I'll leave as a homework assignment to figure out how it works, which lets you iterate the skip list backwards in amortized linear time. And it's about 1.2 to 1.5 times slower, depending on how that skip list picture I showed looks. We're doing that as a project next year. Cool. And Andy knows the details. He sent me the solution to his homework assignment a while ago.

All right, so next is concurrency control. The big thing with the skip list: no pages means no latches. And so what that means is you can actually implement the underlying data structure, the skip list, to be lock-free. We also have MVCC in MemSQL, where every node is a lock-free linked list. So the skip list is a skip list of lock-free linked lists, and each lock-free linked list is a lock-free linked list of versions. Now, you still do need row locks to implement pessimistic concurrency for things like read committed isolation. And we have row locks in MemSQL, but because it's all in memory, we can use futexes, which are a cool Linux primitive that only cost four bytes and give you very, very performant row locks. Oh, yeah, futexes in the common case aren't a system call. In the contended case you do take one, yeah, yeah. And for row locks, that trade-off appears to make sense. We do have spinlocks in MemSQL for other locks where spinlocks do make sense. And basically, with these building blocks, we're able to implement read committed and snapshot isolation in MemSQL very straightforwardly. So that's it for concurrency control, unless we want to take it a little further.

What about repeatable read? No repeatable read. Read committed and repeatable read are different; read committed is a less strict isolation level than repeatable read. I think people argue about snapshot; it's not clear whether snapshot or repeatable read is stronger. They have kind of different semantics, they allow sort of different anomalies, you know? I think it's more complicated than just lining them up against serializable. Yeah. So do you have the assumption that every table should have a primary key index? I did not understand the question. The question is, do you require that every table have at least one primary key index? Oh, yes. So yeah, every table has a primary key, and if you don't specify a primary key as a user, we'll create a fake one under the hood for you. In the row store, not in the column store.
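As an aside on the futex point above, here is a minimal sketch of what a four-byte futex-backed lock looks like on Linux. This is a deliberately naive illustration, not MemSQL's code: it always issues a wake syscall on unlock, whereas a production lock would track waiters to skip that, and it makes no fairness guarantees.

```cpp
#include <atomic>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// Naive sketch of a futex-based lock: the entire lock is one 4-byte word.
// 0 = unlocked, 1 = locked. Uncontended acquire/release never enters the
// kernel; only contended waiters make the futex system call.
struct FutexLock {
    std::atomic<int> word{0};

    void lock() {
        int expected = 0;
        while (!word.compare_exchange_strong(expected, 1)) {
            // Contended: sleep in the kernel until the word is no longer 1.
            syscall(SYS_futex, reinterpret_cast<int*>(&word),
                    FUTEX_WAIT, /*expected value*/ 1, nullptr, nullptr, 0);
            expected = 0;  // retry the compare-and-swap after waking up
        }
    }

    void unlock() {
        word.store(0);
        // Wake at most one waiter. A production implementation would skip
        // this syscall when it knows nobody is waiting.
        syscall(SYS_futex, reinterpret_cast<int*>(&word),
                FUTEX_WAKE, 1, nullptr, nullptr, 0);
    }
};
```

The relevant property for the row-lock discussion is just the size and the fast path: the lock state is a single four-byte word, and taking or releasing it without contention is one atomic instruction, with the kernel involved only when someone actually has to wait.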
So every operation has to go through an index, right? I mean, every operation has to use an index, right? Do you support sequential scans? The two statements that you claim to be contradictory are not contradictory. You can sequentially scan a primary key. OK. Yeah, and if you don't specify a primary key, then rows have an implicit primary key whose value is the order of insertion. So essentially, we map the case that you're talking about to having an index.

In-memory column stores. So first of all, we don't believe in this idea of a pure in-memory column store. Why? Column stores have big sequential scans and writes, and column stores are built on huge, immutable vectors of data. So what does this mean? It means you should cache the data set in memory. Why did we care about pinning the data in memory with a row store? So that we could use pointers to glue rows together. And if you use pointers to glue rows together, you get all of these benefits that we just talked about, including efficient online alter. With a column store, you don't really have pointers taking you from row to row. You have huge vectors of data. And paying for something like, is this vector of data in memory or on disk, once per 100,000 or a million values, is negligible. And so the cost of actually supporting something which spills, and operating over a cache, is negligible in the context of a column store.

So then the question is, how do column stores benefit from living in an in-memory database? Or why does the MemSQL column store have some advantage over a classically built, disk-oriented column store? Well, the answer is, we can actually use the lock-free skip list to help us quite a bit. And there are two ways that we do it. The first is, we keep metadata in memory. And the second is, we basically have a sidecar row store for fast, small-batch writes, and I'll explain this in a little bit more depth. So the big thing in MemSQL that we really care about with the column store, that distinguishes our column store from every other column store, is very, very, very fast writes. And again, as I said earlier, the workload that really sets us apart, or one of the workloads that really sets us apart, is real-time analytics. And what do you think really matters in real-time analytics? Very, very, very fast writes, especially on small batches of data.

So this is what the metadata in the column store looks like. I think you covered column stores. The general architecture of a column store is to have what are called segments or row groups, or I don't know what term you used, where you take a group of rows, split it up into columns, store each stripe of columns with encoding and compression, and then keep around some statistics for each chunk of columns that you stored, which tell you the minimum value and the maximum value. You can have some other, cooler statistics, like give me a bloom filter over these values, et cetera, and operate on this metadata to basically figure out which segments you need to read and where these segments are. So in MemSQL, the metadata for these segments is actually stored in a lock-free skip list. And that means that operating over these segments can be done really, really fast. So in a lot of systems where you have this metadata on disk, if you need to do something like count the number of rows, or find the minimum value for a column in a table, or even figure out which of these things you need to open up at all, you need to read from the disk to read these metadata values.
In MemSQL, this is all done very, very quickly in a lock-free skip list. The other advantage is that each of these segments that are pointed to by a row in this metadata is immutable. And what that means is, when you write more data into the actual columnar storage, you write a new immutable chunk of data, and the only mutable thing that you need to modify is the skip list. And the skip list is lock-free. Therefore, the implementation of the column store with respect to writes is also lock-free. So having this kind of metadata structure, which is very mutable in memory and very fast, actually has a number of advantages in a column store.

Now, to understand the second benefit, which is that sidecar row store, we need to look at how the column store looks to begin with. So in MemSQL, the column store is this thing called a log-structured merge, and it kind of looks like this. So this is a table, or sorry, this is kind of a log-structured merge of an integer column. And you can see that there are non-overlapping runs of data. Each run is sorted, and you could compute an overall sorted view of this table by doing something very similar to merge sort. So you look for the minimum value across these runs. It's one. So you pick this one, or you pick this one, it doesn't matter; return it, and then advance to the next value in that segment. And you can use that to construct an overall sorted view of the data. And if you read about LSMs, you'll find that LSMs have tunable trade-offs for read and write amplification. So the more of these runs you have, the faster writes are, but the slower reads are. That kind of makes sense; you're looking at more of these chunks. The fewer that you have, the slower writes are, because, say you write 25 into this one, you need to insert it right here, which has implications for this half and this half of the log-structured merge. So writes are slower, but reads are obviously faster.

Now, an LSM on its own enables fast writes to a column store, because what it means is, every time you write to the column store, you don't need to think about potentially re-sorting an entire log-structured merge or an entire sorted run of data. You can insert into the smallest sorted run which is larger than the amount that you're inserting. And so that, from the perspective of relatively small batch writes, is very impactful. And if you think about it in terms of absolute numbers: if you have one huge sorted run of a column store, you can probably write to that thing once a day. That's a pretty common pattern with the sort of large, sorted data warehouse implementations that we see. With just the log-structured merge implementation, you can probably write to a column store maybe once every several minutes or once every hour. So it gets you a lot of the way there, but it doesn't let you write to something maybe every few milliseconds or every second. But if you make the smallest sorted run a lock-free skip list, you can actually do some interesting stuff. So in MemSQL, it kind of looks like this, where you have bigger sorted runs of columnar data, and the smallest sorted run is actually a lock-free skip list. And this lock-free skip list you can write to incredibly quickly. In fact, exactly as quickly as you can write to the in-memory row store in MemSQL, which is able to process something like a million transactions per second on a pretty weak machine.
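To make the metadata-plus-sidecar idea concrete, here is a rough sketch of the shape of this design: per-segment min/max metadata kept in an ordered in-memory structure, segment elimination on reads, and a flush step that drains a batch of recently written rows into a new immutable segment. The simplifications are mine, not MemSQL's: a std::map stands in for the lock-free skip list, segments live in memory rather than on disk, and the table has a single integer column.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

// Sketch of a columnar table for one integer column.
struct SegmentMeta {
    int64_t min_value;            // statistics used for segment elimination
    int64_t max_value;
    std::vector<int64_t> values;  // immutable, sorted columnar data
};

struct ColumnTable {
    // Ordered metadata: segment id -> segment. Entries are only ever
    // added, mirroring "segments are immutable, only metadata mutates".
    std::map<uint64_t, SegmentMeta> segments;
    uint64_t next_segment_id = 0;

    // Sidecar "row store": recent small writes land here first.
    std::vector<int64_t> sidecar;

    void Insert(int64_t v) { sidecar.push_back(v); }

    // Background step: drain the sidecar into one new immutable segment.
    void FlushSidecar() {
        if (sidecar.empty()) return;
        SegmentMeta seg;
        seg.values = std::move(sidecar);
        sidecar.clear();
        std::sort(seg.values.begin(), seg.values.end());
        seg.min_value = seg.values.front();
        seg.max_value = seg.values.back();
        segments[next_segment_id++] = std::move(seg);
    }

    // Count rows in [lo, hi], skipping segments whose [min, max] range
    // cannot contain matches: segment elimination via metadata only.
    size_t CountBetween(int64_t lo, int64_t hi) const {
        size_t count = 0;
        for (const auto& entry : segments) {
            const SegmentMeta& seg = entry.second;
            if (seg.max_value < lo || seg.min_value > hi) continue;  // skip
            auto first = std::lower_bound(seg.values.begin(), seg.values.end(), lo);
            auto last  = std::upper_bound(seg.values.begin(), seg.values.end(), hi);
            count += static_cast<size_t>(last - first);
        }
        // Recent, not-yet-flushed writes still have to be consulted.
        for (int64_t v : sidecar) {
            if (v >= lo && v <= hi) ++count;
        }
        return count;
    }
};
```

In the real system the interesting part is that both the segment metadata and the sidecar are the lock-free skip list described earlier, so small writes and metadata updates never block readers; the sketch only shows the data flow, not the concurrency.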
And what's really cool is that taking batches of data from this skip list and writing them to the column store is about as fast, or even a little bit faster, than writing to the skip list to begin with. So batched operations from this row store to this column store are very fast, and can actually keep up with singleton writes to this row store. So what does that mean? In MemSQL, you can actually run singleton inserts into a column store table, and, as fast as your SSDs will take it, we can flush that to the column store in the background very, very quickly. And this is the thing that enables MemSQL to work very well for real-time writes into a column store, and that is very, very significant for real-time analytics. For the LSM, for durability, do you write it to a log? We're going to talk about that in just a moment, yeah.

So this model also extends very, very nicely to durability. There we go. Let's talk about crash recovery, or durability. So, durability in an in-memory system. Memory is not a reliable medium yet. So if you write a node to a skip list and you restart the machine, you're out of luck; you can't assume that it's still going to be there. And even when it is, say we're talking about non-volatile RAM, there's always a hierarchy. So today, if you look at disk, which is durable, there is still a hierarchy. Amazon EBS is kind of the most premium storage that you can attach to a node. Actually, there's something even more premium than this, which is directly attached SSDs in Amazon now. There's S3, which is a little bit cheaper and a little bit slower. And there's Glacier, which is a lot cheaper and a lot slower. And users still want to take advantage of this hierarchy. So they'll still say, I want hotter data here, colder data here, and extremely cold data here. So even in a world with non-volatile RAM, we as database designers and database implementers can't ignore the fact that there is a storage hierarchy, and users do care about the storage hierarchy.

Another note: to operate at in-memory speed, to be able to write as fast as people expect with an in-memory database, your writes need to be sequential. And this is where that point about SSDs comes in. SSDs have insanely fast sequential writes and insanely fast concurrent sequential writes. This is a big deal. So on a rotational disk, sequential writes are pretty fast, but concurrent sequential writes are not fast, because there's only one needle. On an SSD, that's not the case, and concurrent sequential writes are also very fast. So the combination of those two makes SSDs very, very, very fast, and fast enough at sequential writes for an in-memory database to actually keep up with the workload expectations that you have in the in-memory world. So how does this work? Go ahead. How do you get these concurrent sequential writes? Well, you can write to the SSD with multiple threads and, up to a certain point, get more performance. So for us, as durability implementers, that's a very new thing, and it requires a little bit of re-architecture, yeah. What's up? In your experience, how much slower would it be on SSDs; did you have to measure it? I haven't really. I can't answer that with a number like, oh, it's 10x or 100x slower. It depends on a lot of things. And actually, SSDs are getting faster at random writes as well.
But I would say that it is true that if you can boil things down to sequential writes, you set yourself up very well. So I'll give you another example why. Sequential writes mean that you get better batching over the network, and if you're using network-attached SSDs like EBS, you still benefit from sequential writes. Sequential writes basically boil down to the computer science trade-off of batching. And if you can take something which is fundamentally random and boil it down into a batch-based thing, then you'll get better performance, or at least as good performance. That's a good question. Another reason it matters a little bit less is that non-volatile RAM is coming, and with non-volatile RAM it doesn't matter, yeah.

OK, so durability in the row store: first thing is, indexes are not materialized on disk. What's up? In the world of non-volatile RAM, random writes to SSDs matter less. OK. Yeah. So, durability in the row store. The first thing is, indexes are not materialized on disk. The reason is we never need to read indexes from disk, because the working set of data for the row store is in memory. And so we don't need to write them to disk, and we can actually recover them on the fly. It's not so much that we care about the fact that it would use more disk space; we care about the fact that logging becomes very, very cheap. So we only need to log primary key data on disk in a row store. So for example, we never partially manipulate a secondary index on disk and need to record the fact that we did so in a log, so that if that transaction doesn't commit, we can undo it. The working set of data is in memory, and to reconstruct the system-of-record state of the world, the truth, you only need the schema of a table and primary key data, because you can populate indexes based on that information. Now, to optimize recovery performance, we take full database snapshots periodically, and I have a picture which explains why that makes a difference. And the last thing is, durability is tunable to be synchronous or asynchronous.

There are some workloads where data loss means something different. For most workloads, when you think about data loss, you're thinking about the worst case. So in the worst case, I want to make sure that I don't lose any data, which means that I'm willing to penalize the common case to a certain extent to be able to ensure that in the worst case, I lose no data. And for system-of-record, kind of bank or cash register type workloads, that trade-off makes a lot of sense. How many people here have heard of Internet of Things? It's a nice buzzword. In Internet of Things workloads, where you're collecting a lot of data from a lot of sensors, there's a different kind of data loss. And that kind of data loss is: in the common case, am I actually able to collect enough data that I can make meaningful analytical, or rather, answer meaningful analytical questions on this data? The fidelity of an individual data point doesn't actually correlate with the correctness of the workload. It's how much data can they ingest, and how quickly can they ingest it, in the common case. In the rare event that a machine goes down, they're willing to accept the fact that they lost some percentage of their randomly sampled data and move on from there. The other reason is that this data tends to be windowed in time, and so by 24 hours from now, or 24 hours from when a machine fails, it just doesn't matter.
And in those workloads, it actually is more correct to have asynchronous durability than synchronous durability, from the user's perspective. What's your default? The default today is still asynchronous. No comment on whether or why that may change in the future. However, the kind of financial customers that are using MemSQL are actually running, for the most part, with synchronous durability, even in those Internet of Things type use cases.

So here's a picture of snapshots and logs. Here's a log of some data. You can see we insert, insert, delete, delete. If you compress this log into a snapshot, it's just one insert. And so the value of replaying a snapshot over replaying a log is, A, you can replay all records in the snapshot in parallel, because they're all inserts, and B, you remove things like this insert and delete, which at the end of this point in time don't need to exist. And so these snapshots basically make recovery a lot faster. And the recovery of log 1 plus log 2 is equal to the recovery of snapshot 1 plus log 2. So you have exactly the same end state once you recover either way.

So how does durability in the column store work? And to Andy's point, it's actually very straightforward. The metadata is in a lock-free skip list; we use the ordinary row store mechanism for it. The column store segments themselves are huge. They are several kilobytes or even megabytes, and they're read from and written to sequentially. So what that means is we can synchronously write them to disk. They're past that threshold where whether you do it synchronously or asynchronously actually matters. So it's cheap, in the column store sense, to always synchronously write these things to disk. And finally, the writes that need to be at memory speed are to the sidecar row store, which is just logged in the ordinary row store way. So the only additional piece of durability in the column store is very simple: as we write column store segments, we just synchronously write them to disk. What's up? Exactly. Yeah, the latency-throughput equation at that scale means that you don't gain a latency benefit. Yeah. Are fragments of the same table stored in both the row store and the column store? Different fragments of the same table. Yeah, that's exactly the point of the row store and column store. OK, that's the point. Yeah, yeah, yeah. So the idea is that recently hot or recently written rows live in the row store part. In fact, the row store and column store parts, or more generally, each sorted run of the column store, are mutually exclusive. So, yeah.

And how does crash recovery work? You replay the latest snapshot and then every log file since. There's no partially written state on disk, so you don't need to do any undoing. And the column store just replays metadata. So it populates a cache of these columnar segments on the fly as you read them, but to be available and online to serve queries, it merely needs to replay metadata. And replication is just continuous replay over the network; replication just extends naturally from recovery. What's up? So during the recovery, do you recover the multiple versions of data, or just a single version of the data? Well, that is this, right? We only need to recover a single version. There's no partially written state on disk. No, I think this question refers to the multi-versioning aspect of things. And I guess if you do recovery, you end up with different versions.
So log recovery is sequential and single-threaded. The log recovery is the only recovery where you would have two elements of the same file that you're recovering that could overlap, and it turns out that it's sequential. So you could implement it with garbage collection, but you also know that no one else is going to touch that row concurrently with you, so you don't really need to. Yeah, so for the log recovery, why is the log recovery sequential? Because I think the log recovery can be done in parallel, right? Log recovery can be implemented in parallel, but it's insanely complicated. The reason it's implemented sequentially is that you could have an insert and a delete, right, that touch the same row. And if you don't replay the insert and then the delete in that order, you might not end up with the right state. OK. Yeah. There are systems that do it. It turns out in MemSQL, because we actually store data in shards in the distributed system, we have mutually exclusive logs, and it turns out to be one mutually exclusive log per core. So there's not actually any real benefit, the way that people run MemSQL, to replaying a single log in parallel. No. OK. Yeah. And the indexes just get reconstructed during recovery anyway? That's why. Yes, I said that explicitly, even if you didn't see how I'm doing it. OK. So you said you have one persistent log per core. Does that mean you have one instance of MemSQL per core? No. So again, we're not going to have too much time to talk about it, but the way that distributed MemSQL ends up getting set up is that it sets up shards, and there are multiple shards per process. So MemSQL is running one process per machine, each one with multiple shards, and each shard has its own snapshot and log. So you are processing them, recovering them, in parallel? You are, but it's not the same log. Exactly. Yeah, yeah. So it's insanely complicated to replay one log in parallel. It is insanely simple to replay multiple logs in parallel. So they're done side by side? Yeah.

All right. Code generation. So here's a Python program. I don't know if you can read it. What it's doing is implementing select sum(a + 1) from a table. So the table is just this kind of row class, or struct, in Python, with values 1 to 10 million. And this is implemented in the push-based query execution model. Here's the same program using analogous paradigms in C++. And these slides are available online; I'll send a copy to Andy as well. You can scrutinize the crap out of these programs if you'd like. If you go and run them, here are the numbers. So it turns out that you get a 37x difference in execution. And even if you account for the compile time of the C++ program, it's 1.37 times faster to use C++ than it is to use Python. Now, this is not exactly the same ratio that you have with a database. The big difference is that even interpreted implementations of a database can make static assumptions about types, whereas Python can't. But this gives you an idea of some of the benefits of code generation over an interpreted execution model. About the C++ running time you've got up there: this is not good C++ code. This C++ code is meant to be simple C++ that uses as much dynamic allocation as this Python code does, on the surface.

So what does code generation give you? There are kind of two big categories that people look at. Most people, when they think about code generation, think about expression execution.
So I have some tree of expressions, which I'm executing over and over again. Let me compile this into a chunk of code which executes this expression, maybe over a single value or maybe over a batch of values. The thing that people think about less, but that actually matters a lot too, is inlining scans. So you can avoid this batching altogether. Batching introduces its own overhead, and by inlining scans you basically take the batched computations in a tree and push them onto the inside of a for loop. So that's an inlined scan.

Now, one of the things that really matters, and I know Andy covered this a little bit in the code generation lecture, is you need a powerful plan cache. And I would argue that even with fast code generation, you need a powerful plan cache. The reason is that plan caches basically let you trade off between dynamic and static overhead. So how strict your plan cache is, is kind of a function of how and where you want to draw the line between dynamic and static overhead. And this really, really matters in OLTP workloads versus kind of ad hoc data exploration workloads. So here's an example of two queries which map to the same plan. In MemSQL, basically what we do is we parameterize out values in some common patterns, which we'll look at. So here we convert an integer value into an at sign, and this string is what's called a plan cache key. So when we receive a query, we parameterize it, convert it into a parameterized string, and basically look up this parameterized string in a hash table. Here's a slightly more complex example. A naive parameterizer would actually parameterize each of these integers, so you'd get an IN list with five at signs. But in MemSQL, we actually recognize that IN lists are a special case and parameterize them like this. So we normalize away the size of an IN list, and that's because the size of an IN list tends to vary with the query-by-query demands of a workload.

Here's an example which you can't really parameterize well. This is a kind of canonical drill-down query, where here we're breaking down sales by region. And you want to kind of double-click on some dimension, and maybe freeze the region to be the Northeast and look at sales by sales rep. And here another person wants to freeze the region to be the Northwest and break down by product. And these generate two fundamentally different queries, which, yeah, sure, we can parameterize out the region, but the rest of it might actually have serious implications for how the query is optimized. For example, maybe we have an index on rep but no index on product. And so these are fundamentally different queries, and they don't parameterize to the same plan. And until recently, this was a big problem in MemSQL, because until recently, code generation was very, very slow in MemSQL.

So let's look at some generated code. Here's a very simple query: select concat of foo and bar. And this is based on the old code generation in MemSQL. So what you'd see is it takes about 0.81 seconds the first time, and this includes code generation. And then obviously the next time you run this very simple query, it's negligible. So a little snippet of the code looks like this. We have some variable for overflow, we set up two of what are called varchar temps, just some structs, we have a result value, and then we run this operator called op concat. This is really easy code to generate. Why is it easy to generate? Well, we make heavy use of templates, and in fact this thing is templated as well.
It does type inference via template instantiation, based on the types of all three of these values. So this code basically lets you, as a compiler implementer, separate concerns very easily. You have the types of your inputs, and you have an operator which handles all kinds of types, which can be implemented separately. And all you need to do is spit out simple code like this. But it's really slow to compile. What's up? So you're basically pasting those strings together, and then how would you automatically get those specific template instantiations? Yeah, so the way that templates work is you can have a generic implementation which makes assumptions about the input types. So say you have a template with class A and class B. The way that templates work in C++ is that if the compiler hasn't seen that exact set of types in a template, it literally kind of string-pastes a new function, which has a linker symbol that's some concatenation of the template name and the types, and generates code. Either that code compiles, because the types you use have all of the expected methods and behaviors, or it doesn't. And so that is sort of the generic thing you can do. So if every type has some method, then no, you don't need to specialize it for strings. But if you want to take advantage of some special behavior, or you want to convert through strings in some specific way, then yes, you do have these specializations in templates. And so the old MemSQL codegen did take advantage of specialization. Oh, do we end at 1:20 or 1:30? OK. So there's an old rule in compilers: you can kind of pick two of these three things. Fast execution time, fast compile time, and fast development time. So here are some examples, and I'll survey the class. So if you think about assembly, what are the two that you get? Yes, well-written assembly is definitely fast. And there's no compilation, right? Or minimal, maybe you convert text assembly to byte code. But who wants to write code in assembly? C++ is fast execution time and fast development time, but really slow compile time. I mean, really slow. Instantiating a template costs like 50 milliseconds in the compiler. And Python, again, fast compile time and fast development time, but slow execution time. JIT compilers kind of turn this on its head. So what a JIT lets you do is develop something quickly and compile it quickly, generate code that executes slowly, but over time identify the pieces of code that need to be compiled and actually do it in the background, compile them to be faster and faster and faster. So JIT compilation, kind of like a plan cache, lets you start with one extreme trade-off and quickly converge to the appropriate trade-off for a workload. And so here's a layout of what code generation looks like in MemSQL. We'll walk through the old code generation first. So you start with SQL text, you parse it, you generate an abstract syntax tree, you optimize it, you have an optimized relational tree, you plan it, then you have an execution plan. A system like Postgres stops right here. And anything that's a parallelogram here is executable. Now, MemSQL would take this execution plan, run it through C++ codegen, and convert it to a C++ plan. This is usually pull-oriented and this is push-oriented. So this conversion is not as straightforward as you'd think, but it's very, very doable. And we'd take the C++ code and run it through slow GCC and generate very, very fast machine code.
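Here is a rough sketch of that template-based style, with hypothetical names I made up rather than MemSQL's real operator library: the generated query code just calls a templated operator, and instantiation (or an explicit specialization) picks the implementation for whatever input types the codegen happened to emit.

```cpp
// Template-based operator sketch: one generic body, one string specialization.
#include <string>

template <class A, class B>
bool op_concat(std::string& result, const A& a, const B& b) {
    // Generic path: one new function body is instantiated per (A, B) pair,
    // assuming the inputs are convertible via std::to_string.
    result = std::to_string(a) + std::to_string(b);
    return true;
}

// Explicit specialization when both inputs are already strings: no conversion needed.
template <>
bool op_concat(std::string& result, const std::string& a, const std::string& b) {
    result = a + b;
    return true;
}

int main() {
    std::string out;
    op_concat(out, std::string("foo"), std::string("bar"));  // uses the specialization
    op_concat(out, 12, 3.5);                                  // instantiates the generic body
}
```

The convenience is exactly the separation of concerns described above; the cost is that every new (A, B) combination the codegen emits is another template instantiation the C++ compiler has to pay for at query compile time.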
So that's how it worked before. Now what we do is, instead of the C++ codegen, we introduce this thing called the AST builder. And it's not exactly like the printf functions that we had for C++ codegen, but it's pretty close. So it's built in a way that is very familiar for someone who writes C++, to produce an AST in this high-level language called MemSQL programming language. And I'll show you a snippet of it in a moment. But basically this thing is optimized very well for us as MemSQL engineers to go from a system which generated C++ to a system which generates some other programming language. And it has a number of the constructs that you would expect in a programming language, like if statements, for loops, et cetera. It also has some MemSQL-specific language constructs that make it very nice for codegen. We then take MemSQL programming language and emit a bytecode called MemSQL bytecode. And this thing can actually be executed, or interpreted, and it's faster than this, but slower than this. So we could have stopped here if that was our goal. But what we do, and I'll explain this in a little bit more depth, is we actually implement each of these opcodes in C++ and, at server compile time, convert this C++ into LLVM bitcode. And so at query compile time, we can take MemSQL bytecode and, for each opcode, we have LLVM IR which implements the opcode, and we stitch it together very, very efficiently into LLVM bitcode. So going from here to here is less than a millisecond. It's extremely fast. Sometimes, if the query is very complicated, the optimizer is a little bit slower than that, but this is very, very fast to get to LLVM bitcode. Sorry, what's the one millisecond thing? From SQL all the way to here, yeah. Yes, yeah. Yeah, and that's the point of a plan cache. The plan cache will tell you whether you've compiled this kind of thing before or you haven't. Yeah, it's the same. There's no rule like, if it's a stored procedure on a Tuesday then we'll compile it, and if it's an ad hoc query on a Wednesday then we'll interpret it. Any query that you run, any statement, prepared or not prepared, in the middle of a transaction, ad hoc, sent by Tableau, sent over the command line, it just doesn't matter. We take SQL and we go through this process every single time. New access methods, like if you're basically creating an index? Of course, yeah, yeah. Do you drop everything, or do you drop like a subset? So what we do have to invalidate, of course, is the plan. We're gonna look at that in a second, yeah. And the last part is we take this LLVM bitcode and convert it to machine code. So this is still expensive. It's not cheap. LLVM is making a lot of very significant optimizations. The biggest one is making inlining decisions. And there are other things like register allocation, converting infinite virtual registers to finite physical registers, that kind of stuff. So it depends on the complexity of the query. What we found is that it is roughly linear in the amount of LLVM bitcode that we generate. So simple queries generate in less than 50 milliseconds. More complex queries can take 200 milliseconds, and outrageously complex queries can take like a second. Yeah. Oh yeah, yeah, yeah. So our design goal is to make code generation as fast as we can without sacrificing query performance. Because remember, we're a real system that real customers are using and expecting to be performant, right?
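To give a feel for what a high-level, fully sequential bytecode like that might look like, here is a tiny sketch (my own assumption of the shape, not MemSQL bytecode): a flat instruction stream with high-level opcodes and a switch-based interpreter. The trick described above is that the body of each opcode can also be pre-compiled to LLVM IR at server build time, so query compilation becomes stitching those IR snippets together rather than interpreting the stream.

```cpp
// Toy high-level bytecode plus interpreter, roughly for: SELECT CONCAT('foo', 'bar')
#include <string>
#include <vector>

enum class OpCode { LoadConst, Concat, Return };

struct Instr {
    OpCode op;
    int dst, src1, src2;   // register (slot) indices
    std::string constant;  // used by LoadConst
};

std::string run(const std::vector<Instr>& program) {
    std::vector<std::string> regs(16);
    for (const Instr& in : program) {
        switch (in.op) {
            case OpCode::LoadConst: regs[in.dst] = in.constant; break;
            case OpCode::Concat:    regs[in.dst] = regs[in.src1] + regs[in.src2]; break;
            case OpCode::Return:    return regs[in.src1];
        }
    }
    return {};
}

int main() {
    std::vector<Instr> prog = {
        {OpCode::LoadConst, 0, 0, 0, "foo"},
        {OpCode::LoadConst, 1, 0, 0, "bar"},
        {OpCode::Concat,    2, 0, 1, ""},
        {OpCode::Return,    0, 2, 0, ""},
    };
    run(prog);
}
```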
So if you're Acme Corporation and you're going from MemSQL version four to MemSQL version five, your queries cannot get slower. That's not at all an option, right? So we need to make sure that we are optimizing and generating code that runs queries as fast as before. Now, of course, we will make this faster and faster and faster over time, but our baseline is that we cannot regress execution performance. So let's look at some code. Here's that same select concat of foo and bar thing. Before someone bikesheds me about the time: I didn't update the time on the slide, so it's not really 0.81 seconds, I just copy-pasted the number from before. Here's a snippet of the MemSQL programming language code that we generate. So it looks kind of like the C++ code. You can see that there are no templates. Concat only operates on strings, and conversions are explicit in MPL. In this case, there are no conversions necessary because this already operates on strings. And it looks kind of similar-ish to a procedural language like C++. Now, this is too small to read, but it just gives you an idea of how much MemSQL bytecode is generated for this precise snippet of code. And if we zoom in a little bit, this is what the bytecode looks like. So you see operators like concat, and this is a very high-level bytecode, right? So concat is an opcode which takes two variably allocated strings as arguments and populates this destination variably allocated string. There's no assembly language that operates on variably allocated strings, but our bytecode is pretty high-level. So, as a compiler implementer, you want at least one stage of the IR that you build to be the highest-level, fully sequential, bytecode-like thing that you can generate, because going from this high-level bytecode to lower and lower and lower levels of bytecode is pretty easy. Going from a tree to bytecode is pretty hard. So you want to make that task, as a compiler implementer, as easy as possible by having a very high-level bytecode. And if we look at the implementation of concat, it's very simple. So this is C++ code, and like I said, at server compile time this is compiled into LLVM that can be stitched at query compile time. And it's just the concat operator. It returns a bool, which is success or failure. It takes a deep string, which is the result, and string bases, which could be deep or shallow, it doesn't care, as arguments, plus a thread ID. And it's pretty simple C++ code. So this doesn't look unlike the C++ code that we were generating before. So I want to stress that we had a few design goals with LLVM code generation. Some of them are obvious, some of them are less obvious. The obvious one is we want very fast code generation. And then you brought up a very good point. Another slightly less obvious one is that we are a commercial system that's in production. We expect customers to upgrade to this version of MemSQL, which they are right now. We shipped about a month ago. So we cannot regress execution performance. Customers buy MemSQL because of execution performance. What's up? Wouldn't it be better to interpret the query the first few times and then do the compilation outside the critical path, right? Yes. Why did you not consider that? We did consider it. There are a few reasons, right? One is it's strictly more complex to build. So it's something that we may do in the future.
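As a hedged sketch of what an opcode implementation of that shape might look like (DeepString, StringBase, and OpConcat here are hypothetical stand-ins I invented, not the real MemSQL classes): no templates, strings only, a bool for success or failure, and a thread ID that a real engine would use for a per-thread allocator.

```cpp
// Sketch of a string-only concat opcode implementation compiled once at server build time.
#include <cstdlib>
#include <cstring>

struct StringBase {        // may point at owned ("deep") or borrowed ("shallow") memory
    char* data = nullptr;
    size_t len = 0;
};

struct DeepString : StringBase {
    bool reserve(size_t n, int threadId) {
        (void)threadId;    // a real engine would allocate from a per-thread pool here
        data = static_cast<char*>(std::malloc(n));  // caller owns and frees this buffer
        return data != nullptr;
    }
};

bool OpConcat(DeepString& result, const StringBase& a, const StringBase& b, int threadId) {
    if (!result.reserve(a.len + b.len, threadId)) return false;  // report failure, don't throw
    std::memcpy(result.data, a.data, a.len);
    std::memcpy(result.data + a.len, b.data, b.len);
    result.len = a.len + b.len;
    return true;
}
```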
Or you can kind of think of it as strictly more general than what we have today. Another con is that the first time you execute a query, it may be slower than what a customer expects. So remember, when we showed the C++ codegen example, even with compilation time it was 1.37 times faster. So it requires a significant amount of research to make sure that you're only doing that for queries where the compilation time is expensive enough that compiling plus executing actually takes longer than just interpreting it the first time. And it turns out, if you run a little bit of experimentation with the current implementation, that for a lot of queries that's not the case. So it's not as simple as, oh, let's just try running it three times and then compiling in the background. You actually need to really, really make sure you're not regressing execution performance by doing that. What about designing the primitives in the MemSQL programming language? So the primitives were, well, let me answer the last question, the last sort of non-obvious design consideration, which kind of overlaps with that question. The last thing is that we had about eight months to go from a system in production that has C++ codegen to a system in production that has this LLVM codegen. And so we wanted to make it very, very straightforward for our engineering team to go from one state to the other state. And so a lot of the constructs in MemSQL programming language resemble the constructs and abstractions that we used inside of our code base to generate C++ code. So, for example, some of the things that we used heavily in C++ were things like if statements and for loops, but we didn't use do-while loops that much. So do-while loops don't exist in MPL, because we never used that feature in C++. And another kind of cool thing is we instantiate templates at server compile time in MemSQL, which generate conversions from one type to another. But the tree of conversions is generated at server compile time, and in MPL it's resolved as an explicit composition of conversion function calls. So that lets us, as MemSQL engineers, reason in terms of, I want to start at this type and end up at that type, which is kind of what templates are like. But under the hood in MPL, it actually generates an explicit set of conversions. You basically want this so you can do some simple analysis on it and compile it ahead of time? I mean, you're still compiling a lot of stuff, right? Yeah. So there are two reasons, yeah. So first is, most of the C++-like things that we were doing, we could do at server compile time. And with MPL and the compiler built into the engine, we can actually compile away a lot of that stuff at server compile time. Now, for me it kind of sucks, because the server takes a little bit longer to compile on my laptop than it used to. But that's fine, because for users that means they're not running that stuff every time they compile a new query. The second reason is that, I guess there are three reasons, the second reason is that we weren't using all of the features in C++, and in particular, we didn't need exactly the semantics that templates give you. So templates were a large, large percentage of the cost in C++ codegen. And there are sort of inexplicable costs associated with compiling C++, things that we're paying for with C++ whose semantics we don't really need in MPL.
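Here is a hypothetical illustration (names invented, assuming a date-to-varchar conversion as the example) of what "explicit composition of conversion function calls" means: the conversion chain is worked out once, ahead of query compile time, and the emitted code just contains the composed calls, with no templates and no implicit conversions.

```cpp
// Conversion chain chosen up front: Date -> DateTime -> string, emitted as explicit calls.
#include <string>

struct Date     { int days; };
struct DateTime { long long micros; };

DateTime convertDateToDateTime(Date d) {
    return {d.days * 86'400'000'000LL};              // days to microseconds
}

std::string convertDateTimeToString(DateTime t) {
    return std::to_string(t.micros);                 // simplified formatting
}

// What the generated code ends up containing: just the composed calls.
std::string dateToVarchar(Date d) {
    return convertDateTimeToString(convertDateToDateTime(d));
}
```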
And the last thing is that we kind of started by trying to generate just LLVM. And then we were like, oh, we need a higher-level IR on top of this. So we started by trying to do that, and then we started implementing if statements and loops and stuff inside of this bytecode and said, okay, well, it would actually be nicer to have a higher-level abstraction even on top of that, which gives you these basic reusable constructs that we use. And so it's sort of the minimum viable programming language that you can efficiently write code generation in. We do. I mean, is it important to go into that? Nope, yeah, yeah. So we take a tree, we convert it to MPL. MPL is converted to this much bytecode, and then this much bytecode is converted to this much LLVM, and then we run it. So how are you dealing with the instruction cache? Like, do you have issues with the size of the code footprint? Yeah, yeah, yeah. So I mean, we have a lot of heuristics about that, and that comes down to the inlining thing, right? Inlining is one of the, if not the, hardest problems for compilers. So LLVM has some great stuff. LLVM is tunable to have even better stuff. And we even further make explicit decisions about when to inline and when not to inline things. So that's another world of optimization. So there's kind of a fourth reason that this architecture makes sense, which is that the code that we generate is very much a reflection of how well MemSQL will execute. And instead of using the black box of GCC, there are a lot of places, and we have a lot of compiler expertise at MemSQL, there are a lot of places where we can more explicitly tune or change how we're generating code, and control how we're generating code, to get much better results. So one of the results in MemSQL 5, there are some architectural changes that enable this as well, but queries are also significantly faster to execute. In a lot of cases, especially on the column store, queries are something like 10 times faster to execute on average. Yet another benefit is that it really, really enables us to efficiently make more and more optimizations to the code we generate. So one of the themes in building a database that you find, especially with commercial databases, is that you bring more and more of the world into the database over time. More and more things that the operating system does, the database does. More and more things that the compiler does, the database does. And a lot of what databases do is sort of vacuum away abstractions that programmers tend to rely on and rebuild them under the assumptions of the database. What's up? When you're doing these ops, are you generating already-vectorized code for them, or are these ops not vectorized? So inline scans enable you to not have to vectorize operators, right? So there is vectorized execution over rows, but because we have inline scans, we don't need to make these operators accept batches of data, okay? No, I'm talking about SIMD, not... Yeah, yeah, so LLVM can actually take the IR and auto-vectorize stuff like this. Yeah, yeah, yeah. You're relying on LLVM? So again, to some degree, you rely on LLVM, and LLVM is not a silver bullet, so you also need to tell LLVM how and where to do these things, right? So you can do things like, I want this chunk to be SIMD-able, and give me a compiler error if it doesn't match the constraints that you need to make this SIMD.
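A small sketch of the inline-scan idea discussed above (simplified, my own example): instead of an expression operator that takes a batch of values, the generated code puts the expression right inside the scan's for loop, which also gives the compiler one tight loop it can inline into and auto-vectorize.

```cpp
// Batched expression evaluation versus an inline scan for SELECT SUM(a + 1).
#include <cstdint>
#include <vector>

// Batched style: each operator works on a whole vector of values at a time.
void addOneBatch(const std::vector<int64_t>& in, std::vector<int64_t>& out) {
    out.resize(in.size());
    for (size_t i = 0; i < in.size(); ++i) out[i] = in[i] + 1;
}

// Inline-scan style: the projection and aggregation live inside the scan loop itself.
int64_t sumAPlusOneInline(const std::vector<int64_t>& column) {
    int64_t sum = 0;
    for (int64_t a : column)
        sum += a + 1;   // expression inlined into the scan; no intermediate batch
    return sum;
}
```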
And furthermore, I want to take this thing and compile the SIMD version of it, and give me an error at server compile time if you can't. So all of this stuff is very, very pluggable and not, I wouldn't say easy or straightforward to do, but very, very doable. No, operators are implemented in C++. Well, the operation was... Yes. Okay, so does anyone actually write the MemSQL programming language, or is that just an intermediate thing? So no one writes textual MemSQL programming language. People write a C++ domain-specific language which generates an abstract syntax tree that can be pretty-printed to MemSQL programming language. So you write kind of like a fancy, type-safe MemSQL programming language, which can actually do a lot of type verification and catch things like syntax errors, which you'd otherwise pay for in C++ codegen. You don't need to waste any time checking for syntax errors at query compile time, or binding errors. Like, we know that every variable is bound by the time that it's used. We know in a lot of cases we can do type inference at compile time rather than at runtime. And like I said, that makes the server compile a little bit slower, but it's worth it. We're actually doing some interesting work on that front as well, which I can talk about afterwards. So we don't have a lot of time to talk about distributed query execution. I can really briefly go over this, or just answer questions. What would you prefer? All right, let's do it. Yeah, feel free to leave if you have a constraint, and I'll keep talking until we get kicked out. So. I don't think we're gonna calculate from clock. Can I go somewhere? I have a class in like an hour. Oh, I'll be done way before. It's not gonna take an hour. Five, maybe five more minutes. Yeah, sure. So when you look at distributed query execution in a textbook, this is usually the diagram, or the kind of diagram, that you look at. Now, when we first started building distributed query execution, we didn't really know what we were doing, and this is pretty scary. So we were like, okay, let's throw this away, let's think about it a little more simply. So in MemSQL, there are kind of two tiers of computation. And these are logical tiers. They can also be physical tiers, but for the sake of this discussion, think about them logically. There are aggregators, which are responsible for coordinating distributed queries, and there are leaves, which are responsible for storing and computing over slices of data. So you send a query to an aggregator. It figures out what it needs to do across the cluster, computes across the cluster, and the leaves return results back to the aggregator or talk to the other leaves. We'll talk about both cases. The aggregator munges together the results and sends them back to the client. So this is the generic flow. Input SQL goes into the aggregator. It gets converted to some partial SQL-like thing, which the leaves execute. It's actually some kind of superset of SQL. The leaves send partial results back to the aggregator. The aggregator combines them and then sends final results back to the client. Now, sometimes leaves also need to talk to each other. And so there's also this kind of intermediate step of data shuffling. Is that done through SQL over the network? Yes, and we'll see how. But it's not normal SQL. So it's much easier to reason in terms of shipping SQL. We get a lot of flak about this. Why are you sending SQL around? That sucks. SQL is just a serialization of an operator tree.
And if there's something in an operator tree which you can't express in SQL, you just change your internal SQL to support it and deserialize it on the other side. Another sort of similar argument is, oh, well, doesn't that incur a lot of cost because you need to reparse the SQL? It turns out that the plan cache actually helps you a lot there. So it doesn't really matter that much. So this is the implementation of a very simple query, select count(*). So you send the query. MemSQL is MySQL compatible over the wire. So you use the MySQL client library, client program, or any tool that supports MySQL to send queries. Here's select count(*). You send it to the aggregator. The aggregator says, well, what do I need from the leaves? I need select count(*) over the slices of data that they own. So it sends select count(*) to the leaves. Each leaf locally executes select count(*), and the union of those results looks something like this, where there's a partition, which is a slice of data, and a count. And the aggregator generates a SQL-like plan which is just select sum of count from these leaf counts. And that's the final result that it sends back to the client. So the select count(*) goes through the distributed optimizer, each leaf is going to compile its part with LLVM, and then the sum over here, that's LLVM as well? Yeah, okay. Yeah, so these guys are generating sort of mutually exclusive code. That's over here. And it can all be done in parallel, and there are all kinds of goodies and optimizations for making that part faster too. So here's a slightly more complex query. It's hard to read, and I don't expect you to actually read this stuff. The point is that here we send this complex query to the distributed optimizer, which tells the leaves to do two things. One is, hey, you're gonna have to reshuffle some data, here's how you're gonna need to reshuffle it, start figuring that out. And, hey, you're gonna need to pull the results of some reshuffled data, here's how you need to pull it, start figuring that out. The leaves kind of do both of those things. Once the second thing finishes, you can merge results back on the aggregator and send them to the client. And then here's an even more complex query, where it's so big that you can't even fill in these individual spots, but you can get the idea that this becomes more and more and more complex. And Andy and I were talking about this earlier. A lot of the queries that we run in MemSQL are like 100 times as long as what you see here. And these diagrams are 100 times as big. So there's a lot of stuff that we do with extremely complex SQL. So what are the abstractions, what is the vocabulary, that we use at MemSQL in distributed query execution? So there's a distributed query plan that's created on the aggregator. And what it does is it glues together layers of primitive operations. So there's full SQL available on each leaf. So you can tell a leaf to do something like, hey, select count(*) over your chunk of data, or run this group by, this order by, this join over the data that you have locally. And there are also these two abstractions that we added. They're not real SQL. They're kind of operator-tree-serialization SQL, and they are the primitives for moving data around. So remote tables are basically how a query pulls data that's reshuffled. And result tables are sort of how you pre-compute and push data on the other side. And we'll talk about both of those. So, really quickly, the SQL primitives are pretty simple.
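As a toy sketch of that count(*) flow (assumed shapes, not the real MemSQL networking code): the aggregator fans the same pushed-down query out to every partition, gets back one count per partition, and sums them for the client.

```cpp
// Scatter-gather sketch for SELECT COUNT(*): fan out to leaves, sum the partial counts.
#include <cstdint>
#include <string>
#include <vector>

struct Leaf {
    int partition;
    int64_t runSql(const std::string& leafSql) {
        // In reality this executes the pushed-down SQL over the local shard;
        // here we just pretend the shard owns some rows.
        (void)leafSql;
        return 1000 + partition;
    }
};

int64_t selectCountStar(std::vector<Leaf>& leaves) {
    std::vector<int64_t> partialCounts;
    for (Leaf& leaf : leaves)                                // fan out to each leaf
        partialCounts.push_back(leaf.runSql("SELECT COUNT(*) FROM t_partition"));
    int64_t total = 0;
    for (int64_t c : partialCounts) total += c;              // SELECT SUM(count) on the aggregator
    return total;
}
```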
They query over physical indexes. You can hook into global transactional state. In general, each partition of data has full SQL, so you can do all kinds of cool stuff. It gives you access to row stores and column stores. And this is an example of that kind of query. So here, select t.a, t.b, sum of t.price from t. This t refers to the local, physical table on the leaf. If there's an index on c, you'll be able to use a local index to seek here. And the traffic that you send back to the aggregator is one row per group, not one row per row in t. So a lot of systems don't push down anything. They'll just pull all of t to a coordinator node and execute this query. More systems will, or sorry, fewer systems will push down filters like this. And even fewer systems will push down groups. But the fact that we can push down full SQL means that we have a lot of flexibility in exactly which computations we push down to the leaves and which ones we reserve to do on the aggregator. If you, I mean, so you run the query optimizer on the aggregator, and then you generate, you know, SQL and send it to the leaves. Do they also do optimization as well? So we can actually control what optimization we want them to do. There are places where we don't want them to do optimization: for example, there's some global optimization that we do which determines how we reshuffle data. And based on how we reshuffle data, we often know what order we want the join to occur in. So for those places, we can plug hints into this SQL tree which basically tell the optimizer, don't touch the subtree, let it be like this. And for other places, for example, if this were where t.c equals 1000 and t.b equals something else, actually deciding which index it should use, based on the cardinalities of that particular index and those parameters, we leave to the leaves. And actually, the leaves in that kind of case can generate plans which make that decision at runtime based on the values of those parameters. I didn't talk about that here, but basically we use the upper towers of the skip list for cheap estimations. So remote tables are pretty cool. They let you address data across leaves. And it's basically a SQL interface plus a custom shard key. We'll look at an example in a moment. But remote tables let you express, not performantly, but express, all of the parallel execution primitives that matter in the distributed system. The big one is reshuffling, but there's also stuff like merging data on group keys. And left joins, I don't know how much you guys talked about distributed left joins, but they are a complete nightmare. So remote tables give you some primitives you can use there as well. So here's an example of a query that joins a table t and a table s together. And here we're joining t.a on s.b, and we cannot make the assumption that all values of t.a and s.b that match are on the same node. Now, sometimes you can. If they're the same type and they're sharded the same way, then yes, you can make that assumption. But in this case we can't. And so what we do is, instead of referring to s, which would be the local version of s, we refer to this remote version of s. And a naive implementation of this would pull all of s onto this node and perform this join, which would give you correct results but be excruciatingly slow. But again, I'm showing this to you for the purpose of expressibility. This is not how it's implemented in MemSQL. Here's another one, which may be a little bit faster.
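Here is a rough sketch (my own assumption, not MemSQL internals) of why pushing the GROUP BY down matters: each leaf returns one partial sum per (a, b) group, and the aggregator only has to merge partial sums, instead of pulling every row of t over the network.

```cpp
// Merging per-leaf partial aggregates for: SELECT t.a, t.b, SUM(t.price) FROM t GROUP BY t.a, t.b
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using GroupKey = std::pair<int64_t, int64_t>;      // (t.a, t.b)
using PartialResult = std::map<GroupKey, double>;  // group -> partial SUM(t.price)

PartialResult mergeLeafResults(const std::vector<PartialResult>& perLeaf) {
    PartialResult merged;
    for (const PartialResult& leaf : perLeaf)
        for (const auto& [key, partialSum] : leaf)
            merged[key] += partialSum;             // sums of per-leaf sums are still sums
    return merged;
}
```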
So here we know that the assumptions we just stated aren't true, but t is sharded on a. So you know that if you dynamically shuffle s to be sharded on b, and b's type is the same as t.a's, then each partition of t only needs to pull one hash bucket of values from s.b, in particular the same hash bucket of values that you have for t.a. So here you can also express shard keys in remote tables. Now, if you implement remote tables in the naive way, every shard talks to every shard and reruns this computation. So you get something like quadratic performance for something that should be linear. And that is obviously not good. So there's another implementation called result tables, which allows you to figure out how you want to share and cache the results of these pull-based remote tables. One way of thinking about it is that remote and result tables together let you express these distributed computations as pull-based operations, but under the hood they are implemented as push-based operations. So result tables are shared, cached results of SQL queries. And basically what this lets you do is share the result of a scan and computation across readers. They also support streaming semantics. So result tables are kind of similar to temporary tables, but you can say things like, I know I'm only gonna read from this so many times, and I only need you to keep around a thousand values. And again, technically this is an optimization. So you can express all of the primitives that you need in terms of remote tables, but they're prohibitively slow without result tables. And finally, a result table and an RDD in Spark are kind of the same thing. It's this read-only temporary table that you might use some number of times, so it may or may not need to be materialized. And here's a simple example of a result table. So this is: create a reshuffled version of t as the result of this query and shard it by this thing. And then, if the shard key of a remote table and the shard key of a result table match, then you only need to read a small subset of pre-segregated data that's ready to go. And finally, query optimization is a huge part of MemSQL. So there are all kinds of single-machine optimizations, which, to your question earlier, a lot of them we push down to the leaves. So we let the leaves figure out which local index makes the most sense to use for a seek, and how exactly they want to sort and group data. We also have SQL-to-SQL rewrites, which are kind of the most important building block for analytical SQL. We have a cost-based distributed optimizer. Cost-based distributed optimization is kind of the new, very, very hard and interesting problem in query optimizers. Commercial implementations are constantly evolving with cost-based distributed optimization. And the big decision that you make is, do I broadcast, which is sending some result to every node, or reshuffle, which is sending each node some subset? And there are some cases where you actually get to make a trade-off, and there are many, many more. So query optimizers are nowhere near a solved problem, and there's a lot of active work. We've found a very high correlation between people who love math and people who love programming languages, like CMU-style PL, and people who really, really like working on query optimizers. So the people at MemSQL who work on query optimizers are kind of a combo of PL people and math people. I think we're out of time for this, but we covered a lot of cases. So I can answer some more questions or talk about this offline.
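As a simplified sketch (assumed, not MemSQL internals) of the reshuffle that a result table sharded on s.b expresses: every row of s is routed to the hash bucket that its shard key falls into, so each partition of t only joins against the one bucket that can possibly contain matching t.a values.

```cpp
// Hash-partition reshuffle of s on column b into numPartitions buckets.
#include <cstdint>
#include <functional>
#include <vector>

struct SRow { int64_t b; /* other columns omitted */ };

std::vector<std::vector<SRow>> reshuffleOnB(const std::vector<SRow>& s, size_t numPartitions) {
    std::vector<std::vector<SRow>> buckets(numPartitions);
    for (const SRow& row : s) {
        size_t target = std::hash<int64_t>{}(row.b) % numPartitions;  // shard key = b
        buckets[target].push_back(row);   // in a real system, pushed to the leaf owning that bucket
    }
    return buckets;
}
```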
I'll go to the last slide, there are some logos. So this is not bullshit. And here are some takeaways. There you go, oops. And you love Waterloo? I do love Waterloo. Sorry, I forgot to update that. The last time I gave this talk was at Waterloo. We're out of time, so we're just gonna cut it off now. And then, there we go. I'll say one or two quick things. One, it's been awesome having you guys in the class. We still have the exam on Wednesday and the final presentations, but it's been a blast. And then the other thing I wanna do is, everyone should be thanking Buck for doing an awesome job videoing everything. We're gonna thank him. I have Buck, you've got your disability for predatory, so you can go on and be predatory. All right? All right, see you on Wednesday, thank you.