All right, guys, let's get started. So we're super excited today. We have a guest lecture, Dr. Yixing Zhang from VoltDB. He holds a PhD in computer science from the University of Houston, and he's currently the lead engineer at VoltDB. They're also giving a research talk later today at 4:30 in Gates, so if you can't get enough VoltDB, you can come to that. Full disclosure: when I was in grad school doing my PhD, I worked on a system called H-Store, which he's going to talk about today. Around 2008 they took my code, forked it, removed all the profanity from my comments, and that became VoltDB. So that's the background, and we're super excited to have him talk about that effort. How did they remove just your comments? Not only that, they scrubbed them from the Git history as well, because if you don't... yeah, I knew about that. OK, so on the way out, Tenu and Lin will be handing out coupons. Make sure you pick one up, and you can only get one. OK? All right, go for it. Thank you. All right, thank you, Andy. Hello, everyone. It's a great pleasure to be at CMU and see all of you here in Andy's database course. A little bit about myself: my name is Ethan. I'm a software engineer at VoltDB, where I lead a small agile development team working on VoltDB's query execution engine. On the screen is my contact information: my email, LinkedIn, WeChat, whatever you want. Feel free to contact me with any comments, questions, or feedback, or even your resume; I'm also hiring interns for Summer 2019. Here as well is my colleague, Ruth Morganstein. She's the VP of Engineering at VoltDB, and I bet she also has many interesting stories to share, especially if you want to talk about career topics like women in engineering, which I'm definitely not qualified to talk about. All right, so today all my stories are about H-Store. The H-Store project is a collaboration among researchers from several universities, like MIT, Brown, and CMU. It set out to build a specialized database for fast transactions, and H-Store was commercialized as VoltDB in 2009. Since then, both systems have evolved a lot in their respective realms, namely academia and industry. So today we will start from academia. I will talk about the history, what motivated VoltDB, and then some design decisions and how we made them. Then I will give you an architectural overview of the original system that we built. Following that, we'll set foot in the industrial world, and I'm going to talk about some of the things we had to change after we started shipping VoltDB to real customers. In the final part of my talk, I'm going to come back to academia again, and we'll talk very briefly about some of the more recent research work that followed the H-Store project after H-Store and VoltDB diverged. So let's get started. If you've followed Michael Stonebraker's history, one of his core ideas is to build specialized systems to solve specific problems. For instance, if you want to be really good at OLAP, you want to run long-running queries, read-only queries, complex joins, or exploratory queries like OLAP cubes, where you drill down and roll up and slice and dice, then the systems you should be looking at are systems like C-Store and the Vertica project.
Vertica was this idea: if I build a system with some special designs, like using a column store rather than a row store, using data compression mechanisms, and so on, just for doing analytics, not as a general-purpose system, can I run 10 times or 50 times faster? The answer was yes, and that project turned out to be an academic and a commercial success. So a bunch of people then turned their heads to the other half of the question: now that we've solved analytics, can we do better transaction processing? One of the specific questions asked was: what kind of system can I build if your data fits in memory? You may be suspicious about this, because living in a world with giant companies like Google and Facebook may be a reality distortion, but in reality a lot of data sets do fit in memory. For example, TPC-C is one of the benchmarks we use for OLTP workloads. It simulates a warehouse order-processing scenario, and the data in TPC-C scales with the warehouses; each warehouse has only about 100 megabytes of data. If you own a business and multiply that by 1,000, so you have 1,000 such warehouses, that only comes to 100 gigabytes, which can still easily fit into main memory in a cluster environment. And VoltDB now has customers that run with data at the scale of terabytes. So a lot of data sets can fit in memory. With that, how do you figure out the problem? What they did was take a legacy system, one that was open source, put it on a RAM disk, and measure where it spent its time. You know, when you run a disk-oriented system on RAM, it definitely gets faster, but it won't run 50 times faster, right? If you take MySQL and give it a huge memory cache, that doesn't make MySQL an in-memory database out of the box; it won't run as fast as the RAM would allow. So let's take a look at where a legacy system spends its time. One of the things taking a significant amount of time is the buffer manager. That problem is easy to solve, because we assume your data fits in memory and you don't have to use a disk to store it: get rid of the disk storage, store all your data in memory, and then you don't have a buffer pool and don't need a buffer manager. So that problem is solved; you cut it off. The next big problematic issue is dealing with concurrency. There are two aspects to that. The first is shared memory access: you need concurrent data structures, like concurrent B-trees, which incur a lot of overhead with all those mutexes and locks. The other aspect is the logical locks for transaction management. As we covered in the course, you have row-level locking or table-level locking, and various data structures that have latches. That's also overhead, and it's implemented in user space, not at the kernel level. Combined, these are the things that prevent the system from being, say, 50 times faster. So how do we solve that? Can we get rid of them? Before moving to that, let me emphasize one of the key characteristics of OLTP workloads, which I think you also saw in this course: OLTP transactions are short-lived. They are pretty short. Another example from TPC-C: the heaviest transaction in TPC-C reads and writes about 200 records.
And it can be done in less than a millisecond. For such short transactions, you can actually have the system run all the queries within a transaction sequentially, from beginning to end. And in those scenarios, more often than not, the CPU is not the bottleneck; many times the network becomes the bottleneck. So how do we solve this problem? The answer is that we came to the same conclusion that the Node.js people came to: run single-threaded. A lot of people do that; Redis does that too. It's actually astounding to see what a modern CPU can do if you build a special-purpose single-threaded system. Running a single-threaded system also has its own benefits: it's very easy to write. Everyone starts learning to program by writing single-threaded programs; you don't start by learning to write multi-threaded programs. And when it comes to deployment, it's easy to deploy and cost-effective. You can deploy this kind of software on a low-end machine; even a single-core machine can run it. Andy: to put this in perspective for the class, he's basically saying throw away project one, project two, project three. That's what he's doing. Don't do it, you don't have to do it. All right, yeah, that's a good way to see it. So with that, you might think we've found a way out of the concurrency problem. But not yet. There are two remaining issues. The first issue is that waiting on users leaves the CPU idle. What does that mean? If you support Hibernate, or what we call external transaction control, you tell the system: now I'll start a transaction, here are the queries I need to run; run those queries, give me the results, and I will tell you whether to commit this transaction or abort it. That's external transaction control. If you do things like that, you cannot really run single-threaded, because you end up waiting on users to tell you whether to commit or roll back. Right? You spend a lot of network round-trip time, and the performance cannot be good; you cannot run 50 times faster. The other issue is that multi-core is the future; I think even your cell phone has multiple cores now. And if you have a system like VoltDB that's transactional and does distributed transactions, you don't want that system to just run single-threaded; you need to do a bit more to utilize all the cores. So how do we solve that? Here, let me remind you of the other characteristic of OLTP transactions; this also comes from Andy's course, you probably saw it in lecture. Unlike OLAP, where queries are ad hoc, you have an analyst sitting in front of a terminal, data in the data warehouse, and you run ad hoc queries to explore the warehouse and discover interesting trends, OLTP queries follow a pattern. They are repetitive. For example, if I have a transaction that transfers some money from one bank account to another, the queries that need to be executed are fixed; they don't change. The order of query execution is also fixed: you have the logical flow, if this, then execute that; if that, then execute something else. So we don't really need external transaction control, because you know everything in advance. You can turn this process of telling the system to start a transaction, run queries, and commit or abort into what we call a pre-compiled stored procedure that contains all the flow-control code and parameterized SQL, because what varies from transaction to transaction is just the parameters. If you transfer an amount of money from one bank account to another, you need to know the account numbers and the amount you're going to transfer. That's the only thing that changes from transaction to transaction: the parameters.
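To make that concrete, here is a minimal sketch of what such a pre-compiled stored procedure can look like. It loosely follows VoltDB's Java procedure API (VoltProcedure, SQLStmt, voltQueueSQL, voltExecuteSQL); the table and column names are invented for illustration, not taken from the talk.

```java
import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;
import org.voltdb.VoltTable;

// Hypothetical transfer procedure: the SQL and the control flow are fixed at
// compile time; only the parameters vary from transaction to transaction.
public class TransferMoney extends VoltProcedure {
    public final SQLStmt debit =
        new SQLStmt("UPDATE accounts SET balance = balance - ? WHERE account_id = ?;");
    public final SQLStmt credit =
        new SQLStmt("UPDATE accounts SET balance = balance + ? WHERE account_id = ?;");
    public final SQLStmt check =
        new SQLStmt("SELECT balance FROM accounts WHERE account_id = ?;");

    public VoltTable[] run(long fromAcct, long toAcct, double amount) {
        voltQueueSQL(check, fromAcct);
        VoltTable balance = voltExecuteSQL()[0];
        if (!balance.advanceRow() || balance.getDouble(0) < amount) {
            // Aborting rolls back everything the procedure has done so far.
            throw new VoltAbortException("insufficient funds");
        }
        voltQueueSQL(debit, amount, fromAcct);
        voltQueueSQL(credit, amount, toAcct);
        return voltExecuteSQL(true); // final batch; commit happens on return
    }
}
```

The client never says "begin" or "commit"; it just invokes the procedure by name with the two account numbers and the amount, and the whole thing runs as one transaction on the server.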
So for the waiting-on-users problem, the external transaction control problem, the solution is that we don't do it. We don't give the user the ability to start a transaction, run some queries, and commit. Instead, we move the logic from the client side to the server side; we push the logic to the data. If you use external transaction control, you've introduced that waiting time and the network round-trip time, so you have to introduce concurrency to hide the latency, and the performance cannot be as good as having all the logic on the server side. So we push the logic to the data, not the other way around. Then, how do we use all the cores on a multi-core system? And I'm definitely not going to run the system on one single machine, right? Especially because I'm running an in-memory database: if I need a database with a capacity of two terabytes, I probably need a bunch of servers forming a cluster to store that. So partitioning, in that case, is a requirement. And then, how do we partition? Do we partition the data to the nodes? It's better to partition the data to the cores. If I have four machines with four cores each, I'd better treat them as 16 tiny little machines rather than four machines with some shared multiprocessing. So we view this as concurrency via scheduling, not via shared memory. If you look up VoltDB on the internet, you'll see documentation from us saying VoltDB is single-threaded and VoltDB does not have locks. OK, that sounds confusing: so you don't have concurrency? It doesn't say that. What we're actually saying is that when you run SQL queries, those run without locks. There are no shared data structures and no locks, but we do have concurrency code that routes the query, the transaction, to the right place. The concurrency is achieved via scheduling, not shared memory. We don't have shared data structures once the execution starts. All right. So I've crossed out many components here. One thing that's left is logging. That's what everyone's working on right now, too. Andy: oh really? You have a project on logging? Yes, but you can't get rid of it. So I know they want me to say you can get rid of logging so they don't have to do that project, right? Well, there are two parts to logging. The original H-Store position is that durability can be achieved through data replication. You've covered write-ahead logging, the WAL. What do you store in the write-ahead log? You store the before values and the after values, for redo and undo. For an in-memory database, you get durability through data replication: you don't need to worry about the redo part, because the other replicas hold that data for you. So that can be solved. But what about undo? Running single-partition transactions does not mean you never abort a transaction. You still need some kind of logging for the undo. So I can take part of the logging away, but I need to keep the undo logging. There's a tiny piece in the execution engine where I keep track of the changes I made in a transaction so that I can undo the transaction.
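As a rough illustration of that last piece, here is a toy in-memory undo log, not VoltDB's actual code: assuming a simple key-to-value table, it records a before-image for every write and applies the before-images in reverse order on abort.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;

// Toy undo log for a single-threaded, in-memory engine (illustration only).
class UndoLog {
    // One undo record: which key changed and what value it held before.
    record UndoRecord(long key, String beforeValue) {}

    private final Deque<UndoRecord> records = new ArrayDeque<>();

    // Call before each write; beforeValue == null means the key is new.
    void recordWrite(long key, String beforeValue) {
        records.push(new UndoRecord(key, beforeValue));
    }

    // On abort, restore before-images in reverse order of the writes.
    void rollback(Map<Long, String> table) {
        while (!records.isEmpty()) {
            UndoRecord r = records.pop();
            if (r.beforeValue() == null) table.remove(r.key());  // undo an insert
            else table.put(r.key(), r.beforeValue());            // undo an update
        }
    }

    void clearOnCommit() { records.clear(); }
}
```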
All right, that's basically all of it. Let's take a final look at what we ended up building. First, it's an in-memory relational SQL database, and there's no external transaction control; we use stored procedures. The execution engines are single-threaded, and they run in parallel to use all the cores; we partition the data to the cores. We have concurrency via scheduling, not via shared memory; we don't have shared data structures. Durability is achieved through replication, and we have serializable ACID, the strongest isolation level: serializable isolation means concurrent transactions run as if they were run serially. All right, so let's see more concretely what the system looks like. VoltDB is remarkably simple. So clear your mind of all the complexity, forget about your projects in the database course; it's kind of a Zen moment we're going to have. I want you to picture this image with me. First, there's an in-memory store: you have all your data in memory. Then, to operate on that data, you have an execution engine, a single-threaded execution engine. So we have the data and the engine to manipulate it. Then I give it a work queue so it has a place to pull transactions from for execution. And the most important part, to emphasize, is that this system runs in parallel: you have multiple of these single-threaded execution engines. They are partitioned to the cores, and the cores can come from the same machine or from multiple machines. So I'll show you a setting that has two nodes, each a dual-core machine. With two sites per host, I have two logical partitions on every node. All right, now I have a table. If I store this table and run transactions on it, how do I access the data? It depends on the data layout, how you store the table. We have two types of tables. The first type is the replicated table, where every logical partition has an identical, complete copy of the table. I distribute this copy to all the partitions; every partition gets the complete data. In that case, to route the read and write requests, I have something like a command router. If I read from a replicated table, where do I route the query? Everyone has the same copy, so it doesn't really matter; I can just pick one, and that serves the read request for a replicated table. How do I write? It's more complicated: because everyone has the same copy, any change needs to go everywhere to keep the database in sync. So the changes go everywhere, every partition gets the request to write the replicated table, and then you get the confirmation that the write completed successfully. The other type of table is the partitioned table, meaning you actually partition the data across the cluster, and we have a consistent hash function that does the calculation. For every partitioned table, you need to specify a column as the partitioning column; in this case, the player ID is the partitioning column of this table.
A hash function on that column then determines the hash value, which tells me which partition any given row is routed to. I represent that as four different colors on the graph, and I distribute the data to its corresponding partitions. In this case, I still have the command router. If I want to do a read, a SELECT query, remember that the player ID is the partitioning column. If the SELECT has the partitioning column in its filter, say I want the information where the player ID is 687, then you know from the left-hand side which partition it should go to. The command router calculates the hash value and routes the query to the corresponding partition for a single-partition read, and the result comes back. So what about single-partition writes? It's pretty much the same process, just with an UPDATE query: if the partitioning column is involved in the query, the router knows where to route the command. Again, it routes to the corresponding partition, the partition does the update, and the result gets back to you. All right. There are times when you don't have the partitioning column involved in the filter. In those cases, you don't know which single partition should execute the command, so it becomes a multi-partition read. Say this example doesn't have the partitioning column; then there's no one single partition that can execute the complete command, so I route it to every partition. Every partition says: OK, let me check whether I have any matching rows for you. Two of the partitions have matching rows, the other two don't. It doesn't really matter; everyone replies with a message, even if it's an empty message saying no matching rows were found. The router combines the responses and routes the result back to you. For multi-partition writes, it's the same idea: you have some query and you don't know which partition it should go to, so you send it to all partitions, every partition does the write, and then it gets back to you. It's fine if the response is empty and you didn't do anything; if you did do something, just give me a confirmation. All right.
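Here is a toy sketch of that routing decision, under the simplifying assumption of modulo hashing (VoltDB actually uses consistent hashing, and the class and method names here are made up): if the statement supplies a value for the partitioning column, the work is single-partition; otherwise it has to be broadcast to all partitions.

```java
import java.util.ArrayList;
import java.util.List;

// Toy command router (illustration only).
class CommandRouter {
    private final int numPartitions;

    CommandRouter(int numPartitions) { this.numPartitions = numPartitions; }

    // If the filter supplies the partitioning column, the work goes to
    // exactly one partition; otherwise it must be broadcast to all of them.
    List<Integer> route(Long partitioningKey) {
        if (partitioningKey != null) {
            int p = Math.floorMod(Long.hashCode(partitioningKey), numPartitions);
            return List.of(p);                  // single-partition read/write
        }
        List<Integer> all = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) all.add(p);
        return all;                             // multi-partition read/write
    }
}

// e.g. new CommandRouter(4).route(687L) -> exactly one partition;
//      new CommandRouter(4).route(null) -> all four partitions.
```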
In a multi-partition transaction, multiple partitions are involved, so a partition cannot just return directly; as I mentioned, the command router combines the results together and returns them to the client. And if any partition fails, you have to fail the whole transaction. So there's a kind of two-phase commit, and what VoltDB adopts, which is also what H-Store adopted, is to block the transaction until everyone finishes, until everyone reaches the same point, and that introduces some stall. That's inefficient. For single-partition transactions, you receive requests from the client, the coordinator (the router) hands you the commands, and you just queue them up sequentially; you can see the execution engine stays really busy running single-partition transactions back to back. But for a multi-partition transaction, the router sends the request to every partition, and every partition sends a prepare message after it finishes its part of the transaction, saying: I'm ready, everything went well, I'm going to commit. Then the coordinator has to get back to the partitions and say: OK, let's commit. So you wait for a network round trip, which introduces some stall, and it becomes less efficient than single-partition transactions. And here we actually do an early acknowledgement: if everything goes well and I've received the prepare messages, I directly return the result to the client while I'm giving the partitions my confirmation to commit the transaction. All right. Question: can people run the command router on the same machine as the data node? Is it a dedicated machine? It's what we call homogeneous: there's no master node or coordinator node. The router is an abstraction; everyone can do the routing. All right. So far I haven't covered anything about durability, the replication part. In the example I just gave you, this database is not durable. If I have a read transaction and I know which partition has the data, but before I send the request, that partition suddenly dies, maybe a power failure, whatever, then the data is not available. The data is completely lost, and you can only cry, because you have no backups, right? Now recall the multi-partition read: if I read a replicated table and the partition I go to suddenly dies, can I still get a response? Yes, I can find another partition to read from, because everyone has a complete copy. Then I'm not that stressed, because I have replication. Same idea here; due to space limitations I reduced this to two partitions. I make a complete copy of those two partitions on another node, so each partition gets its replica. Now, for the same query, which copy do I ask for the data? We define a partition leader, essentially a master among the replicas, and we always go to the partition leader for the data. But if I make the node die, and I detect that the node is dead, then I just turn to the replica for the read. So that's replication. It's funny, in Andy's course he talks about whether systems are active-active or active-passive. VoltDB uses active-active, meaning both the partition leader and the replica execute the transaction, and we check at the end that the results match. This is the illustration of active-active replication. When we route a transaction to the partition leader, we don't route it from the coordinator to the replica; the partition leader is responsible for forwarding the transaction to its replicas for them to execute, and the partition leader waits for the replicas to finish before we can commit the transaction. All right. If you recall the diagram for multi-partition transactions, they had stalls because of the network round trip. Does that mean a single-partition transaction with active-active replication is as bad as a multi-partition transaction? Actually, no, not in a k-safety setting. K-safety means I can tolerate the loss of K servers, and the normal practice is to keep K+1 copies of every partition. So replication only blocks the K+1 copies of one partition, because only those need to participate in the transaction, whereas an MP, a multi-partition transaction, blocks all the partitions. So that's the difference. All right.
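A bare-bones sketch of that blocking protocol could look like the following; this is toy code, not VoltDB's implementation. The coordinator fans the work out, blocks until every partition votes in phase one, and only then sends the phase-two decision; the blocking wait is exactly the stall described above.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Toy blocking two-phase commit across partitions (illustration only).
class TwoPhaseCommit {
    interface Partition {
        boolean prepare(String command);  // run the fragment, vote yes/no
        void commit();
        void abort();
    }

    static boolean execute(List<Partition> partitions, String command) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(partitions.size());
        try {
            // Phase 1: every partition runs its fragment and votes.
            List<Future<Boolean>> votes = partitions.stream()
                .map(p -> pool.submit(() -> p.prepare(command)))
                .toList();
            boolean allPrepared = true;
            for (Future<Boolean> v : votes) allPrepared &= v.get(); // blocking wait: the stall
            // Phase 2: unanimous yes means commit everywhere, else abort everywhere.
            for (Partition p : partitions) {
                if (allPrepared) p.commit(); else p.abort();
            }
            return allPrepared;
        } finally {
            pool.shutdown();
        }
    }
}
```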
One thing I need to emphasize for active-active replication is determinism: given the same starting state, if I run a command, a SQL query or a transaction, what I expect is that both the master and the replica arrive at the same result. I don't want the data in the two locations to diverge after running the transaction. That's determinism. So how do we guarantee the same result? There are several factors that can cause the data to diverge. One is query order. Suppose I have two partitions, a master and a replica. Both create a table, and both insert a row into the database, but starting from the third query the order differs: one site runs an INSERT of another value and then an UPDATE that changes the data, while the other switches them around and runs the UPDATE first and the INSERT at the end. Because the order of execution is different, you end up with different table contents, and that causes the data to diverge. The second factor is tuple order. Same setting: I create a table and insert a tuple, and on the first site it lands in the first slot, while on the second site it lands in the second slot. That's entirely possible, maybe because that site just happened to have a free memory spot at the second row location after recovering from a memory compaction. We don't guarantee that the first row inserted appears in the first slot and the second in the second. If I continue executing and insert another row, it's still fine: both tables contain the same two rows, one and two. But the problem arises if I execute a query like DELETE FROM table LIMIT 1. The first site says, OK, I can delete my first row, and the second site says, yeah, I can do that too, but the rows they delete happen to be different, and the data has diverged again. That's tuple order. In VoltDB we enforce tuple order, and sometimes we give determinism warnings: if we detect that your query risks diverging your data, we warn you, and in those cases you can add an ORDER BY clause to enforce an ordering and get a deterministic result. The last factor is function determinism. Say I have a table where I store a timestamp, maybe the production time, and I insert the current date. Tonight, just before the project deadline, you insert the current datetime; it's still 11:59 p.m., so the insert says December 3rd. But right at the moment the request is forwarded to the replica, the clock ticks over; it's 12 a.m. now, so the replica inserts December 4th. The data diverges again, because of function non-determinism. All right. In VoltDB we solve this by providing our own implementations of some of these functions, like current date, current time, or the random number function, because we can substitute something deterministic. Internally, when you execute a query like this, we translate the function call into a constant, so there's no disagreement between different sites, and we guarantee the result you insert into the database is deterministic. Question: why couldn't the router just substitute the current timestamp? That is something we do. For the random function, we give you a seed that's identical for all the partitions, and for the current datetime we give you the time, so you can insert that. Andy: so sort of what you're implying is that you rewrite the query to make the now function a constant? Not exactly like that... yeah, OK, sorry. Andy: I think this is actually super important; without it, VoltDB wouldn't actually work. What he's doing here is this: active-active replication means the transaction is going to run on every copy of the database you have on different nodes, and you don't want to have to coordinate while they're running; they run independently of each other. So all of these sources of non-determinism would cause two machines independently running the same transaction to diverge. These are the things you need to do to avoid that problem, so it's always guaranteed that if the database starts in the same state on both machines, it will be in the same state on both machines after the transaction. Yeah.
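As a sketch of the idea: the coordinator can fix the non-deterministic inputs once and ship them with the transaction, so every replica computes with the same values. The class below is invented for illustration; VoltDB's real procedure API exposes the same idea (if I recall the API correctly, through a transaction timestamp and a seeded random number generator on VoltProcedure).

```java
import java.util.Random;

// Illustration only: the coordinator fixes the non-deterministic inputs once,
// and every replica executes with the same values instead of consulting its
// own clock or its own random number generator.
class DeterministicContext {
    final long transactionTimeMillis; // stamped once by the coordinator
    final long randomSeed;            // same seed shipped to every replica

    DeterministicContext(long transactionTimeMillis, long randomSeed) {
        this.transactionTimeMillis = transactionTimeMillis;
        this.randomSeed = randomSeed;
    }

    long currentTimestamp() { return transactionTimeMillis; }  // NOW() becomes a constant
    Random random()         { return new Random(randomSeed); } // same sequence everywhere
}
```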
All right. So far we've covered almost all the parts on this diagram; we've handled the essential database operations, and the part that's left is logging. And that is basically the architectural overview of the original H-Store system. So now let's move on to the next part: after H-Store was commercialized as VoltDB, what kinds of things did we have to change, apart from the logos? There are so many logos; it's kind of surprising that a startup company had so many logo changes. I joined VoltDB as an intern in 2015 and came back as a full-time engineer in 2016, and I've only seen the red one on the top left and this one; the other three predate me. Ruth has seen many more than I have. So the first thing we had to change is durability, because nobody is really interested in a purely in-memory OLTP system; they don't feel safe. They don't believe you can get away with just active-active replication. There's a huge gap between the theory and the reality, between how people believe things and how you advocate them. So what we built into the system is command logging; that's the disk durability part. What we do is log the commands that we send to the partitions for execution; we write each command to disk to make it durable. I emphasized that determinism is very important to VoltDB query execution, and command logging benefits from it: if you have the same starting state and you replay the command log, you end up in the same ending state, and that brings your database back to its latest state. We also use serializable isolation as a performance trick, not a performance compromise. Command logging is logical logging, and you're not even logging the SQL queries; you're logging the name of the stored procedure. Think of it as logging the functions you invoke, where each function invokes a bunch of SQL. So I can just log "procedure one", which may contain a thousand SQL queries, but all I record is the name of the stored procedure. Why do we log the command and not the changes? Because a command has bounded size: you only have the procedure name and the list of parameters you pass to it. You can imagine how fast this is by comparing disk I/O throughput with network bandwidth.
When we compare memory and disk, we say memory is super fast, much faster than disk. But what happens if you compare disk throughput with network bandwidth? Basically, you can record as many requests as you receive from the network; the disk can keep up with logging all those commands. However, if you recorded the changes instead, a single transaction could modify a million tuples, and that would be a tremendous amount of updates to log. The other point is latency. Let's take a look: if you use a write-ahead log, you start logging but have to wait until the execution finishes, because you log the before and after values for redo and undo. But commands don't change when you execute the query; the command is the command. So I can move logging to the left on the timeline: I don't need to wait for the execution result to get the after values; as soon as I receive the command, I can hand it to the command logger and write it, right? And that's synchronous command logging. If I want to be more aggressive, I can make it asynchronous command logging: you don't check that the command landed on disk, that the disk flushed, before returning to the client. If you don't care, or care a bit less, about that guarantee, you can use asynchronous logging, and it still gets the commands to disk. Andy: what percentage of your customers run it asynchronously? Most of them. All right. And if you overlap logging with execution, it becomes a kind of pipelined process; that's asynchronous command logging. Oh, this is not the up-to-date slide. We also created a back-pressure mechanism, in case you run transactions too fast and give the disk a hard time catching up; the command logger signals the execution engine to slow down and wait. All right. In addition to that, we have something called snapshots, checkpoint snapshots, because at some point, if the database crashes and starts up again, you don't want to replay several million or tens of millions of transactions; that takes a long time. Instead, we truncate the command log periodically by taking a snapshot, meaning you dump all the table content stored in the database to disk, and the two mechanisms work together. We take periodic snapshots, full snapshots of your database, and if your database crashes and starts up again, you look for the most recent snapshot and restore its content; the command log is truncated from the point where the snapshot was taken. So you first restore the snapshot, which gives you all the tuples in the database, and then you replay the command log from that point, which brings you back to the most recent state of the database. Within a snapshot we have something I call, and this is my own name for it, two-version concurrency control, because once you've started a snapshot, the data you're about to read for the snapshot may still be changed by queries, and you don't want those changes flushed into the snapshot. So we created what's called a copy-on-write context. You can treat it as two-version concurrency control, as opposed to multi-version concurrency control: you keep the original version of the data from before the snapshot point, and after the snapshot finishes, the versions can be merged back together. All right.
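To make the write path concrete, here is a toy command logger; the class and the text format are made up, but it shows the two properties discussed above: each record is just a procedure name plus parameters, bounded in size and with no before/after images, and it can be written before execution even starts.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

// Toy command log (illustration only).
class CommandLog {
    private final BufferedWriter out;

    CommandLog(Path file) throws IOException {
        out = Files.newBufferedWriter(file,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // Log as soon as the command arrives: the command never changes, so
    // there is no need to wait for the transaction's result the way a
    // write-ahead log has to wait for after-values.
    synchronized void append(String procName, List<Object> params) throws IOException {
        out.write(procName + "|" + params);
        out.newLine();
        out.flush(); // synchronous mode; asynchronous mode would defer this flush
    }
}

// Recovery, in outline: restore the most recent snapshot, then re-invoke every
// logged command after it; determinism guarantees the same ending state.
```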
The second thing we added to VoltDB is cross-data-center replication. The replication we've talked about so far is within one database, for durability. Now think about the restart process. For a disk-based database, you have all the data on disk and you just replay the logs; that's a comparatively small amount of time. But restoring terabytes of data, or even just tens of gigabytes, from a snapshot is going to take a while. And sometimes people say: I don't want that long restore time; can I have a standby cluster that can be promoted at any time if the primary cluster fails? Another use case: suppose I operate a bank and have branches in the US and branches in China, say in Beijing. If I make a cash withdrawal from the branch in Beijing, I don't want the long waiting time, the round-trip time; it's about 200 milliseconds of latency for communication between Beijing and, say, Washington. I want geographically dispersed data centers so requests can be served close to where they originate. Those are the two main use cases for cross-data-center replication. So we built it, and we support active-active and active-passive. Active-active in this case means both clusters, the primary and the replica, can accept transactions: you can make cash withdrawals in Beijing that route to the Beijing data center, and cash withdrawals in Washington that route to the Washington data center. All right, this is the diagram for that. We also have a conflict resolution scheme, because if I'm withdrawing cash in the US while my parents use my card to withdraw cash from my bank account in Beijing, both sites will try to modify the data for my account, which should live on one single partition, and that may result in a conflict. In that case, we timestamp all the transactions accepted at the two data centers, and when a conflict occurs we log both of them; you can go back and resolve the conflict at the application level. Within the database, the last update always succeeds. All right. The third problem where we had to make changes is memory fragmentation. Although I listed it third, there's no particular order to these items, and on the time axis this one actually happened at the very beginning, soon after we had our first customer. We ran into this memory fragmentation issue because the customer said: I started a database and ran my business, and over time I didn't add much more data, I just ran transactions, but you keep taking more and more memory. Even if I truncate or delete some of the data, I don't see the memory consumption drop; I don't see any shrink in memory usage. So that made us think about memory compaction, because over time... yes? Question: can I go back to the timestamps, the cross-data-center part? You mentioned that you use the timestamp to resolve the conflict. How do you get the timestamps?
So the question is: if we use timestamps to resolve the conflicts, how do we get the timestamps? We have, I think, an NTP service across our data centers. And you cannot guarantee that the timestamps you get from two data centers are identical or always precise; there are always some offsets. So, as I said, when a conflict occurs, we tag each transaction with the current timestamp from its data center and record that in the binary log for the DR. We cannot resolve the conflict automatically for you; we let the latest update succeed and make the other one fail, but we log both changes, and it's at the application level that you address those conflicts. You can actually access all the conflict records in a log and write reconciliation logic. Because you're right: the two data centers will never be totally synchronized. You cannot expect the timestamps from data centers A and B to be totally in sync without any precision loss; there are always some offsets. All right, back to memory compaction. There are two things we did in the system. For the tuple storage, we bucket the blocks according to their occupancy. Here, due to space limitations, I only show four buckets; in the system there are 20 of them. We classify all the blocks of a table into those 20 buckets by occupancy, and whenever a transaction commits, we check whether we've reached the threshold for compaction. If so, then, in this example, I take the bucket of blocks that are 20% full and merge their tuples into the blocks that are 80% full. And for the indexes, because we use tree indexes, whenever we remove a tuple from the index, we swap it with the element at the end of the index and rebalance. All right.
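A toy version of the block compaction idea might look like this (invented code; the real system uses 20 occupancy buckets over fixed-size tuple blocks and triggers on a commit-time threshold):

```java
import java.util.Deque;
import java.util.List;

// Toy occupancy-based compaction (illustration only).
class TupleBlockCompactor {
    // Drain tuples from sparsely filled blocks into fuller blocks. Fully
    // drained blocks can then be returned to the OS, so memory usage
    // actually shrinks after deletes.
    static void compact(Deque<List<Object>> sparseBlocks,
                        Deque<List<Object>> fullerBlocks, int blockCapacity) {
        while (!sparseBlocks.isEmpty() && !fullerBlocks.isEmpty()) {
            List<Object> src = sparseBlocks.peek();
            List<Object> dst = fullerBlocks.peek();
            if (dst.size() >= blockCapacity) { fullerBlocks.pop(); continue; } // dst now full
            if (src.isEmpty())               { sparseBlocks.pop(); continue; } // src freed
            dst.add(src.remove(src.size() - 1)); // move one tuple
        }
    }
}
```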
A more recent issue we tried to resolve is the shared replicated table. This also comes from the field; a lot of customers complained about it. And it's also because, as I said, multi-core is the future and keeps growing: a lot of servers now have 32 cores, or even 48 cores, to take an example from one of our customers. Sharing saves a lot of space, but at the same time it adds a lot of engine complexity. The scenario is this. For replicated tables, we usually store lookup tables or dimension tables; often we just make the tables that are less frequently updated replicated. Many of our customers have a lot of replicated tables, and that concerns them in very high core-count configurations: each partition needs a complete copy of every replicated table, and that space consumption adds up very quickly. If a table has only 100 megabytes of data and you multiply by all the nodes and all the sites per host, you end up storing a huge amount of repeated data, even on the same machine, just because you have so many cores. So what we did is make it different: I don't want every partition to store a complete copy of the replicated table; instead, each node stores one complete copy. That dramatically drops the memory consumption of replicated tables and makes a lot more things possible. All right, in this case, how do I write to a table that's shared replicated? Sorry, this is not the latest version of the slide; I made some changes on the fly that somehow didn't propagate. On the left-hand side there are four engines, and the slide shows every engine with one copy; actually, only the lowest site has the copy of the data. If I have a query that writes to a replicated table, I still broadcast the transaction, but not all the sites are going to run it. The non-lowest-site partitions, marked with dashed blue arrows, actually do nothing; they just return. Only the lowest site, the one the purple arrow points to, does the update and returns, as if all the partitions had finished the transaction. Well, there were cases where this didn't work well; we actually spent a ton of time fixing all the bugs caused by it, because for various reasons (it might be in the snapshot path, or the command log path, or the DR path; I don't recall the exact reasons, there are tons of them) not all the partitions arrive. You need to make sure all the sites come to the same point when they receive the write request; only then can the lowest site start to write. So we introduced a latch in the execution engine, not for concurrency control, but because we want to synchronize all the partitions to the same starting point. The code works like this: every time I receive this write request, I enter the execution engine and count down the latch, and then there's an if statement: if I'm the lowest site, the chosen one to accomplish this mission, I do the write. But sometimes, for some reason, not all the sites make it to the countdown latch; some fall asleep, or they're off dreaming of something. The normal, working sites say: I'm counting on this latch, I'm waiting for it to reach zero so I can do some work. And the scheduler says: those partitions are busy working on something, so I won't even assign new transactions to the sleeping partition. Then you end up stuck in the execution engine: the current transaction cannot succeed because you're waiting on the countdown latch, and new transactions cannot come in because you look busy; you're single-threaded, so no new task gets assigned. On the slide, one engine is yelling at the other: what the f*** are you doing? Question: wait, an engine can say that to another engine? No, they're kind of dating or something, I don't know; I just made something up. Andy: I'm going to assume that partition has some long-running transaction that's not related to this. Yeah. Or it may be on the verge of failure. All right, let's move on.
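The synchronization just described maps naturally onto a countdown latch; here is a toy version, invented for illustration, that also shows how one site failing to arrive leaves everyone waiting:

```java
import java.util.concurrent.CountDownLatch;

// Toy version of the shared-replicated-table write barrier (illustration only):
// every site checks in at the latch, and only the lowest site performs the
// write once everyone has arrived.
class SharedReplicatedWrite {
    private final CountDownLatch allSitesArrived;
    private final int lowestSiteId;

    SharedReplicatedWrite(int sitesPerHost, int lowestSiteId) {
        this.allSitesArrived = new CountDownLatch(sitesPerHost);
        this.lowestSiteId = lowestSiteId;
    }

    // Called by each site's single-threaded engine for the same transaction.
    void execute(int mySiteId, Runnable writeToSharedCopy) throws InterruptedException {
        allSitesArrived.countDown();   // I have reached the write request
        allSitesArrived.await();       // blocks until every site checks in;
                                       // if one site never arrives, all hang
        if (mySiteId == lowestSiteId) {
            writeToSharedCopy.run();   // the chosen one does the actual write
        }
    }
}
```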
All right, another problem with this is the memory context switch. It becomes especially complicated when an operation involves both partitioned tables and replicated tables. Say I have a partitioned table, call it P, and a shared replicated table, call it R. Now I have a query that joins the two tables: I want to join a replicated table with a partitioned table. How do I do it? If I am the lowest site, the chosen one to accomplish this significant mission, I have the data for the replicated table and the data for my slice of the partitioned table in my partition, so I can do the join in the normal way. But what about the other partitions? They don't have the replicated table's data, because it lives in another partition. In order to do the join, I need to switch my memory context: I need to access the memory space of the other partition, to read the data stored there and get my replicated table data. That context switching is another root of evil. We also spent a ton of time making sure everything goes right. It's not that difficult by itself; it's especially complicated when it comes into play with all the other major components: durability, importer and exporter, DR, et cetera. When everything comes into play, it's very hard. Question: is this an engineering thing, or is it something more scientific? Are the engines running in the same process? I think yes. So it's technically just a pointer to another location in memory. Yeah, you just need to make sure you track the context correctly in all scenarios. It's not a research-y topic where you come up with an efficient algorithm; it's not like that. All right. The next part is materialized views, which is something that enables some of the streaming capabilities of VoltDB. The scenario is that the networking overhead is fixed, and there's some transaction overhead, scheduling overhead, et cetera. So imagine I have a table accepting incoming tuples; I'm inserting tuples all the time, at a rate like 500K per second. At the same time, I need to maintain a dashboard, like the query shown at the top of the screen, and from time to time I query the dashboard: give me the current, latest result. Now, when it comes to inserting tuples into memory, memory reads and writes are extremely fast; they come almost for free, and you pay very little overhead for writing a new tuple into an in-memory table. But executing a query with aggregate functions like this one requires scanning a lot of tuples in the table, and that costs a lot of overhead if you run the dashboard query frequently. So what we did is introduce materialized views as a first-class citizen in VoltDB, with very good support. When you add a tuple in memory, I also make another update in memory to maintain the materialized view. A lot of aggregate functions are distributive, meaning that for SUM or COUNT, when a new tuple arrives, you just increment the count by one and increment the sum by the value, and when you delete a tuple, you decrement them. Those are distributive. And some are not distributive, like MIN and MAX: when a new tuple comes in, you can still compute which is the smallest or largest, but when you delete a tuple, you have to use the index to find the next largest or next smallest one to update the view. So there's one more small operation attached to each write, which is not a lot of overhead. Then, whenever you need the dashboard result, instead of running that query all over again, you just query the materialized view: a specialized table we maintain in memory that users cannot modify but that we update internally, so it always reflects the up-to-date result of that particular query.
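Here is a toy sketch of that incremental maintenance, assuming a single-column view with COUNT, SUM, and MAX; the ordered multiset stands in for the index lookup described above for the non-distributive MAX:

```java
import java.util.TreeMap;

// Toy incremental maintenance of COUNT/SUM/MAX (illustration only).
class DashboardView {
    private long count = 0;
    private long sum = 0;
    // Ordered multiset of values, so MAX survives deletes; this plays the
    // role of the index lookup for the next-largest value.
    private final TreeMap<Long, Integer> values = new TreeMap<>();

    void onInsert(long v) {               // distributive: O(1) count/sum updates
        count++;
        sum += v;
        values.merge(v, 1, Integer::sum); // O(log n), the "index" part
    }

    void onDelete(long v) {               // assumes v was previously inserted
        count--;
        sum -= v;
        values.merge(v, -1, Integer::sum);
        if (values.get(v) == 0) values.remove(v);
    }

    long count() { return count; }
    long sum()   { return sum; }
    Long max()   { return values.isEmpty() ? null : values.lastKey(); }
}
```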
One major piece of work I did after I joined the company was adding support for join queries in views: the example here aggregates over one particular table, but you can actually create a materialized view over a join of two or three tables. Of course, performance can get very bad if you join a lot of tables in a materialized view, because you maintain the up-to-date result every time you update a table, but for the right use cases it's worth it. The next thing is importers and exporters. A lot of this comes close to stream processing: if you do very fast transaction processing, it starts to smell like stream processing, and having a real-time understanding of your data always has value people are interested in. Stream processing means you have incoming data ingested into your system at a very fast rate, and it can trigger some processing. You can imagine data inputs from sensor networks or the Internet of Things: very fast ingestion that triggers some kind of processing, like summarizing, keeping a digest of the stream, or taking actions triggered by the ingestion. All kinds of things, all about processing fast-moving data. Andy: if you know Kafka, it's Kafka. Yep, thank you for that, it's helpful. So in VoltDB we built a series of different importers and exporters, and you can also build your own custom importers and exporters to take in the ingested data and do some processing. You can process inside the database after the data is ingested, and then push it downstream, maybe to an OLAP system downstream that does the data warehousing. The last thing we added is more SQL support. VoltDB was criticized for its lack of standard SQL support. We had some basic SQL in the initial phase, when we started from the H-Store project, and over the years we added more and more SQL support. Actually, if you run a NoSQL system, no one criticizes you for lacking SQL support, but if you say you're a SQL database, they start criticizing you: you don't have this, you don't have that. I've even faced that kind of question myself when I visited customers: I can do this in MySQL, I can do this in Oracle; why do I have to rewrite it? You should support that; you're making my life miserable. Question: is it the dialect of SQL, or the SQL functionality? The SQL functionality. So we added a lot of things to the system: we added user-defined functions, we added common table expressions, and I'm just listing a couple; we also have a very large internal project to replace the planner to get better plans. That's all in progress. Andy: we'll cover Calcite in the advanced class next semester. Oh, great. You've got something to sign up for next semester. All right. In the final part, I will just briefly, because I don't think I have much time, cover some of the new research directions.
Everything you know about the H-Store engine, the single-threaded execution part, is the core of this family of research. You have a very fast OLTP transaction engine, and the research is about making it better, or giving it good capabilities for other things like stream processing. We'll cover three parts here; there are more. Stream processing is one. Then we talked about the inefficiency of multi-partition transactions: you have to wait for the network round-trip time, because you have to block the transaction until everyone finishes. How do we make that better? That's one of the research topics. Also, for an in-memory database, there's always the question of what happens if the data set is bigger than memory. If I start off with an in-memory database, it works fine, but over the years my business grows exponentially, and I suddenly find I have a huge data set that cannot easily fit in memory. Then what do I do? There's research on that too. For the stream processing part, there was a research prototype after H-Store called S-Store. It addresses some of the problems that VoltDB and H-Store didn't, introducing new constructs designed specifically for streams. They introduced what we call windows; you also see a similar concept in other database systems. I don't want to work over an indefinite stream; I want the window I'm interested in, like the last 10 minutes, or a certain number of the most recently updated rows. Also, when data ingestion happens, you need triggers to kick off the corresponding processing, the actions to take. And in VoltDB, when we schedule single-partition transactions, we actually don't guarantee order: if you schedule transactions T1, T2, and T3, the execution engine won't necessarily run them in the order T1, T2, T3; you might end up executing T2, T1, then T3. You cannot assume that the order in which you queued the calls in your client application is exactly the order in which the single-threaded engine receives and executes them. So S-Store introduces what they call workflows, with dependencies between transactions: you define a workflow that happens exactly as you expect, for stream processing. They also have a way to expire tuples, the tuple TTL, which we also introduced in VoltDB version 8.2. Next is larger-than-memory data management. There are times when you have a larger-than-memory data set, or you just don't have machines equipped with a very large amount of memory. So what do you do? More often than not, you can identify two kinds of data: the cold data that is touched less by the transaction workload, and the hot data that is more popular in the transactional system. With that observation, there is a series of approaches forming a workflow to deal with it. There are ways to identify hot tuples and cold tuples by collecting, sampling, and analyzing the transaction workload, and you can evict the cold tuples to disk: tuples you can foresee not needing in the near future don't need to sit in memory for immediate use, so you can evict them to disk.
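Before the policy questions that come next, here is a toy sketch of the eviction idea, not any system's real code: an LRU order stands in for the workload sampling, and a plain map stands in for the disk.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy cold-tuple eviction in the anti-caching spirit (illustration only).
class AntiCache<K, V> {
    private final int hotCapacity;
    private final LinkedHashMap<K, V> hot;               // in-memory, access-ordered
    private final Map<K, V> evictedToDisk = new HashMap<>(); // stand-in for disk

    AntiCache(int hotCapacity) {
        this.hotCapacity = hotCapacity;
        this.hot = new LinkedHashMap<>(16, 0.75f, true); // true = LRU ordering
    }

    void put(K key, V tuple) {
        hot.put(key, tuple);
        // Evict the coldest tuple once memory is "full".
        if (hot.size() > hotCapacity) {
            Map.Entry<K, V> coldest = hot.entrySet().iterator().next();
            evictedToDisk.put(coldest.getKey(), coldest.getValue());
            hot.remove(coldest.getKey());
        }
    }

    // A transaction referencing an evicted tuple pulls it back into memory.
    V get(K key) {
        V v = hot.get(key);
        if (v == null && (v = evictedToDisk.remove(key)) != null) put(key, v);
        return v;
    }
}
```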
Then come the questions. When do you evict, and do you have a threshold? How do you track the evicted tuples; do you have special data structures to track the tuples that are gone, that went to disk? When do you retrieve a tuple back if it is ever referenced by a transaction and you need it again? And do you retrieve just the tuple you need, or the entire block the tuple resided in? And when you retrieve the tuple, you also have to address where to store it: do you immediately put the tuple back in the original place it occupied in memory, or hold it in a separate location and merge it into the main tuple storage at a later time? Yes? Question: how is that different from a buffer pool? A buffer pool? Yes, good question. The question is how this larger-than-memory data management is different from the buffer pool we have in a disk-based database system. I would say they come from different architectural designs. The buffer pool is for a disk-based database system: the big assumption is that the majority of your data resides on disk, and the mission of the buffer pool is to manage pages well so that you minimize unnecessary disk access. This comes from a different way of thinking: the vast majority of the data is still in memory, that's the hot data, and you just put some of the data that you won't need in the near future onto disk to free some memory. The mechanics are similar, you're swapping tuples between memory and disk, but the thinking behind them is very different. Andy: a buffer pool is caching; you bring hot data from disk into memory. Anti-caching, which is what he's talking about, is the reverse of that: everything starts in memory, and you push cold data out to disk. Conceptually the same, but how you architect the system is different. Right. So there are a lot of implementations. There's the famous anti-caching; famous may not be the right word, but I was well aware of that paper a long time ago. Andy: maybe that's just for me. Besides anti-caching, a lot of other systems do this too, like Hekaton, the in-memory engine of SQL Server. There are many more; I cannot remember all of the names, but those are a few of them. So that's larger-than-memory data management. Now, smarter scheduling. If I end up having to block the partitions, can I do better on this multi-partition transaction problem? You can handle it from the data perspective, partition the data better; or, if you cannot change the partitioning scheme much, you can schedule a little better; or you can do both. Smarter scheduling is the scheduling half. There are many kinds of optimizations; most come from this paper, and if you're interested, you can take a look. One of them we actually have a prototype for: it's the second one, locking the involved partitions instead of all partitions. Instead of blocking all partitions, I figure out which partitions, out of all of them, the transaction is actually going to use: if I have four partitions, maybe my transaction cannot run single-partition, but it will only use two of them.
Now, smarter scheduling. If I end up having to block partitions, can I do better with this multi-partition transaction problem? You can handle it from the data perspective and partition the data better; or, if you can't change the partitioning scheme much, you can schedule a bit better; or you can do both. Smarter scheduling is the second of those. There are many kinds of optimizations, and most of them come from this paper; if you're interested, take a look. One of them we actually built a prototype for: locking the involved partitions instead of all partitions. Instead of blocking every partition, I figure out exactly which partitions, out of all of them, the transaction is going to use. If I have four partitions, maybe my transaction, although it cannot run single-partition, will only use two of them. So I lock only those two and let the other two go; they can do other work. Another option is to deliver the transaction to the data-heavy node. There are data exchanges between the partitions in a multi-partition transaction; for example, a distributed sort or join has to shuffle data from one node to another. If I make the data-heavy node the coordinator, then most of the time I'm only retrieving lightweight data from the other nodes, while the heaviest share stays local and never crosses the network. There's also an optimization that disables undo logging for transactions that cannot abort, though that doesn't help much for multi-partition transactions, because there it's hard to know in advance. Then there's speculative concurrency control. The idea is: while I'm waiting for the coordinator to tell me whether to commit or abort, instead of idling, why don't I go ahead and execute some other transactions that are unlikely to overlap with the multi-partition transaction in flight? If I'm lucky, it turns out after the commit that the data touched by the speculative execution had no impact on the multi-partition transaction's result, and I cheated my way to extra throughput. If I'm out of luck, I take a cascading rollback and abort the series of speculatively executed single-partition transactions. Another important part of this paper is a model that predicts the behavior of a stored procedure, which partitions it is likely to touch, and uses that forecast to choose the right optimization.

Smarter partitioning is the other half of the problem: instead of scheduling better, make the partitioning itself better. In this paper there's a search algorithm called large-neighborhood search: you sample the workload, learn how transactions touch the data, and come up with a better partitioning scheme. The paper also introduced the replicated secondary index. Say I'm running a single-partition transaction that joins two tables, but the other table I'm accessing is not partitioned on the same join key, so I would normally need remote access to data on other nodes. If I know that ahead of time, I can create a replicated secondary index, think of it as a projection in Vertica: it holds some of the column data, and every node keeps a copy. Then, when I execute that kind of query, instead of going out to other nodes for the data, I can execute it locally against the secondary index.

A step further than that is E-Store, elastic partitioning. The idea is that my workload may have a pattern, some fluctuation: maybe it's quiet before Thanksgiving, peaks at Thanksgiving, and accesses certain data in a pattern. How do I handle that? I can reshuffle the data online, responsively. So E-Store introduces two-tier partitioning. In addition to the hash-value partitioning, the hash ring, you can place the most frequently accessed data, the hot tuples, in another way: a one-to-one mapping, a second partitioning scheme just for them. I monitor the workload at the tuple level, come up with a tuple placement plan, and shuffle the hot tuples accordingly. The database is then no longer strictly partitioned by the hash function on the partitioning column you chose; on top of that, there's an auxiliary mechanism that places tuples according to how hot they are in the transaction workload. The routing sketch below illustrates the idea.
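Here is a hedged sketch of what that two-tier routing could look like in Java; the names are mine, not E-Store's. Tier one is ordinary hash partitioning; tier two is a per-tuple override map that a workload monitor (not shown) fills in for hot keys.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of E-Store-style two-tier routing under assumed simplifications:
// tier 1 is plain hash partitioning, tier 2 is a per-tuple override map
// for hot keys that the monitor decided to place individually.
public class TwoTierRouter {
    private final int numPartitions;
    // Tier 2: hot keys get an explicit one-to-one placement.
    private final Map<Long, Integer> hotPlacement = new HashMap<>();

    public TwoTierRouter(int numPartitions) { this.numPartitions = numPartitions; }

    public int partitionFor(long key) {
        Integer override = hotPlacement.get(key);
        if (override != null) return override;                    // tier 2: hot tuple
        return Math.floorMod(Long.hashCode(key), numPartitions);  // tier 1: hash ring
    }

    // Called by the (not shown) workload monitor when it migrates a hot tuple.
    public void pin(long key, int partition) { hotPlacement.put(key, partition); }
    public void unpin(long key) { hotPlacement.remove(key); }
}
```

The point of the second tier is exactly that one-to-one mapping: a handful of very hot tuples can be placed, and re-placed, individually without re-hashing everything else.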
All right, that wraps up the technical content. Today we talked about some history of H-Store and VoltDB, the key design decisions and questions and how we settled them, and an architectural overview, where we emphasized the partitioning scheme, how tables are replicated or partitioned, and the shared replicated tables that resolve the memory-consumption issue. We talked about what happens when a research product enters the industrial world and how people receive it: you have to add disk durability, otherwise people don't feel safe, plus a lot of other enterprise features to complete the product. And in the final part, we talked briefly about the research frontier that followed, after H-Store and VoltDB diverged. If you're interested, you can find all of those papers online on the H-Store homepage. A lot of this is work I'm genuinely interested in; at VoltDB we actually have an innovation week this week, a week with no engineering assignments where you can pick any project you'd like to work on, and several of these ideas are things I'd love to explore. But someone could explore them as part of the internship, correct? Right, of course, yeah. So I'm ready to take any questions you have. All right, let's thank him. All right, it's time for a few questions. Yes? I think he was very involved with this. So her question was: who is Andrew Pavlo? And I want Andrew Pavlo to raise his hand. So either you hate this course or you love this course. The reason this course exists is because of H-Store; I got this job at CMU because of H-Store, right? So what can I tell you? If you love the course, you love H-Store. If you hate me and you hate the course, then you hate both of them and you hate H-Store. If I had to do it all over again, I would have put Andrew Pavlo in my name with a fake middle initial, AXPOP or something like that. Yeah. Any real question? So, what's the hardest thing about working on an in-memory database? These guys have been struggling; the entire class is working on their own database for the course, and there are issues with the grading scripts, but ignore that, right? From an engineering standpoint, what's the hardest aspect, the hardest thing you have to be aware of, when working on an in-memory database? I'll say... Could I repeat the question? Oh, okay, thanks for reminding me. The question was: what's the hardest thing I've ever worked on in an in-memory database, from an engineering standpoint? I actually already talked about one thing that gave us a hard time, the shared replicated tables. The hard part is when you have to change things because what was originally planned doesn't work well in some cases in reality, like the shared replicated tables taking too much memory, or the disk durability issue where we had to build command logging, because we had left so many things out.
We didn't have locks or latches; we added latching back for the shared replicated tables. We didn't have MVCC, we didn't have that kind of concurrency control, but we had to start thinking about it when we began working on command logging, because then you need at least two versions. Two-version concurrency control is a lot easier than full multi-versioning, but it was still something we needed to add back, at least to some extent. Yes, real quick. The slide mentioned that there might be some query disordering, but I'm wondering why: if all the queries travel together under a reliable network protocol, why is there disordering? So the question refers to one of my slides, the one about determinism, where I said the query order is crucial: if queries run out of order, your replicas drift out of consistency. The question is under what circumstances queries can arrive out of order. In many cases there may be packet drops and you don't have a reliable network; or think about the data replication scenario I showed in the slides: if I replicate transactions to another data center, during that process they may not be delivered in exactly the order they were originally issued. I can't give you a very concrete example, though. I can. If you have non-determinism in the stored procedure, and we talked about this with PL/pgSQL, the if-clauses: suppose the procedure says, if today is Monday, execute this query. If the two clocks are out of sync, one replica thinks it's Tuesday and the other thinks it's Monday, and you're in trouble, right? So they avoid all of that by pre-seeding the random number generators and fixing the dates before the transaction runs. If you use Java's random function, say, if the random number mod two equals one, execute this query, otherwise execute that query, then you get divergence right at the origin, at the source. A sketch after this exchange makes the point concrete. All right, let's thank him again. Thank you, everyone.
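To illustrate the determinism point from that last exchange, here is a sketch of a VoltDB-style stored procedure in Java. The table and column names are made up, and the helper methods are my best recollection of the `VoltProcedure` API, so treat the exact signatures as assumptions; the idea is simply that the branch condition must come from transaction-scoped state, not from a local clock.

```java
import java.util.Calendar;
import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;
import org.voltdb.VoltTable;

// Hypothetical procedure: route a visit into a weekday or weekend table.
// visits_weekday / visits_weekend are invented table names.
public class RecordVisit extends VoltProcedure {
    public final SQLStmt weekday = new SQLStmt(
            "INSERT INTO visits_weekday (user_id) VALUES (?);");
    public final SQLStmt weekend = new SQLStmt(
            "INSERT INTO visits_weekend (user_id) VALUES (?);");

    public VoltTable[] run(long userId) {
        // BAD: `new java.util.Date()` here could differ between replicas
        // executing the same transaction, so they could branch differently.
        // GOOD: the transaction's own timestamp is fixed once and shipped
        // with it, so every replica sees the same "now".
        Calendar cal = Calendar.getInstance();
        cal.setTime(getTransactionTime());   // assumed VoltProcedure helper
        int day = cal.get(Calendar.DAY_OF_WEEK);
        if (day == Calendar.SATURDAY || day == Calendar.SUNDAY) {
            voltQueueSQL(weekend, userId);
        } else {
            voltQueueSQL(weekday, userId);
        }
        return voltExecuteSQL(true);         // final batch for this transaction
    }
}
```

Calling `new java.util.Date()` or an unseeded `Random` inside `run` would let replicas take different branches; deriving the date from the transaction, and randomness from a seeded generator shipped with it, keeps every replica on the same path.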
And I'm biased because of the whole H-Store background. So, VoltDB is open source, GPL-licensed, Java plus C++. There's a community edition if you want to play with it. But it's a little trickier than, say, MySQL or SQLite, because you have to write these stored procedures. There is a terminal you can type queries into, but that's not the way you actually want to run it. Okay? All right, guys. So, on Wednesday we're going to have the systems potpourri. We have a shocking development: there is actually a new system at the top, so this is different from every other year, and I'm excited to see what you guys think about that. And then we'll also do the final review at the beginning of class, and I'll post the practice final exam on the website later this week. Okay? All right, guys. Good luck with your classes. Thank you.