All right, everybody, let's get started. Today's paper is Amazon's Aurora paper, which is all about how to build a high-performance, reliable database as a piece of cloud infrastructure, itself built out of infrastructure that Amazon makes available. So why are we reading this paper? First of all, it's a very successful recent cloud service from Amazon; a lot of their customers use it. It's also an example of a very big payoff from clever design: Table 1, which summarizes the performance, claims a 35-times speedup in transaction throughput relative to some other system that's not very well explained, which is extremely impressive. The paper also explores the limits of how well you can do for performance and fault tolerance using general-purpose storage, because one of its themes is that they basically abandoned general-purpose storage: they started with a design that used Amazon's own general-purpose storage infrastructure, decided it was not good enough, and built totally application-specific storage instead. Furthermore, the paper has a lot of little tidbits about what turned out to be important in this cloud infrastructure world. Before talking about Aurora itself, I want to spend a bit of time going over the back history, or at least my impression of the story that led up to Aurora's design, because Aurora is something like the nth in a series of ways Amazon has had in mind for how its cloud customers ought to build databases on Amazon's infrastructure. In the beginning, Amazon's very first cloud offering, to support people who wanted to build websites using Amazon's hardware in Amazon's machine rooms, was something called EC2, for Elastic Compute Cloud. The idea was that Amazon had big machine rooms full of servers, ran virtual machine monitors on those servers, and rented out virtual machines to customers. A customer would rent a bunch of virtual machines and run web servers, databases, and whatever else they needed inside these EC2 instances. The picture of one physical server looked like this: Amazon controlled the virtual machine monitor on the hardware server, and above it sat a bunch of guests, a bunch of EC2 instances, each rented to a different cloud customer, and each running a standard operating system like Linux plus a web server or maybe a database server. These were relatively cheap and relatively easy to set up, and it was a very successful service. One little detail that's extremely important for us: initially, the way you got storage if you rented an EC2 instance was that every server had a physical disk attached, and each instance rented to a customer got a slice of that disk. So you got a bit of locally attached storage, which looked like an ordinary hard drive, an emulated hard drive, to the virtual machine guest. EC2 was perfect for stateless web servers: your customers' web browsers would connect to a bunch of rented EC2 instances running a web server.
And if you suddenly got more customers, you could just instantly rent more EC2 instances from Amazon and fire up web servers on them: an easy way to scale up your ability to handle web load. So it was good for web servers. But the other main thing people ran in EC2 instances was databases, because usually a website is constructed from a set of stateless web servers that, any time they need to get at permanent data, talk to a back-end database. So the picture is: a bunch of client browsers in the outside world, outside Amazon's infrastructure; then a number of EC2 web server instances, as many as you need to run the logic of the website, now inside Amazon; and then typically one EC2 instance running a database. Your web servers would talk to your database instance and ask it to read and write records in the database. Unfortunately, EC2 wasn't perfect: it wasn't nearly as well suited to running a database as it was to running web servers. The most immediate reason is that the main easy way to get storage for your EC2 database instance was the locally attached disk on whatever piece of hardware your instance happened to be running on, and if that hardware crashed, you also lost access to whatever was on its disk. If the hardware implementing a web server crashed, no problem at all: the web server keeps no state itself, so you just fire up a new web server on a new EC2 instance. But if the EC2 instance running your database, that is, the hardware under it, crashed or became unavailable, you had a serious problem: the data stored on the locally attached disk was gone with it. Initially, at least, there wasn't a lot of help for dealing with this. One thing that did work out well is that Amazon provided a scheme for storing large chunks of data called S3, and you could take periodic snapshots of your database state, store them in S3, and use that for backup and disaster recovery. But that style of periodic snapshots means you're going to lose updates that happen between the backups. All right, the next thing that came along that's relevant to the Aurora story is that, in order to provide customers with disks for their EC2 instances that didn't go away if there was a failure, that is, more fault-tolerant long-term storage that was guaranteed to be there, Amazon introduced a service called EBS, which stands for Elastic Block Store. EBS is a service that looks to an EC2 instance, to one of these guest virtual machines, just like an ordinary hard drive. You can format it and put a file system like ext3 or whatever Linux file system you like on it. It looks to the guest just like a hard drive, but the way it's actually implemented is as a replicated pair of storage servers. So once EBS came out, you could rent an EBS volume, this thing that looks just like an ordinary hard drive but is actually implemented as a pair of EBS servers, each with an attached hard drive. If your software, maybe a database now, mounts one of these EBS volumes as its storage, then when the database server does a write, what that actually means is that the write is sent out over the network.
Using chain replication, which we talked about last week, your write is first written to EBS server one, the first EBS server backing your volume, then to the second one, and finally you get the reply. And similarly, when you do a read in chain replication, you read from the last server in the chain. So now databases running on EC2 instances had available a storage system that would survive the crash, the death, of the hardware they were running on. If the physical server died, you could just get another EC2 instance, fire up your database, and have it attach to the same old EBS volume that the previous incarnation of your database was attached to. It would see all the old data just as the previous database had left it, just like moving a hard drive from one machine to another. So EBS was a really good deal for people who needed to keep permanent state, like people running databases. One thing that's important for us about EBS is that it's not a system for sharing: at any one time, only one EC2 instance, one virtual machine, can mount a given EBS volume. The EBS volumes are implemented on a huge fleet of hundreds or however many storage servers with disks at Amazon, and everybody's EBS volumes are stored on this big pool of servers, but each volume can be used by only one EC2 instance, one customer. All right, still, EBS was a big step up, but it still had some problems; some things were not quite as perfect as they could be. One is that if you run a database on EBS, it ends up sending large volumes of data across the network. We're now starting to sneak up on Figure 2 in the paper, where they start complaining about just how many writes it takes to run a database on top of a network storage system. The database on EBS ended up generating a lot of network traffic, and one of the things the paper implies is that they are as much network-limited as they are CPU- or storage-limited. That is, the Aurora paper spends a huge amount of attention on reducing the network load that the database generates, and seems to worry less about how much CPU time or disk space is being consumed. So that's a hint at what they think is important. The other problem is that EBS is not very fault tolerant. It turns out that for performance reasons, Amazon would always put both replicas of your EBS volume in the same data center. If a single server crashed, if one of the two EBS servers you were using crashed, that's okay, because you switch to the other one. But there was just no story at all for what happens if an entire data center goes down, and apparently a lot of customers really wanted a story that would allow their data to survive an outage of an entire data center: maybe it lost its network connection, or there was a fire in the building, or a power failure for the whole building. People really wanted at least the option, if they were willing to pay more, of having their data stored in a way that they could still get at it even if one data center went down. The way Amazon would describe this is that both an instance and its two EBS replicas are in the same availability zone; in Amazon jargon, an availability zone is a particular data center.
The way they structure their data centers is that there are usually multiple independent data centers in more or less the same city, or relatively close to each other, and the multiple availability zones, maybe two or three, that are near each other are all connected by redundant high-speed networks. So there are always pairs or triples of nearby availability zones, and we'll see why that's important in a little bit. But for EBS, in order to keep the costs of chain replication down, they required the two replicas to be in the same availability zone. All right. Before I dive further into how Aurora actually works: in order to understand the details of the design, we first have to know a fair amount about the design of typical databases, because what they've taken is the main machinery of a database, MySQL as it happens, and split it up in an interesting way. So we need to know what a database does so we can understand how they split it up. This is really a kind of database tutorial, focusing on what it takes to implement transactions, crash-recoverable transactions. What I really care about is transactions and crash recovery; there's a lot else going on in databases, but this is the part that matters for this paper. So first, what's a transaction? A transaction is just a way of wrapping multiple operations on maybe different pieces of data and declaring that the entire sequence of operations should appear atomic to anyone else who's reading or writing the data. Suppose we're running a bank and we want to do transfers between different accounts. A transaction might look like this: you declare the beginning of the sequence of operations that you want to be atomic, then the operations themselves. Say we're transferring money from account y to account x, where we'll pretend x and y are bank balances stored in the database: add $10 to x's account, deduct the same $10 from y's account, and that's the end of the transaction. I want the database to do both without allowing anybody else to sneak in and see the state between these two statements. And also, with respect to crashes: if there's a crash somewhere in the middle, we want to make sure that after the crash and recovery, either the entire transaction's worth of modifications are visible or none of them are. That's the effect we want from transactions. Additionally, database users expect that the database will tell the client that submitted the transaction whether the transaction really finished and committed. And if a transaction committed, clients expect it to be permanent, durable, still there even if the database should crash and reboot. One thing that's a bit important is that the usual way these are implemented is that the transaction locks each piece of data before it uses it. So you can view there as being locks on x and y for the duration of the transaction, and these are only released after the transaction finally commits, that is, is known to be permanent. This is important for some of the details in the paper, which only make sense if you realize that the database is actually locking out other access to the data during the life of a transaction.
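To make that concrete, here's a minimal sketch in Go; it's my own toy illustration, not anything from the paper or from MySQL. The Transfer function plays the role of the transaction: it takes the lock, applies both updates, and releases the lock only when the whole thing is done, so nobody can see the intermediate state. (It only illustrates the locking; crash atomicity needs the logging we're about to discuss.)

```go
package main

import (
	"fmt"
	"sync"
)

type DB struct {
	mu       sync.Mutex     // one big lock for simplicity; a real DB locks per record
	balances map[string]int // pretend these live on data pages
}

// Transfer is the bank-transfer transaction: both updates happen while the
// lock is held, so no other transaction can see the intermediate state.
func (db *DB) Transfer(from, to string, amount int) {
	db.mu.Lock()         // "acquire the locks on x and y"
	defer db.mu.Unlock() // released only once the whole transaction is done

	db.balances[to] += amount   // x = x + 10
	db.balances[from] -= amount // y = y - 10
}

func main() {
	db := &DB{balances: map[string]int{"x": 500, "y": 750}}
	db.Transfer("y", "x", 10)
	fmt.Println(db.balances) // map[x:510 y:740]
}
```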
So how is this actually implemented? It turns out that, at least in the simple model, databases are typically written to run on a single server with some storage directly attached, and the game the Aurora paper is playing is to move that software, only modestly revised, onto a much more complex networked system. But the starting point is just a database server with an attached disk. The on-disk structure that stores the records is some kind of indexing structure, like a B-tree maybe. So there are pages, what the paper calls data pages, that hold the real data of the database: maybe this one holds x's balance, and this one holds y's balance. These data pages typically hold lots and lots of records; x and y are typically just a few bytes each on some page in the database. On the disk there's the actual data, plus there's also a write-ahead log, or WAL, and the write-ahead log is a critical part of why the system is going to be fault tolerant. On the database server there's the database software, which typically keeps a cache of pages it has recently read from the disk. When you execute a transaction, what a statement like x = x + 10 really turns into at runtime is that the database reads the current page holding x from the disk into its cache and adds 10 to the cached copy. But until the transaction commits, it only makes the modifications in the local cache, not on the disk; we don't want to write the disk yet and possibly expose a partial transaction. Because the database wants to pre-declare the complete transaction, so that it's available to the software after a crash and during recovery, before the database is allowed to modify the real data pages on disk it's first required to add log entries that describe the transaction. That is, before it can commit the transaction, it needs to put a complete set of entries in the write-ahead log on disk describing all of the transaction's modifications. So let's suppose x starts out as, say, 500, y starts out as 750, and we want to execute this transaction. Before committing, and before writing the data pages, the database is going to add typically three log records to the on-disk log. One says: as part of this transaction, I'm modifying x, its old value is 500, and here's the new value, 510. That's one log record. Another is for y: the old value is 750, we're subtracting 10, so the new value is 740. And then, if the database actually manages to get to the end of the transaction before crashing, it writes a commit record. Typically these are all tagged with a transaction ID so that the recovery software will eventually know which log records this commit record refers to. Yes? In a simple database, wouldn't it be enough to just store the new values and say, well, if there's a crash, we just reapply all the new values?
The reason most serious databases store the old values as well as the new values is to give the database freedom, even before a long-running transaction has finished, to write an updated page to disk with the new value, the 740 let's say, from an uncompleted transaction, as long as the corresponding log record is on disk first. If there's then a crash before the commit, the recovery software will say: this transaction never finished, so we have to undo all of its changes, and the old values are exactly what you need to undo a transaction that's been partially written to the data pages. So Aurora indeed uses undo-redo logging, so it can undo partially applied transactions. OK. If the database manages to get the transaction's log records onto the disk, along with the commit record marking it finished, then it's entitled to reply to the client that the transaction is committed, and the client can be assured its transaction will be visible forever. Then one of two things happens. If the database server doesn't crash, then eventually, having modified the cached x and y records to 510 and 740, it will write the cached updated blocks to their real places on the disk, overwriting those B-tree nodes or whatever, and then it can reuse that part of the log. Databases tend to be lazy about this, because there may be many more updates to those cached pages, and it's nice to accumulate a lot of updates before being forced to write the disk. If instead the database server crashes before writing those pages to disk, so they still hold their old values, then the recovery software, when you restart the database, is guaranteed to scan the log, see the transaction's records, see that the transaction committed, and apply the new values to the stored data. That's called redo: it basically redoes all the writes in the transaction. So that's, in a nutshell, how transactional databases work, a very abbreviated version of how, for example, MySQL works; Aurora is built on this open-source database, MySQL, which does transactions and crash recovery in much this way.
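Here's a toy sketch of that redo/undo recovery logic in Go. The record layout is an assumption of mine for illustration; a real MySQL log is far more involved. Update records carry a transaction ID plus the old and new values, a commit record marks the transaction durable, and recovery redoes committed transactions and uses the old values to back out uncommitted ones.

```go
package main

import "fmt"

// Record is one write-ahead-log entry: either an update (old and new value
// for one key, tagged with its transaction) or a commit marker.
type Record struct {
	TxID   int
	Commit bool
	Key    string
	Old    int // for undo
	New    int // for redo
}

// Recover replays the log against the on-disk pages: redo the writes of
// committed transactions, undo (using old values) the writes of
// transactions that never logged a commit record.
func Recover(log []Record, pages map[string]int) {
	committed := map[int]bool{}
	for _, r := range log {
		if r.Commit {
			committed[r.TxID] = true
		}
	}
	for _, r := range log { // redo pass, forward through the log
		if !r.Commit && committed[r.TxID] {
			pages[r.Key] = r.New
		}
	}
	for i := len(log) - 1; i >= 0; i-- { // undo pass, backward
		if r := log[i]; !r.Commit && !committed[r.TxID] {
			pages[r.Key] = r.Old
		}
	}
}

func main() {
	// Crash state: tx 7 committed but its write to x never reached the page;
	// tx 8's write to y was flushed early, but tx 8 never committed.
	pages := map[string]int{"x": 500, "y": 740}
	log := []Record{
		{TxID: 7, Key: "x", Old: 500, New: 510},
		{TxID: 7, Commit: true},
		{TxID: 8, Key: "y", Old: 750, New: 740},
	}
	Recover(log, pages)
	fmt.Println(pages) // map[x:510 y:750]: tx 7 redone, tx 8 undone
}
```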
OK, so the next step in Amazon's development of better and better database infrastructure for its cloud customers is something called RDS. I'm only talking about RDS because, even though the paper barely mentions it, Figure 2 in the paper is basically a description of RDS. RDS was a first attempt at a database replicated in multiple availability zones, so that if an entire data center went down, you could get back your database contents without missing any writes. The deal with RDS is that you have one EC2 instance as the database server, just one, running one database. Instead of on the local disk, it stores its data pages and log in EBS, so whenever the database does a log write or a page write, those writes actually go to its two EBS replicas. That's all in one availability zone. In addition, for every write the database software did, Amazon would transparently, without the database necessarily even realizing it was happening, also send the write to a special setup in a second availability zone, a second machine room: judging from Figure 2, apparently a separate computer or EC2 instance whose job was just to mirror the writes the main database did. That mirroring server would then copy the writes to a second pair of EBS servers. So with this RDS setup in Figure 2, every time the database appends to the log or writes one of its pages, the data has to be sent to the two local EBS replicas, and also sent over the network connection to the other availability zone on the other side of town, to the mirroring server, which sends it to its own two EBS replicas. Only when the reply finally came back would the write be finished; only then would the database see: aha, my write is done, I can count this log record as really appended to the log. This RDS arrangement gets you much better fault tolerance, because now there's a complete, up-to-date copy of the database, seeing all the very latest writes, in a separate availability zone. Even if a fire burns down the entire first data center, boom, you can run the database on a new instance in the second availability zone and lose no data at all. Yes? So this is not how EBS itself works? I don't know quite how to answer that; it's just not what they do. My guess is that for most EBS customers it would be too painfully slow to forward every write to a separate data center. I'm not really sure what's going on, but the main answer is they don't do that, and this mirroring is a bit of a workaround that uses the existing EBS infrastructure unchanged. As Table 1 shows, this turns out to be extremely expensive, or anyway as expensive as you might fear. We're writing fairly large volumes of data, even for this transaction, which seems like it just modifies two integers, maybe eight or sixteen bytes in all. In terms of the database reading and writing the disk: the log records themselves are quite small, these two log records might only be dozens of bytes long, so that's nice. But the reads and writes of the actual data pages are likely to be much, much larger than a couple of dozen bytes, because each page is eight kilobytes or sixteen kilobytes or some relatively large file-system or disk block size. So just to update these two numbers, when it comes time to write the data pages, a lot of data gets pushed around. On a locally attached disk that's reasonably fast, but I guess what they found is that sending those big eight-kilobyte writes across the network used up too much network capacity. So this Figure 2 arrangement evidently was too slow. Yes? So in this Figure 2 setup, unknown to the database server, every time it called write on its EBS disk, a copy of every write went across availability zones and was written to both of the remote EBS servers and then acknowledged; only then did the write appear to complete to the database. It really had to wait for all four copies to be updated, and for the data to cross the link to the other availability zone.
As far as Table 1 is concerned, that first performance table, the reason the mirrored MySQL line is so much slower than the Aurora line is basically that it sends huge amounts of data over these relatively slow network links. That's the performance problem they were really trying to fix. So this design is good for fault tolerance, because now there's a second copy in another availability zone, but it was bad news for performance. All right, the next step after this is Aurora, and the high-level view of the setup is that we still have a database server, although now it runs custom software that Amazon supplies. I can rent an Aurora server from Amazon, but I'm not running my own database software on it; I'm renting a server running Amazon's Aurora database software. It's just one instance sitting in some availability zone. There are two interesting things about the setup. The first is that the data storage, Aurora's replacement for EBS basically, now involves six replicas, two in each of three availability zones, for super fault tolerance. When the database writes, and this is complicated, we're not sure exactly how it's managed, one way or another the writes have to get sent to all six of these replicas. So this looks like more replicas; gosh, why isn't it slower than the previous scheme, which had only four replicas? The answer is that the only thing being written over the network is log records: that's really the key to success, the data that goes over these links to the replicas is just the log records, the log entries. As you saw, a log entry, at least in our simple example, and it's not quite this small in reality, is really not vastly more than the couple of dozen bytes needed to store the old value and the new value of the piece of data being written. So log entries tend to be quite small. Whereas when we had a database that thought it was writing a local disk and was updating its data pages, those writes tended to be enormous; the paper doesn't really say, but I think eight kilobytes or more. So the Figure 2 setup was sending multiple eight-kilobyte pages per transaction to the replicas, whereas this setup sends small log entries to more replicas, and the log entries are so very much smaller than the 8K pages that it's a net performance win. OK, so that's one of their big insights: send just the log entries. Of course, a fallout from this is that their storage system is no longer very general purpose: it's a storage system that understands what to do with MySQL log entries. EBS was very general purpose; it just emulated a disk, you read and write blocks, and it understands nothing about anything except blocks. This is a storage system that really understands it's sitting underneath a database. So one thing they've done is ditch general-purpose storage and switch to a very application-specific storage system.
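Just to get a feel for the sizes, here's a back-of-the-envelope sketch; the exact field layout and sizes are my assumptions, since the paper only says that pages are large, on the order of 8 kilobytes, while log entries are small.

```go
package main

import "fmt"

// LogEntry is roughly what Aurora ships to its storage servers: which page,
// where on it, old and new bytes. The field sizes here are made up.
type LogEntry struct {
	LSN    uint64  // log sequence number
	PageID uint32  // which data page this record modifies
	Offset uint16  // where on the page
	Old    [8]byte // old value, for undo
	New    [8]byte // new value, for redo
}

const pageSize = 8192 // what a page-shipping design sends for every dirty page

func main() {
	entry := 8 + 4 + 2 + 8 + 8 // ~30 bytes of payload per log entry
	fmt.Printf("log entries: 6 replicas x %d B = %d B\n", entry, 6*entry)
	fmt.Printf("data pages:  4 replicas x %d B = %d B\n", pageSize, 4*pageSize)
}
```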
The other big thing, which we'll also go into in more detail, is that they don't require the writes to be acknowledged by all six replicas before the database server can continue. Instead, the database server can continue as long as a quorum, which turns out to be four, as long as any four of those servers respond. So if one availability zone is offline, or the network connection to it is slow, or some servers just happen to be slow doing something else at the moment we're trying to write, the database server can basically ignore the two slowest, or the two most dead, of the servers when it's doing writes. It only requires acknowledgments from any four out of six, and then it can continue. This quorum scheme is the other big trick that lets them have more replicas in more availability zones and yet not pay a huge performance penalty, because they never have to wait for all of them, just the four fastest of the six. So the rest of the lecture is going to explain first quorums, and then this idea of sending just the log entries. Table 1 summarizes the result: by switching from the architecture that sends big data pages to four places to this Aurora scheme of sending just log entries to six replicas, they get an amazing 35-times performance increase over the other system, by playing these two tricks. The paper's not very good about explaining how much of the performance is due to quorums and how much is due to sending just log entries, but any way you slice it, a 35-times performance improvement is very respectable and, of course, extremely valuable to their customers and to them; I'm sure it's transformative for many of Amazon's customers. All right. The first thing I want to talk about in detail is their quorum arrangement, what they actually mean by quorums. Quorums are all about the arrangement of this fault-tolerant storage, so it's worth thinking a little about what their fault-tolerance goals were. They wanted to be able to do reads and writes even if one availability zone was completely dead: write even with one dead AZ. And they wanted to be able to read even if there was one dead availability zone plus one other dead server. The reason is that an availability zone might be offline for quite a while, maybe because it suffered a flood, and while it's down for a couple of days or a week while people repair the damage, we're reliant on just the servers in the other two availability zones. If one of those should go down, we don't want it to be a disaster. So: write even with one dead availability zone, and furthermore, read, and get correct data, even with one dead availability zone plus one other dead server in the live zones. We have to take it for granted that they know their own business and that this is really the sweet spot for how fault tolerant you want to be. In addition, as I already mentioned, they wanted to be able to ride out temporarily slow replicas.
I think it's clear from a lot of sources that if you read and write EBS, for example, you don't get consistently high performance all the time. Sometimes there are little glitches, because some part of the network is overloaded or something is doing a software upgrade, and it's temporarily slow. So they want to be able to just keep going despite transiently slow or briefly unavailable storage servers. A final requirement is that if a storage server fails, it's a bit of a race against time before the next storage server fails; that's always the case. And the statistics are not as favorable as you might hope, because server failures are often not independent. The fact that one server is down often means there's a much increased probability that another of your servers will soon go down, because it's identical hardware, maybe bought from the same company, off the same production line one after another; a flaw in one of them is extremely likely to be reflected in a flaw in another. So you're always nervous: if there's one failure, boy, there could be a second failure very soon. And in these quorum systems, a little bit like Raft, you can only recover as long as not too many of the replicas fail. So they really needed fast re-replication: if one server seems permanently dead, we'd like to be able to generate a new replica as fast as possible from the remaining replicas. These are the main fault-tolerance goals the paper lays out. By the way, this discussion is only about the storage servers: their failure characteristics, how to deal with their failures, how to recover. It's a completely separate topic what to do if the database server fails; Aurora has a totally different set of machinery for noticing that a database server has failed, creating a new instance, and running a new database server on it, which is not what I'm talking about right now; we'll get to it a little later. Right now we just want to build a storage system that is itself fault tolerant. OK, so they use this idea called quorums, and for a little while I'm going to describe the classic quorum idea, which dates back to the late 70s. This is quorum replication; I'll describe the abstract version, and they use a variant of what I'm going to explain. The idea behind quorum systems is to build fault-tolerant storage using replication, with the guarantee that even if some of the replicas fail, reads will still see the most recent writes. Typically, quorum systems are simple read/write systems, put/get systems; they don't directly support more complex operations. You can read an object, or you can overwrite an entire object. The idea is that you have n replicas. In order to write, you have to make sure your write is acknowledged by w of the n replicas: you have to get each write onto w of the replicas. And if you want to do a read, you have to read information from at least r of the replicas.
The key thing is that w and r have to be set, relative to n, so that any quorum of w servers you manage to write to must necessarily overlap with any quorum of r servers that a future reader might read from. What that means is that r plus w has to be greater than n, so that any w servers must overlap in at least one server with any r servers. So imagine three servers, s1, s2, s3, and say we have just one object that we're updating, and let's set r and w both equal to 2, with n equal to 3. We send out a write; maybe we want to set the value of our object to 23. In order to write, we need to get our new value onto a quorum, at least two of the servers. So maybe we get our write onto the first two; they both now know that the value of our data object is 23. If somebody comes along and reads, a read also requires the reader to check with at least a read quorum of the servers, which is also two in this setup. That read quorum could include a server that didn't see the write, but it has to include at least one that did: a read quorum and a write quorum must overlap in at least one server, so any read must consult some server that saw any previous write. Now, there's still one critical missing piece here. The reader is going to get back r results, possibly r different results, and the question is: how does the reader know which of the r results it got back from the r servers in its quorum is the correct value? Something that doesn't work is voting, just going by the popularity of the different values it gets back, because we're only guaranteed that a reader overlaps with a writer in at least one server, and the overlap might be just one. So the correct value may be represented by only one of the servers the reader consulted. In a system with, say, six replicas, the read quorum might be four: you might get back four answers, and only one of them is the correct answer, from the one server where you overlap with the previous write. So you can't use voting. Instead, these quorum systems need version numbers: every time you do a write, you accompany your new value with an increasing version number, and the reader, getting back a bunch of different values from the read quorum, just uses the one with the highest version number. Each value needs to be tagged with a version number. So in our example, the two servers that saw the write of 23 both hold version number 3, because it came from the same original write, and we're imagining the server that didn't see the write still holds its old value, maybe 20, at version number 2. The reader gets back two values with their version numbers and picks the value with the higher version number.
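Here's a minimal sketch of the classic quorum scheme just described, with n = 3, w = 2, r = 2; toy code of my own, ignoring networking and retries.

```go
package main

import "fmt"

// Versioned is a value tagged with the version number of the write that
// produced it; readers pick the highest version, they don't vote.
type Versioned struct {
	Version int
	Value   string
}

type Replica struct{ data Versioned }

// Write installs v on the first w replicas, simulating a write that reaches
// only a write quorum before the writer stops waiting.
func Write(replicas []*Replica, w int, v Versioned) {
	for i := 0; i < w; i++ {
		replicas[i].data = v
	}
}

// Read consults r replicas and returns the value with the highest version;
// the overlap rule guarantees at least one of them saw the latest write.
func Read(replicas []*Replica, r int) Versioned {
	best := Versioned{Version: -1}
	for _, rep := range replicas[:r] {
		if rep.data.Version > best.Version {
			best = rep.data
		}
	}
	return best
}

func main() {
	// n=3, w=2, r=2, so r+w=4 > n=3 and any read quorum overlaps any write.
	servers := []*Replica{{}, {}, {}}
	servers[2].data = Versioned{Version: 2, Value: "20"} // stale earlier write
	Write(servers, 2, Versioned{Version: 3, Value: "23"})
	fmt.Println(Read(servers[1:], 2)) // {3 23}: s3 missed the write, s2 saw it
}
```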
Furthermore, if you can't actually contact a quorum for a read or a write, you really just have to keep trying until the servers are brought back up or reconnected; those are the rules. The reason this is preferable to something like chain replication is that it can easily ride out temporarily dead, disconnected, or slow servers. The way it would actually work is that if you want to write, you send the newly written value plus its version number to all n of the servers, but only wait for w of them to respond. Similarly, if you want to read, you send the read to all the servers and only wait for r of them to respond. Because you only have to wait for r out of n, or w out of n, you can continue after the fastest r or the fastest w have responded; you don't have to wait for a slow server or a server that's dead. And the machinery for ignoring slow or dead servers is completely implicit: there's nothing here about making decisions about which servers are up or down, or electing leaders, or anything. It just automatically proceeds as long as a quorum is available. So we get very smooth handling of dead or slow servers. In addition, even in this simple scheme, you can adjust r and w to favor either reads or writes. We could say the write quorum is 3, every write has to go to all three servers, and then the read quorum can be 1. If you want to favor reads, that setup, r equals 1 and w equals 3, makes reads much faster, they only have to wait for one server, but in return writes are slow. If you want to favor writes, you could say any reader has to read from all three, but a writer only has to write to one. Then only one server might have the latest value, but the readers consult all three, and they're guaranteed to overlap with that one. Of course, these particular choices make writes not fault tolerant in the first case, and reads not fault tolerant in the second, because all the servers have to be up, so you probably wouldn't do this in real life. You would have, as Aurora does, a larger number of servers and intermediate read and write quorum sizes. Aurora, in order to achieve its goals of writing with one dead availability zone and reading with one dead availability zone plus one other dead server, uses a quorum system with n equals 6, w equals 4, and r equals 3. The w of 4 means it can do a write with one dead availability zone: if one zone can't be contacted, the other four servers are enough to complete a write. And 4 plus 3 equals 7, greater than 6, so overlap is definitely guaranteed. The read quorum of 3 means that even if one availability zone is dead plus one more server, the three remaining servers are enough to serve a read. Now, in that case, with three servers down, the system can do reads, can find the current state of the database, but it can't do writes without further work. So if there were three dead servers, they'd have enough of a quorum to read the data and reconstruct more replicas, but until they've created more replicas to replace the dead ones, they can't service writes.
And, as I explained before, the quorum system also lets them ride out transiently slow replicas. All right. As it happens, the writes in Aurora aren't really overwriting objects as in the classic quorum scheme. In fact, Aurora's writes never overwrite anything: they just append log entries to the current log. So the way it uses quorums is basically to say: when the database sends out a new log record because it's executing some transaction, it needs to make sure that log record is present on at least four of its storage servers before it's allowed to proceed with the transaction or commit it. That's really the meaning of Aurora's write quorum: each new log record has to be appended to the log on at least four of the replicas before the write can be considered complete. And when Aurora gets to the end of a transaction, before it can reply to the client and say your transaction is committed, finished, and durable, it has to wait for acknowledgments from a write quorum for each of the log records that made up that transaction. In fact, because after a crash and recovery you're not allowed to recover one transaction unless preceding transactions are also recovered, in practice, before Aurora can acknowledge a transaction, it has to wait for a write quorum of storage servers to acknowledge all previously committed transactions as well as the transaction of interest; then it can respond to the client. OK, so the storage servers are receiving incoming log records; that's what writes look like to them. So what do they actually do? They're not getting new data pages from the database server, just log records that describe changes to the data pages. Internally, one of these storage servers has copies of all the data pages, each as of some point in that page's evolution. So it has, in its cache or on its disk, a whole bunch of pages: page one, page two, and so forth. When a new write arrives, carrying just a log record, what has to happen someday, but not right away, is that the changes in that log record, the new value, have to be applied to the relevant page. But the storage server doesn't have to do that until someone asks, until the database server or the recovery software asks to see that page. So what happens immediately to a new log record is that it's just appended to a list of log records that affect its page. For every page the storage server stores, if the page has been recently modified by a transaction, what the storage server actually stores is an old version of the page plus the sequence of log records that have come in from the database server since that page was last brought up to date. If nothing else happens, the storage server just keeps these old pages plus lists of log records. If the database server later evicts the page from its cache and then needs to read it again for a future transaction, it sends a read request to one of the storage servers saying: look, I need an up-to-date copy of page one. At that point, the storage server applies the log records to the page, performing the writes of new data described in those log records, and sends the updated page back to the database server. Presumably it can then erase its list and just store the newly updated page, although it's not quite that simple.
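Here's a sketch of what a storage server might keep per page, as I read the paper (this is my reconstruction, not Aurora's actual code): an old version of the page plus the queue of log records received since, applied only when somebody asks to read the page.

```go
package main

import "fmt"

// LogRec is one incoming log record; Apply performs its change on a page.
type LogRec struct {
	LSN   uint64
	Apply func(page []byte)
}

// PageState is what the server keeps per page: an old version of the page
// plus the log records received since it was last brought up to date.
type PageState struct {
	data    []byte
	pending []LogRec
}

type StorageServer struct {
	pages map[uint32]*PageState
}

// AppendLog is the write path: just remember the record; no page update yet.
func (s *StorageServer) AppendLog(pageID uint32, rec LogRec) {
	p := s.pages[pageID]
	p.pending = append(p.pending, rec)
}

// ReadPage is the read path: apply the pending records now, on demand, and
// hand back the up-to-date page.
func (s *StorageServer) ReadPage(pageID uint32) []byte {
	p := s.pages[pageID]
	for _, rec := range p.pending {
		rec.Apply(p.data)
	}
	p.pending = nil
	return p.data
}

func main() {
	s := &StorageServer{pages: map[uint32]*PageState{1: {data: make([]byte, 8192)}}}
	s.AppendLog(1, LogRec{LSN: 79, Apply: func(p []byte) { p[0] = 42 }})
	fmt.Println(s.ReadPage(1)[0]) // 42: the record was applied only when read
}
```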
All right, so the storage servers just store these sequences of log records plus old page versions. Now, the database server, as I mentioned, sometimes needs to read pages. One thing to observe, by the way, is that the database server writes log records but reads data pages; so this also differs from a classic quorum system in that the things being read and the things being written are quite different. In addition, it turns out that in ordinary operation the database server doesn't have to send quorum reads, because it tracks, for each of the six storage servers, how much of the prefix of the log that server has actually received. So the database server keeps track of six numbers. Log entries are numbered, 1, 2, 3, 4, 5, and so on; the database server sends new log entries to all the storage servers, and as a storage server receives them it responds: yes, I got log entry 79, and furthermore I have every log entry before 79 as well. So the database server tracks how far each server has gotten, the highest contiguous log entry number each one holds. That way, when the database server needs to do a read, it just picks a storage server that's up to date and sends the read request for the page it wants to that one server. So the database server does have to do quorum writes, but ordinarily it doesn't have to do quorum reads: it knows which storage servers are up to date and reads one copy of the page from one of them, so reads are much cheaper than they would otherwise be, without the expense of a quorum read.
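A sketch of that bookkeeping; the structure is my guess at the idea, not Aurora's real data structures. The database remembers, per storage server, the highest LSN below which that server has the complete log, and routes an ordinary page read to any single caught-up server.

```go
package main

import "fmt"

// dbServer tracks, for each storage server, the highest LSN through which
// that server holds a complete prefix of the log (learned from its acks).
type dbServer struct {
	highestContiguous [6]uint64
	commitLSN         uint64 // reads must reflect the log up to here
}

// pickReadServer returns the index of some storage server that is known to
// be caught up, so an ordinary read needs just that one replica, no quorum.
func (db *dbServer) pickReadServer() int {
	for i, lsn := range db.highestContiguous {
		if lsn >= db.commitLSN {
			return i
		}
	}
	return -1 // shouldn't happen in normal operation
}

func main() {
	db := &dbServer{
		highestContiguous: [6]uint64{79, 75, 79, 60, 79, 78},
		commitLSN:         79,
	}
	fmt.Println(db.pickReadServer()) // 0: server 0 has everything through 79
}
```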
Now, Aurora does sometimes use quorum reads: during crash recovery of the database server, which is separate from crash recovery of the storage servers. If the database server itself crashes, maybe because the physical hardware under its EC2 instance suffers a failure, there's monitoring infrastructure at Amazon that notices: oh, wait a minute, the Aurora database server we're running for this customer just crashed. Amazon automatically fires up a new EC2 instance, starts the database software on it, and tells it: look, your data is sitting on this particular volume, this set of storage servers; please clean up any partially executed transactions evident in the logs stored on those servers and continue. That's the point at which Aurora uses quorum logic for reads, because the previous database server almost certainly crashed partway through executing some set of transactions. The state of play at the time of the crash was: some transactions had completed and committed, and their log entries are on a write quorum; plus it was in the middle of executing some other transactions, which may also have log entries on a quorum, but which can never be completed because the server crashed midway through them. For those incomplete transactions, we may have a situation where, say, one server has log entry 101, another has 102, and 104 exists somewhere, but, for an as-yet-uncommitted transaction from just before the crash, maybe no server at all got a copy of log entry 103. So after the crash, the new database server, as it recovers, does quorum reads to find the point in the log, the highest log number, up to which every preceding log entry exists somewhere in the storage service. Basically, it finds the number of the first missing log entry, 103 in this example, and says: we're missing an update, so we can't do anything with the log from this point on. It does these quorum reads, finds that 103 is the first missing entry, among the servers it can reach 103 is just not there, and sends a message to all the servers saying: please discard every log entry from 103 onwards. Those discarded entries necessarily include no log entries from committed transactions, because we know a transaction can't commit until all of its entries are on a write quorum, so we'd have been guaranteed to see them; we're only discarding log entries from uncommitted transactions. So we're cutting off the log at entry 102. The log entries we're preserving may still include entries from uncommitted transactions, transactions interrupted by the crash. The database server has to detect those, which it can do by noticing that a transaction has update entries in the log but no commit record. It finds the full set of those uncommitted transactions and issues undo operations, new log entries that undo all the changes those uncommitted transactions made. This is the point at which Aurora needs the old values in the log entries: so that a server doing recovery after a crash can back out of partially completed transactions.
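Here's the gap-finding step as a toy function; again my own illustration of the idea, not Aurora's code.

```go
package main

import "fmt"

// firstGap returns the lowest LSN, starting from start, that no reachable
// server holds. Every entry past the gap will be discarded during recovery.
func firstGap(servers [][]uint64, start uint64) uint64 {
	have := map[uint64]bool{}
	for _, lsns := range servers {
		for _, l := range lsns {
			have[l] = true
		}
	}
	lsn := start
	for have[lsn] {
		lsn++
	}
	return lsn
}

func main() {
	// Log entries held by the three reachable servers (a read quorum);
	// entry 103 was never stored anywhere before the crash.
	quorum := [][]uint64{{100, 101, 102}, {101, 102, 104}, {100, 102}}
	fmt.Println(firstGap(quorum, 100)) // 103: discard 103 and everything after
}
```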
All right, another thing I'd like to talk about is how Aurora deals with big databases. So far I've explained the storage setup as if a database simply had these six replicas of its storage. If that were all there was to it, a database couldn't be bigger than the storage you can attach to a single machine, since each replica is just one computer with a disk or two attached. Having six machines doesn't give us six times as much usable storage, because each machine stores a replica of the same data. With solid-state drives we can put terabytes of storage on a single machine, but we can't put hundreds of terabytes on one machine. So in order to support customers who need more than that, who have vast databases, Amazon splits the database's data across multiple sets of six replicas. The unit of sharding, of splitting up the data, is, I think, 10 gigabytes. So a database that needs 20 gigabytes of data will use two protection groups, these PG things. Half of its data sits on the six servers of protection group one, and the second 10 gigabytes is replicated on another, typically different, set of six storage servers; there could be overlap between the sets, but typically it's just a different set of six, because Amazon runs a huge fleet of these storage servers that are jointly used by all of its Aurora customers. Now we can hold 20 gigabytes of data, and we add more of these groups as the database gets bigger. One interesting piece of fallout from this: while it's clear that you can take the data pages and split them up over multiple independent protection groups, maybe odd-numbered data pages from your B-tree go on PG1 and even-numbered pages go on PG2, it's not immediately obvious what to do with the log. How do you split up the log if you have more than one protection group? The answer Aurora uses is that when the database server sends out a log record, it looks at the data the log record modifies, figures out which protection group stores that data, and sends each log record just to the protection group that stores the data modified in that log entry. So each protection group stores some fraction of the data pages plus all the log records that apply to those pages: a subset of the log relevant to its pages.
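As a concrete illustration, a log record's destination might be computed something like this; the mapping function and the page size are pure assumptions on my part, since the paper doesn't give the exact scheme.

```go
package main

import "fmt"

// pagesPerPG is how many pages fit in one 10 GB protection group, assuming
// 16 KB pages: 10*2^30 / 16*2^10 = 655360. Both numbers are illustrative.
const pagesPerPG = 655360

// protectionGroup maps a page to the group whose six replicas store it; the
// log record for a write to this page is sent only to that group's servers.
func protectionGroup(pageID uint64) uint64 {
	return pageID / pagesPerPG
}

func main() {
	fmt.Println(protectionGroup(12))     // 0: first 10 GB of pages, PG one
	fmt.Println(protectionGroup(700000)) // 1: second 10 GB of pages, PG two
}
```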
A final requirement, maybe I've erased the fault-tolerance requirements by now, is that if one of these storage servers crashes, we want to be able to replace it as soon as possible, because if we wait too long, we risk three or four of them crashing; and if four crash, we actually can't recover, because we no longer have a read quorum. So we need to regain full replication as soon as possible. Think about any one storage server: sure, it's storing 10 gigabytes for my database's protection group, but physically, any one of these servers has maybe a one- or two-terabyte disk that stores 10-gigabyte segments belonging to a hundred or more different Aurora instances. So what's on one physical machine is a terabyte or several terabytes of data in total, and when one of these storage servers crashes, it takes with it not just my 10 gigabytes but 10-gigabyte segments from a hundred other people's databases as well. What has to be re-replicated is not just my 10 gigabytes but the entire terabyte or more stored on that server's drive. If you think through the numbers: maybe we have 10 gigabit per second network interfaces, and moving, say, 10 terabytes across a 10 gigabit per second interface from one machine to another takes on the order of 10,000 seconds. That's way too long to sit and wait. So we don't want a strategy in which we reconstruct a failed server by having a single other machine, one that was replicating everything on it, send the whole 10 terabytes to a single replacement machine; we want to reconstruct the data far faster than that. So the actual setup they use is this: any particular storage server stores many, many segments, replicas of many different 10-gigabyte protection groups. For one segment it stores, say a segment of protection group A, the other replicas are on five other machines; for another segment it stores, say protection group B, the other copies are on a disjoint set of five servers; and so on for all the segments sitting on this storage server's disk, belonging to many different Aurora instances. So when this machine goes down, the replacement strategy is: if it stored, say, 100 segments, we pick 100 different storage servers, each of which picks up one new segment, that is, each of which joins one more protection group. They're probably storing other stuff already, but they have a little free disk space. Then, for each segment, we pick one of the remaining replicas as the source to copy from: maybe A is copied from this server to that one, B from here to there, and for C, with its own five other copies, we pick a different server again. Now we have 100 different 10-gigabyte copies going on in parallel across the network, and assuming we have enough servers that these pairs can all be disjoint, and plenty of bandwidth in the switching network connecting them, we can copy our terabyte or whatever of data with 100-fold parallelism. The whole thing takes 10 seconds or so instead of the 1,000 seconds it would take with just two machines involved. Anyway, that's the strategy they use, and it means they can recover from one machine's death, in parallel, extremely quickly. If lots of machines die it doesn't work as well, but they can re-replicate after single-machine crashes extremely fast.
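Here's a sketch of that spread-out, parallel re-replication; everything here, the picking of sources and destinations and the copy itself, is stubbed out, it's just meant to show the shape of the strategy.

```go
package main

import (
	"fmt"
	"sync"
)

// Segment is one 10 GB slice of some protection group that the dead server
// held; survivors are the other replicas it can be copied from.
type Segment struct {
	name      string
	survivors []string
}

// reReplicate launches one copy per lost segment, each from a surviving
// replica to a freshly chosen destination, all running in parallel.
func reReplicate(segments []Segment, pickDest func(Segment) string) {
	var wg sync.WaitGroup
	for _, seg := range segments {
		src, dst := seg.survivors[0], pickDest(seg) // choose endpoints up front
		wg.Add(1)
		go func(name, src, dst string) {
			defer wg.Done()
			fmt.Printf("copying segment %s: %s -> %s\n", name, src, dst)
		}(seg.name, src, dst)
	}
	wg.Wait()
}

func main() {
	segs := []Segment{
		{"A", []string{"s11", "s12", "s13", "s14", "s15"}},
		{"B", []string{"s21", "s22", "s23", "s24", "s25"}},
	}
	n := 0
	reReplicate(segs, func(Segment) string {
		n++ // a real picker would choose lightly loaded, disjoint servers
		return fmt.Sprintf("fresh%d", n)
	})
}
```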
Now, with this setup, the writes can only go through the one database server, because we can really only support one writer with this storage strategy. I think one place where the rubber really hits the road is that the log entries have to be numbered sequentially, and that's easy to do if all the writes go through a single server and extremely difficult if lots of different servers are all writing in an uncoordinated way to the same database. So the writes really have to go through one database, but we could set up, and indeed Amazon does set up, a situation where we have read-only database replicas that read from these storage servers.

So the full glory of figure three is that in addition to the main database server that handles the write requests, there's also a set of read-only databases, and they say they can support up to 15 of them. So if you're seeing a read-heavy workload, most of it can be hived off to a whole bunch of these read-only databases. When a client sends a read request to a read-only database, the read-only database figures out what data pages it needs to serve that request and sends reads directly to the storage system, without bothering the main read-write database. The read-only replicas send page read requests directly to storage servers, and then they cache those pages so they can respond to future read requests right out of their cache.

Of course, they need to be able to update those caches, and for that reason the main database sends a copy of its log to each of the read-only databases; that's the horizontal lines you see between the blue boxes in figure three. The read-only databases use those log entries to update their cached pages to reflect recent transactions. It does mean the read-only databases lag a little bit behind the main database, but for a lot of read-only workloads that's okay; if you look at a web page and it's 20 milliseconds out of date, that's usually not a big problem.

There are some complexities from this. One problem is that we don't want the read-only databases to see data from transactions that haven't committed yet. So in this stream of log entries, the main database denotes which transactions have committed, and the read-only databases are careful not to apply writes from uncommitted transactions to their caches; they wait until the transactions commit.

The other complexity these read-only replicas impose is that the on-disk structures are quite complex. This might be a B-tree, and it might need to be rebalanced periodically, for example. Rebalancing is a complex operation in which a lot of the tree has to be modified atomically; the tree is incorrect while it's being rebalanced, and you're only allowed to look at it after the rebalancing is done. If the read-only replicas directly read pages out of the storage servers, there's a risk they might see the B-tree in the middle of a rebalancing or some other multi-page operation, when the data is simply not legal, and they might crash or otherwise malfunction.
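Going back to that commit rule for a second, here's a minimal sketch of it on a read replica. The log entry format and all the names here are invented for illustration; a real Aurora replica works in terms of redo records and LSNs, but this shows the basic buffering idea: hold back each transaction's writes until its commit shows up in the log stream.

```go
package main

import "fmt"

// LogEntry is a hypothetical entry as seen by a read-only replica:
// either a page write tagged with its transaction, or a commit marker.
type LogEntry struct {
	Txn    int
	Commit bool // if true, this entry marks Txn as committed
	Page   int
	Data   string
}

// Replica buffers writes per transaction and only applies them to its
// page cache once the log stream says the transaction committed.
type Replica struct {
	cache   map[int]string     // page -> contents visible to readers
	pending map[int][]LogEntry // txn -> buffered, not-yet-visible writes
}

func (r *Replica) Apply(e LogEntry) {
	if e.Commit {
		for _, w := range r.pending[e.Txn] {
			r.cache[w.Page] = w.Data
		}
		delete(r.pending, e.Txn)
		return
	}
	r.pending[e.Txn] = append(r.pending[e.Txn], e)
}

func main() {
	r := &Replica{cache: map[int]string{}, pending: map[int][]LogEntry{}}
	r.Apply(LogEntry{Txn: 1, Page: 7, Data: "v1"})
	fmt.Println(r.cache) // map[] -- txn 1 hasn't committed, the write stays invisible
	r.Apply(LogEntry{Txn: 1, Commit: true})
	fmt.Println(r.cache) // map[7:v1]
}
```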
And when the paper talks about mini-transactions and the VDL versus VCL distinction, what it's talking about is the machinery by which the database server can tell the storage servers: look, this complex sequence of log entries must only be revealed all or nothing, atomically, to any read-only transactions. That's what the mini-transactions and the VDL (the volume durable LSN, as opposed to the VCL, the volume complete LSN) are about. Basically, when a read-only database asks a storage server for a data page, the storage server is careful to show it data from just before one of these mini-transaction sequences of log entries, or from just after, but never from the middle.

All right. That's all the technical stuff I have to talk about, so let me just summarize what's interesting about the paper and what can be learned from it.

One thing to learn, which is good in general and not specific to this paper, is the basics of how transaction processing databases work and how they interact with storage systems, because this comes up a lot: the performance and crash-recovery complexity of running a real database appears over and over again in systems design.

Another thing to learn from this paper is the technique of overlapping read and write quorums in order to always be able to see the latest data while also getting fault tolerance. And of course this comes up in Raft also; Raft has a strong quorum flavor to it.

Another interesting thought from this paper is that the database and the storage system are basically co-designed; there's integration across the database layer and the storage layer. Ordinarily we try to design systems so that they have good separation between consumers of services and the infrastructure services themselves. Typically storage is very general purpose, not aimed at a particular application, because that's a pleasant design and it means lots of different uses can be made of the same infrastructure. But here the performance issues were so extreme that they got a 35-times performance improvement by blurring this boundary. This was a situation in which general-purpose storage was really not advantageous, and abandoning that idea was a big win.

And a final set of things to get out of the paper is all the interesting, sometimes implicit, information about what was valuable to these Amazon engineers, who really know what they're doing, and what concerns they had about cloud infrastructure. The amount of worry they put into the possibility that an entire availability zone might fail is an important tidbit. The fact that transient slowness of individual storage servers mattered is another thing that comes up a lot. And finally, there's the implication that the network is the main bottleneck: after all, they went to extreme lengths to send less data over the network, but in return the storage servers have to do more work, and they're willing to have six copies of the data and six CPUs all redundantly applying the same redo log entries. Apparently CPU was relatively cheap for them, whereas network capacity was extremely important.

All right, that's all I have to say, and see you next week.