All right, today I want to talk a bit more about fault tolerance and replication, and then look into the details of today's paper about VMware FT. The topic is still fault tolerance to provide high availability. That is, even if some computer involved in the service crashes, we'd still like to provide the service. And to the extent we can, we'd like to provide our service even if there are network problems. And the tool we're using is replication, at least for this part of the course. So it's worth asking what kind of failures replication can be expected to deal with, because it's not everything by any means. So maybe the easiest way to characterize the kind of failures we're talking about is fail-stop failures of a single computer. And what I mean by fail-stop, which is a sort of generic term in fault tolerance, is that if something goes wrong with, say, the computer, the computer simply stops executing. It just stops if anything goes wrong. And in particular, it doesn't compute incorrect results. So if somebody kicks the power cable out of your server, that's probably going to generate a fail-stop failure. Similarly, if they unplug your server's network connection, even though the server's still running, which is a little bit funny, it'll be totally cut off from the network, so it looks from the outside like it just stopped. So it's really these failures we can deal with, with replication. This also covers some hardware problems, like maybe if the fan on your server breaks because it only costs $0.50, maybe that'll cause the CPU to overheat, and the CPU will shut itself down cleanly and just stop executing. What's not covered by the kind of replication systems we're talking about is things like bugs in software or design defects in hardware. So basically, not bugs. Because if we take some service, say your MapReduce master, for example, and we replicate it and run it on two computers, if there's a bug in your MapReduce master or my MapReduce master, let's say, replication's not going to help us. We're going to compute the same incorrect result on both of our copies of our MapReduce master and everything will look fine. They'll agree. It'll just happen to be the wrong answer. So we can't defend against bugs in the replicated software. And we can't defend against bugs in whatever scheme we're using to manage the replication. And similarly, as I mentioned before, we can't expect to deal with bugs in the hardware. If the hardware computes incorrectly, that's just the end for us, at least with this kind of technique. Although that said, there are definitely hardware and software bugs that replication might, if we're lucky, be able to cope with. So if there's some unrelated software running in your server and it causes the server to crash, maybe it causes your kernel to panic and reboot, or something that has nothing to do with the service you're replicating, then that kind of failure may well look fail-stop to your service. The kernel will panic, and the backup replica will take over. Similarly, some kinds of hardware errors can be turned into fail-stop errors. For example, if you send a packet over the network and the network corrupts it, just flips a bit in your packet, that will almost certainly be caught by the checksum on the packet. Same thing for a disk block.
If you write some data to disk and read it back a month later, maybe the magnetic surface isn't perfect, and one or a couple of bits are wrong in the block as it's read back. There are actually error-correcting codes that, up to a certain point, will fix errors in disk blocks, so you're turning random hardware errors into either corrected errors, if you're super lucky, or at least detected errors, turning random corruption into a detected fault. The software then knows that something went wrong and can turn it into a fail-stop fault by stopping executing, or take some other remedial action. But in general, we really can only expect to handle fail-stop faults. There are other limits to replication, too. If we have a primary and a backup, or two replicas or whatever, we're really assuming that failures in the two are independent. If they tend to have correlated failures, then replication is not going to help us. So for example, if we're a big outfit and we buy thousands of computers, batches of thousands of identical computers from the same manufacturer, and we run our replicas on computers we bought at the same time from the same place, that's a bit of a risk, maybe. Because presumably, if one of them has a manufacturing defect in it, there's a good chance that the other ones do, too. If one of them is prone to overheating because the manufacturer didn't provide enough airflow, well, they probably all have that problem, and so if one of them overheats and dies, there's a good chance that the other ones will, too. So that's one kind of correlated failure you just have to be careful of. Another one is that if there's an earthquake in the city where our data center is, it's probably going to take out the whole data center. We can have all the replication we like inside that data center, it's not going to help us, because the failure caused by an earthquake, or a city-wide power failure, or the building burning down, is a correlated failure between our replicas if they're all in that building. So if we care about dealing with earthquakes, then we need to put our replicas maybe in different cities, or at least physically separate enough that they have separate power and are unlikely to be affected by the same natural disaster. Okay, but that's all sort of hovering in the background for this discussion, where we're talking about the technology you might use. Another question about replication is whether it's worthwhile. You may ask yourself, gosh, these replication schemes literally use twice or three times as much computer resources. GFS had three copies of every block, so we had to buy three times as much disk space. The paper for today replicates just once, but that means we have twice as many computers and CPUs and RAM; it's all very expensive. Like, is that expense really worth it? And you know, that's not something we can answer technically, right? It's an economic question. It depends on the value of having an available service. You know, if you're running a bank, and the consequences of the computer failing are that you can't serve your customers, you can't generate revenue, and your customers all hate you, then it may well be worth it to blow an extra 10 or 20 thousand bucks on a second computer so you can have a replica.
On the other hand, if you're me and you're running the 6.824 web server, I don't consider it worthwhile to have a hot backup of the 6.824 web server, because the consequences of failure are very low. So whether replication is worthwhile, and how many replicas you ought to have, and how much you're willing to spend on it, is all about how much cost and inconvenience a failure would cause you. This paper, near the beginning, mentions that there are a couple of different approaches to replication, really two. One it calls state transfer, and the other it calls replicated state machine. Most of the schemes we're gonna talk about in this class are replicated state machines, but I'll talk about both anyway. The idea behind state transfer is that if we have replicas of a server, the way you cause them to stay in sync, that is, to be actual replicas so that the backup has everything it needs to take over if the primary fails, is that the primary sends a copy of its entire state, for example the contents of its RAM, to the backup, and the backup just stores the latest state. So it's all there, and the backup can start executing from the last state it got if the primary fails. So this is all about sending the state of the primary, and if today's paper worked as a state transfer system, which it doesn't, then the state we'd be talking about would be the contents of the RAM, the contents of the memory of the primary. So maybe every once in a while the primary would just make a big copy of its memory and send it across the network to the backup. You can imagine that if you wanted to be efficient, maybe you would only send the parts of the memory that have changed since the last time you sent memory to the backup. The replicated state machine approach observes that most services, most computer things you wanna replicate, have internal operation that's deterministic except when external input comes in. Ordinarily, if there are no external influences on a computer, it just executes one instruction after another, and what each instruction does is a deterministic function of what's in the memory and the registers of the computer. It's only when external events intervene that something unexpected may happen, like a packet arrives at some random time and causes the server to start doing something differently. So replicated state machine schemes don't send the state between the replicas; instead they just send those external events. They just send, maybe from a primary to a backup, things like arriving input from the outside world that the backup needs to know about. And the observation is that if you have two computers, and they start from the same state, and they see the same inputs in the same order at the same time, the two computers will continue to be replicas of each other and execute identically, as long as they both see the same inputs at the same time. So state transfer sends something like memory, and replicated state machine sends, from primary to backup, just the operations from clients, the external inputs or external events. And the reason why people tend to favor replicated state machine is that usually operations are smaller than the state. The state of a server, if it's a database server, might be the entire database, maybe gigabytes, whereas the operations are just some client sending please read or write key 27. Operations are usually small, but the state's usually large.
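To make the contrast concrete, here's a minimal sketch in Go, the language of the labs, of what each approach ships between replicas. The types and the tiny key/value service are made up for illustration; they're not from the paper.

    package main

    import "fmt"

    // State transfer: the primary periodically ships its entire state.
    // In VMware FT terms this would be the whole memory image; here it's
    // just a map, but the snapshot still grows with the size of the state.
    type StateSnapshot struct {
        KV map[string]string
    }

    // Replicated state machine: the primary ships only the external inputs,
    // and every replica applies them deterministically in the same order.
    type Op struct {
        Seq int    // position in the input stream
        Key string // e.g. "please write key 27"
        Val string
    }

    // Both replicas run the same apply function, so identical inputs applied
    // in an identical order keep their states identical.
    func apply(state map[string]string, op Op) {
        state[op.Key] = op.Val
    }

    func main() {
        primary := map[string]string{}
        backup := map[string]string{}
        ops := []Op{{1, "27", "x"}, {2, "28", "y"}}
        for _, op := range ops {
            apply(primary, op) // primary executes the client operation
            apply(backup, op)  // backup sees the same op and stays identical
        }
        fmt.Println(primary, backup)
    }

The point of the sketch is just the asymmetry: the snapshot grows with the size of the state, while each Op stays small no matter how big the database gets.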
So replicated state machine usually looks attractive. The slight downside is that those schemes tend to be quite a bit more complicated and rely on more assumptions about how the computers operate, whereas state transfer is really heavy-handed, I'm just gonna send you my whole state, and there's much less to worry about. Any questions about these strategies? Yes. Well, okay, so the question is, suppose something went wrong with our scheme and the backup was not actually identical to the primary. So, you know, say you're running a GFS master and it's the primary and it's just handed out a lease to chunk server one, but because we've allowed the states of the primary and the backup to drift out of sync, the backup did not issue a lease to anybody; it wasn't even aware anybody had asked for a lease. So now the primary thinks chunk server one has a lease for some chunk, and the backup doesn't. The primary fails, the backup takes over. Now chunk server one thinks it has a lease for some chunk, but the current master doesn't know that and is happy to hand out a lease on that chunk to some other chunk server, and now we have two chunk servers holding the same lease. So that's just a close-to-home example, but really you can construct almost any bad scenario by just imagining some service that computes the wrong answer because the states diverged. Yes. So you're asking about randomization. Yeah, I'll talk about this a bit later on, but it is a good question. The replicated state machine scheme definitely makes the most sense when the instructions that the primary and the backup are executing do the same thing as long as there's no external event, and that's essentially true for an add instruction or something. You know, if the registers and memory are the same and they both execute an add instruction, the add instruction is gonna have the same inputs and the same outputs. But there are some instructions, as you point out, that don't behave that way. Maybe there's an instruction that gets the current time of day, which will probably be executed at slightly different times on the two machines, or an instruction that gets the current processor's unique ID or serial number; those are gonna yield different answers. And the uniform answer to questions that sound like this is that the primary does it and sends the answer to the backup, and the backup does not execute that instruction, but instead, at the point where it would execute that instruction, it listens for the primary to tell it what the right answer was and just sort of fakes that answer to the software. I'll talk about how the VMware scheme does that. Okay, interestingly enough, even though today's paper is all about a replicated state machine, you may have noticed that it only deals with uniprocessors, and it's not that clear how it could be extended to a multi-core machine, where the interleavings of the instructions from the two cores are non-deterministic. So on a multi-core machine we no longer have this situation where, if we just let the primary and backup execute, all else being equal they're gonna be the same, because they won't be if they're executing on multiple cores. VMware has since come out with a new, possibly completely different, replication system that does work on multi-core, and the new system appears to me to be using state transfer instead of replicated state machine, because state transfer is more robust in the face of multi-core and parallelism.
If you freeze the machine and send the memory over, that memory image just is the state of the machine, and it doesn't matter that there was parallelism, whereas the replicated state machine scheme really has a problem with parallelism. On the other hand, I'm guessing that this new multi-core scheme is more expensive. Okay, all right. So if we wanna build a replicated state machine scheme, we've got a number of questions to answer. We need to decide at what level we're gonna replicate state, right? What state, what do we mean by state? We have to worry about how closely synchronized the primary and backup have to be, right? Because it's likely the primary will execute a little bit ahead of the backup; after all, it's the primary that sees the inputs, so the backup almost necessarily must lag. That means there's an opportunity, if the primary fails, for the backup not to be fully caught up. Having the backup actually execute in lockstep with the primary is very expensive, because it requires a lot of chit-chat. So a lot of what people sweat about in these designs is how close the synchronization is. If the primary fails, or actually if the backup fails too, but it's more exciting if the primary fails, there has to be some scheme for switching over, and the clients have to know, oh gosh, instead of talking to the old primary on server one, I should now be talking to the backup on server two. So all the clients have to somehow figure this out. For the switchover, it's almost impossible, maybe impossible, to design a cutover system in which no anomalies are ever visible. In the ideal world, if the primary fails, we'd like nobody to ever notice, none of the clients to notice. Turns out that's basically unattainable. So there are gonna be anomalies during the cutover, and we've got to figure out a way to cope with them. And finally, if one of our replicas fails, we really need to create a new replica, right? If we have two replicas and one fails, we're just living on borrowed time, because the second replica may fail at some point. So we absolutely need to get a new replica back online as fast as possible. And that can be very expensive. The state is big; the reason we liked replicated state machine was because we thought state transfer would be expensive, but the two replicas in a replicated state machine scheme still need to have the full state, right? We just had a cheap way of keeping them both in sync. If we need to create a new replica, we actually have no choice but state transfer: the new replica needs to have a complete copy of the state. So it's gonna be expensive to create new replicas, and people spend a lot of time worrying about all these questions, and we'll see them again as we look at other replicated state machine schemes. So, on the topic of what state to replicate, today's paper has a very interesting answer to this question. It replicates the full state of the machine, that is, all of memory and all the machine registers. It's a very, very detailed replication scheme: no difference, even at the lowest levels, between the primary and the backup. That's quite rare for replication schemes. Almost always you see something that's more like GFS. GFS absolutely did not replicate at this level; it had replication, but it wasn't replicating every single bit of memory between the primaries and the backups.
It was replicating a much more application-level table of chunks, right? It had this abstraction of chunks and chunk identifiers, and that's what it was replicating. It wasn't replicating everything else; it wasn't going to the expense of replicating every single other thing the machines were doing. That was okay as long as they had the same application-visible set of chunks. So most replication schemes out there go the GFS route. In fact, almost everything except pretty much this paper and a handful of similar systems uses application-level replication at some level, because it can be much more efficient. With application-level replication we don't have to go to the trouble of, for example, making sure that interrupts occur at exactly the same point in the execution of the primary and backup. GFS does not sweat that at all, but this paper has to, because it replicates at such a low level. So most people build efficient systems with application-specific replication. The consequence of that, though, is that the replication has to be built into the application, right? If you're getting a feed of application-level operations, for example, you really need to have the application participate in that, because some generic replication thing like today's paper can't understand the semantics of what needs to be replicated. So anyway, most things are application-specific, like GFS and every other paper we're gonna read on this topic. Today's paper is unique in that it replicates at the level of the machine, and therefore does not care what software you run on it. It replicates the low-level memory and machine registers. You can run any software you like on it, and as long as it runs on the kind of microprocessor that's being replicated, the software can be anything. The downside is that it's not necessarily that efficient. The upside is that you can take any existing piece of software, maybe you don't even have source code for it or understand how it works, and, within some limits, you can just run it under VMware's replication scheme and it'll just work. That's a sort of magic fault-tolerance wand for arbitrary software. Now let me talk about how VMware FT works. First of all, VMware is a virtual machine company; a lot of their business is selling virtual machine technology. And what virtual machines refers to is the idea that you buy a single computer and, instead of booting an operating system like Linux on the hardware, you boot what we'll call a virtual machine monitor, or hypervisor, on the hardware. And the hypervisor's job is to simulate multiple computers, multiple virtual computers, on this one piece of hardware. So the virtual machine monitor may boot up one instance of Linux, or maybe multiple instances of Linux, maybe a Windows machine; the virtual machine monitor on this one computer can run a bunch of different operating systems. Each of these is itself some sort of operating system kernel and then applications. So this is the technology they're starting with. And the reason for this is that it just turns out there are many, many reasons why it's very convenient to interpose this level of indirection between the hardware and the operating systems. It means that we can buy one computer and run lots of different operating systems on it.
If we run lots and lots of little services, instead of having lots and lots of computers, one per service, we can just buy one computer and run each service in the operating system that it needs, using these virtual machines. So this was their starting point. They already had this stuff, and a lot of sophisticated things built around it, at the start of designing VMware FT. So this is just virtual machines. What the paper's doing is that it's gonna set up two physical machines; there's no point in running the primary and backup software in different virtual machines on the same physical machine, because we're trying to guard against hardware failures. So you're gonna have at least two machines running their virtual machine monitors, and the primary is gonna run on one and the backup on the other. On one of these machines we have a guest; the machine might be running a lot of virtual machines, but we only care about one of them. It's gonna be running some guest operating system and some sort of server application, maybe a database server or a MapReduce master or something. So we'll call this the primary, and there'll be a second machine that runs the same virtual machine monitor and an identical virtual machine holding the backup. So we have the same operating system, whatever it is, exactly the same, and the virtual machine monitor gives each of these guest operating systems, the primary and the backup, a range of memory, and the memory images will be identical, or the goal is to make them identical, in the primary and the backup. All right, we have two physical machines, each one of them running a virtual machine guest with its own copy of the service we care about. We're assuming that there's a network connecting these two machines, I'll call it a local area network, and in addition, on this network there's some set of clients. Really they don't have to be clients, they're just maybe other computers that our replicated service needs to talk with; some of them are clients that are sending requests. It turns out in this paper their replicated service actually doesn't use a local disk, and instead assumes that there's some sort of disk server that it talks to, and although it's a little bit hard to realize this from the paper, the scheme actually does not treat the disk server particularly specially; it's just another external source of packets and a place that the replicated state machine may send packets to, not very much different from the clients. Okay, so the basic scheme is that we assume these two replicas, the two virtual machines, primary and backup, are exact replicas. Some client, a database client, who knows what, some client of our replicated server, sends a request to the primary. That really takes the form of a network packet, that's what we're talking about. That generates an interrupt, and this interrupt actually goes to the virtual machine monitor, at least in the first instance. The virtual machine monitor sees, aha, here's input for this replicated service, and so the virtual machine monitor does two things. One is it simulates a network packet arrival interrupt into the primary guest operating system, to deliver the packet to the primary copy of the application. And in addition, the virtual machine monitor knows that this is an input to a replicated virtual machine, and so it sends a copy of that packet back out on the network to the backup's virtual machine monitor.
The backup's virtual machine monitor knows, aha, this is a packet for this particular replicated state machine, and it also fakes a network packet arrival interrupt at the backup and delivers the packet. So now both the primary and the backup have a copy of this packet, they both have the same input, and, modulo a lot of details, they're gonna process it in the same way and stay synchronized. Of course, the service is probably gonna reply to the client. On the primary, the service will generate a reply packet and send it on the NIC that the virtual machine monitor is emulating, and then the virtual machine monitor will see that output packet on the primary and will actually send the reply back out on the network to the client. Because the backup is running exactly the same sequence of instructions, it also generates a reply packet back to the client and sends that reply packet on its emulated NIC. It's the virtual machine monitor that's emulating that network interface card, and it says, aha, I know this is the backup, only the primary is allowed to generate output, and the virtual machine monitor drops the reply packet. So both of them see inputs, and only the primary generates outputs. As far as terminology goes, the paper calls this stream of input events, and the other events we'll talk about, the logging channel. It all goes over the same network, presumably, but these events the primary sends the backup are called log entries on the logging channel. Where the fault tolerance comes in is: suppose the primary crashes. What the backup is gonna see is that it stops getting log entries on the logging channel. And it turns out that the backup can expect to get many per second, because one of the things that generates log entries is the periodic timer interrupts in the primary; each one of those interrupts generates a log entry that's sent to the backup. These timer interrupts are gonna happen like a hundred times a second, so the backup can certainly expect to see a lot of chit-chat on the logging channel if the primary's up. If the primary crashes, then the virtual machine monitor over here will say, gosh, I haven't received anything on the logging channel for like a second, or however long; the primary must be dead or something. In that case, when the backup stops seeing log entries from the primary, the way the paper phrases it is that the backup goes live. And what that means is that it stops waiting for these input events on the logging channel from the primary, and instead this virtual machine monitor just lets the backup execute freely, without being driven by input events from the primary. The VMM does something to the network to cause future client requests to go to the backup instead of the primary, and the VMM here stops discarding the output of the backup, which of course is now the primary, not the backup. So now this machine directly gets the inputs and is allowed to produce output, and our backup has taken over. And similarly, this is less interesting but has to work correctly: if the backup fails, the primary has to use a similar process to abandon the backup, stop sending it events, and just act much more like a single, non-replicated server. So either one of them can go live if the other one appears to be dead, that is, stops generating network traffic. Yes? Magic.
Now, it depends on what the networking technology is. One possibility is that this is sitting on Ethernet. Every physical computer on the Ethernet, or really every NIC, has a 48-bit unique ID. I'm making this up now: it could be that, in fact, instead of each physical computer having a unique ID, each virtual machine does. And when the backup takes over, it essentially claims the primary's Ethernet ID as its own. It starts saying, you know, I'm the owner of that ID, and then other people on the Ethernet will start sending it the packets. That's my interpretation of what they're up to. The designers believe they had identified all such sources of non-determinism. And for each one of them, the primary does whatever it is, executes the random number generator instruction, or takes an interrupt at some time, and the backup does not. The backup's virtual machine monitor detects any such instruction and intercepts it and doesn't execute it; instead, the backup waits for an event on the logging channel saying, at this instruction number, the random number was whatever it was on the primary. Yes? Yeah, the paper hints that they got Intel to add features to the microprocessor to support exactly this, but they don't say what they were. OK, so on that topic: so far the story has sort of assumed that as long as the backup just sees the packets from the clients, it'll execute identically to the primary. That's actually glossing over some huge and important details. So one problem is that, as a couple of people have mentioned, there are some things that are nondeterministic. It's not the case that every single thing that happens in the computer is a deterministic function of the contents of the memory of the computer. It often is for straight-line code execution, but certainly not always. So what we're worried about is things that may happen that are not a strict function of the current state, that is, that might be different on the primary and backup if we're not careful. These are the nondeterministic events that may happen, and the designers had to sit down and figure out what they all were. Here's the kind of stuff they talk about. One is inputs from external sources like clients, which arrive just whenever they arrive. They're not predictable; there's no sense in which the time at which a client request arrives, or its content, is a deterministic function of the service's state, because it's not. This system is really dedicated to a world in which services only talk over the network, and so basically the only form of input or output supported by this system seems to be network packets coming and going. So when input arrives, what that really means is that a packet arrives. And what a packet really consists of, for us, is the data in the packet plus the interrupt that signaled that the packet had arrived. That's quite important. When a packet arrives, ordinarily the NIC DMAs the packet contents into memory and then raises an interrupt, which the operating system fields. And the interrupt happens at some point in the instruction stream. Both of those have to look identical on the primary and backup, or else their execution is going to diverge. So the real issue is when the interrupt occurs, exactly at which instruction the interrupt happens to occur; that better be the same on the primary and the backup.
Otherwise, their execution is different and their states are going to diverge. So we care about the content of the packet and the timing of the interrupt. And then, as a couple of people have mentioned, there are a few instructions that behave differently on different computers, or differently depending on something like the time: maybe a random number generator instruction, get-time-of-day instructions that will yield different answers if called at different times, and unique ID instructions. Another huge source of nondeterminism, which the paper basically rules out, is multi-core parallelism. This is a uniprocessor-only system; there's no multi-core in this world. The reason for this is that if it allowed multi-core, then the service would be running on multiple cores, and the instructions of the service that are executing on the different cores are interleaved in some way which is not predictable. And so really, if we run the same parallel code on the primary and the backup on multi-core hardware, the two machines will interleave the instructions on the two cores in different ways, and that can just cause different results. Because suppose the code on the two cores both asks for a lock on some data. Well, on the primary, core one may get the lock before core two; on the backup, just because of a tiny timing difference, core two may get the lock first. And the execution results are likely to be totally different if different threads get the lock. So multi-core is a grim source of non-determinism, and it's just totally outlawed in this paper's world. And indeed, as far as I can tell, the techniques are not really applicable to multi-core. The service can't use multi-core parallelism. The hardware is almost certainly multi-core, but that's the hardware sitting underneath the virtual machine monitor; the machine that the virtual machine monitor exposes to the guest operating systems that run the primary and backup, that emulated virtual machine, is a uniprocessor machine in this paper. And I'm guessing there's not an easy way for them to adapt this design to multi-core virtual machines. Okay, so it's these events that go over the logging channel. As for the format of a log record, a log entry, the paper doesn't quite say, but I'm guessing that there are really three things in a log entry. There's the instruction number at which the event occurred, because if you're delivering an interrupt or input or whatever, it better be delivered at exactly the same place in the primary and backup, so we need to know the instruction number. And by instruction number, I mean the number of instructions executed since the machine booted, not the instruction address; like, oh, we're executing the four-billion-and-79th instruction since boot. So a log entry is gonna have an instruction number. For an interrupt or input, it's gonna be the instruction at which the interrupt was delivered on the primary, and for a weird instruction like get-time-of-day, it's gonna be the instruction number of that get-time-of-day or whatever instruction as executed on the primary, so the backup knows where to cause this event to occur. Then there's gonna be a type: network input, a weird instruction, whatever. And then there's gonna be data: for a packet arrival, it's gonna be the packet data, and for one of these weird instructions, it's gonna be the result of the instruction when it was executed on the primary.
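The paper doesn't spell out a wire format, but a plausible sketch of such a log entry, with field names I've made up, might look like this in Go:

    // Hypothetical log entry sent from the primary's VMM to the backup's VMM
    // on the logging channel; a guess at the contents, not VMware's format.
    type LogEntry struct {
        InstrNum uint64 // instructions executed since boot when the event occurred
        Kind     int    // e.g. network input, timer interrupt, weird instruction
        Data     []byte // packet contents, or the result the instruction produced on the primary
    }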
The point of sending the result is so that the backup's virtual machine monitor can fake the instruction and supply that same result. Okay, so as an example: the guest operating system assumes, requires, that the hardware, in this case the emulated hardware, the virtual machine, has a timer that ticks, say, 100 times a second and causes interrupts to the operating system. That's how the operating system keeps track of time, by counting these timer interrupts. And the timer interrupts, by the way, have to happen at exactly the same place in the primary and backup, otherwise they don't execute the same and they'll diverge. So the way that plays out here is that there's a timer on the physical machine that's running the FT virtual machine monitor. The timer on the physical machine ticks and delivers a timer interrupt up to the virtual machine monitor on the primary. The virtual machine monitor, at the appropriate moment, stops the execution of the primary, writes down the instruction number it was at, the number of instructions since boot, and then fakes, emulates, an interrupt into the guest operating system on the primary at that instruction number, saying, oh, your emulated timer hardware just ticked, here's the interrupt. And then the primary's virtual machine monitor sends that instruction number, at which the interrupt happened, to the backup. The backup's virtual machine monitor, of course, is also taking timer interrupts from its own physical timer, but it's not giving those real physical timer interrupts to the backup operating system; it's just ignoring them. When the log entry for the primary's timer interrupt arrives here, the backup's virtual machine monitor will arrange with the CPU, and this requires special CPU support, to cause the physical machine to interrupt at the same instruction number at which the timer interrupt happened on the primary. At that point, the virtual machine monitor gets control again from the guest and then fakes the timer interrupt into the backup operating system, now at exactly the same instruction number as it occurred on the primary. Well, yeah, so the observation is that this relies on the CPU having some special hardware in it, where the VMM can tell the hardware, the CPU, please interrupt 1,000 instructions from now, so that it'll interrupt at the right instruction number, the same instruction as on the primary. The VMM just tells the CPU to resume executing the backup, and exactly 1,000 instructions later, the CPU will force an interrupt into the virtual machine monitor. That's special hardware, but it turns out it's on all Intel chips, so it's not that special anymore; 15 years ago it was exotic, now it's totally normal. And it turns out there are a lot of other uses for it. Like if you wanna do CPU time profiling, one way to do that is to have the microprocessor interrupt every 1,000 instructions, and this is the same hardware that would cause the microprocessor to generate an interrupt every 1,000 instructions. So it's a very natural sort of gadget to want in your CPU. Yes. Say it again. Okay, so the question is, what if the backup gets ahead of the primary? So, you know, we're standing above this and we know that the primary is about to take an interrupt at the 1,000,000th instruction.
But the backup has already, you know, executed the 1,000,001st instruction. If we let that happen, it's gonna be too late to deliver the interrupt at the same point in the primary's instruction stream and the backup's instruction stream. So we cannot let that happen; we cannot let the backup get ahead of the primary in execution. And the way VMware FT does that is that the backup's virtual machine monitor actually keeps a buffer of waiting events that have arrived from the primary, and it will not let the backup execute unless there's at least one event in that buffer. And if there's an event in that buffer, then it knows from the instruction number the place at which it's gotta force the backup to stop executing. So always, always, the backup is executing with the CPU being told exactly where the next stopping point, the next instruction number of a stopping point, is, because the backup only executes if it has an event that tells it where to stop next. That means it starts up after the primary, because the backup can't even start executing until the primary has generated the first event and that event has arrived at the backup. So the backup is basically always at least one event behind the primary, and if it's slower for whatever other reason, maybe there's other stuff running on that physical machine, then the backup might get multiple events behind the primary.
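Here's a rough sketch of that rule, reusing the hypothetical LogEntry type from the earlier sketch; runGuestUntil and deliverEvent are stand-ins for VMM internals the paper doesn't describe.

    // The backup only executes when it holds at least one buffered event, and
    // it only runs the guest up to (never past) that event's instruction
    // number, so the backup can never get ahead of the primary.
    func backupLoop(events <-chan LogEntry) {
        for e := range events {
            runGuestUntil(e.InstrNum) // CPU support traps back to the VMM exactly here
            deliverEvent(e)           // fake the interrupt or supply the instruction's result
        }
    }

    // Stand-ins for real VMM machinery (assumptions, not VMware's actual API).
    func runGuestUntil(instr uint64) { /* resume the guest; trap at instruction #instr */ }
    func deliverEvent(e LogEntry)    { /* inject the interrupt / fake the result */ }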
All right, there's one little piece of mess about the specific case of arriving packets. Ordinarily, when a packet arrives from a network interface card, if we weren't running a virtual machine, the network interface card would DMA the packet contents into the memory of the computer it's attached to, sort of as the data arrives from the network. And, you should never write software like this, but it could be that the operating system running on the computer might actually see the data of a packet as it's DMAed or copied from the network interface card into memory. We don't know what operating system is running; this system is designed so that it can support any operating system, and gosh, maybe there is an operating system that watches arriving packets in memory as they're copied in. We can't let that happen, because if the primary happens to be playing that trick, and we allowed the network interface card to directly DMA packets into the memory of the primary, we don't have any control over the exact timing of when the network interface card copies data into memory, and so we're not gonna know at what times the primary did or didn't observe data from the arriving packet. So what that means is that, in fact, the NIC copies incoming packets into private memory of the virtual machine monitor, and then the network interface card interrupts the virtual machine monitor and says, oh, a packet has arrived. At that point, the virtual machine monitor will suspend the primary and remember what instruction number it suspended it at, copy the entire packet into the primary's memory while the primary is suspended and not looking at this copy, and then emulate a network interface card interrupt into the primary. And then it sends the packet and the instruction number to the backup. The backup's virtual machine monitor will also suspend the backup at that instruction number, copy the entire packet in, again while the backup is guaranteed not to be watching the data arrive, and then fake an interrupt at the same instruction number as on the primary. And this is the bounce buffer mechanism explained in the paper. Okay, yeah, the only instructions that result in logging channel traffic are the weird instructions, which are rare. That is, instructions that might yield a different result if executed on the primary and backup, like an instruction to get the current time of day, or the current processor number, or to ask how many instructions have been executed; those actually turn out to be relatively rare. There's also one on some machines to ask for a hardware-generated random number for cryptography or something. But those are not everyday instructions; most instructions are like add instructions, so they're gonna get the same result on primary and backup. Well, in the end aren't client requests just interpreted as a stream of network packets, so the way those get replicated on the backup is just by forwarding those network packets? That's exactly right. Each network packet just gets packaged up and forwarded as-is as a network packet and is interpreted by the TCP/IP stack on both. So I'm expecting 99.99% of the logging channel traffic to be incoming packets and only a tiny fraction to be results from special non-deterministic instructions. And so we can kind of guess what the traffic load is likely to be for a server that serves clients: basically it's a copy of every client packet. And then we'll sort of know how fast the logging channel has to be. All right, so it's worth talking a little bit about how output works. And in this system, output basically means only sending packets: clients send requests in as network packets, the response goes back out as network packets, and there's really no other form of output. As I mentioned, both primary and backup compute the output packet they wanna send and ask their emulated NICs to send it; the packet is really sent on the primary, and the output packet is simply discarded on the backup. Okay, but it turns out it's a little more complicated than that. So suppose what we're running is some sort of simple database server, and the client operation that our database server supports is increment: the client sends an increment request, the database server increments the value and sends back the new value. Let's say everything's fine so far, and the primary and backup both have the value 10 in memory, and that's the current value of the counter. Some client on the local area network sends an increment request to the primary. That packet is delivered to the primary, it's executed, and the server software on the primary says, oh, the current value is 10, I'm gonna change it to 11, and sends a response packet back to the client saying 11 as the reply. The same request, as I mentioned, is also supposed to be sent to the backup and processed there; it's gonna change its 10 to 11 also and generate a reply, which gets thrown away. So that's what's supposed to happen with output. However, you also need to ask yourself, what happens if there's a failure at an awkward time? In this class, you should always ask yourself, what's the most awkward time to have a failure, and what would happen if a failure occurred then?
So suppose the primary does indeed generate the reply here back to the client, but the primary crashes just after sending the reply to the client. And furthermore, and much worse, it turns out that this is just a network, it doesn't guarantee to deliver packets; let's suppose the log entry on the logging channel also got dropped when the primary died. So now the state of play is: the client received a reply saying 11, but the backup did not get the client request, so its state is still 10. Now the backup takes over, because it sees the primary is dead, and this client, or maybe some other client, sends an increment request to it, and now it's really processing these requests. So when it gets the next increment request, it's gonna change its state to 11 and generate a second 11 response, maybe to the same client, maybe to a different client, which, if the clients compare notes, or if it's the same client, just obviously cannot have happened, right? And, you know, because we have to support unmodified software that does not understand that there's any funny business of replication going on, we don't have the option of changing the client to realize something funny had happened with the fault tolerance and do I don't know what; this whole system really only makes sense if we're running unmodified software. So this is a disaster; we can't let this happen. Does anybody remember from the paper how they prevent this from happening? The output rule, yeah. The output rule is their solution to this problem. And the idea is that the primary is not allowed to generate any output, and what we're talking about now is this output here, until the backup acknowledges that it has received all log records up to this point. So let's now un-crash the primary and go back to both of them starting at 10. The real sequence, with the output rule, is that the input arrives, and at the time the input arrives, the virtual machine monitor sends a copy of the input to the backup. So the time at which this log message with the input is sent is strictly before the primary generates the output; that's sort of obvious. Then, after firing this log entry off across the network, so now it's heading towards the backup but might or might not get lost, the virtual machine monitor delivers the request to the primary's service software, which generates the output. So now the primary has actually changed its state to 11 and generated an output packet that says 11, but the virtual machine monitor says, oh, wait a minute, we're not allowed to release that output until all previous log records have been acknowledged by the backup. And this input is the most recent previous log message. So the output is held by the primary's virtual machine monitor until the log entry containing the input packet from the client has been delivered to the backup's virtual machine monitor and buffered there; it's not necessarily executed yet, it may just be waiting for the backup to get to that point in the instruction stream. Then the backup's virtual machine monitor will send an acknowledgment packet back saying, yes, I did get that input. And when the acknowledgement comes back, only then will the primary's virtual machine monitor release the packet out onto the network.
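Putting that together, here's a rough sketch of the output rule at the primary's virtual machine monitor, continuing the same hypothetical sketch; sendLogEntry, waitForAck, and the other helpers are names I made up, not VMware's interfaces.

    // Output rule, roughly: the primary's VMM may not release an output packet
    // until the backup has acknowledged every log entry generated before it.
    func handleClientRequest(input []byte) {
        seq := sendLogEntry(input)            // forward the input to the backup first
        reply := deliverToPrimaryGuest(input) // the primary executes and produces its reply
        waitForAck(seq)                       // hold the output; the guest itself need not stall
        releaseToNetwork(reply)               // only now may the client see the reply
    }

    // Hypothetical helpers.
    func sendLogEntry(input []byte) int             { return 0 }
    func deliverToPrimaryGuest(input []byte) []byte { return nil }
    func waitForAck(seq int)                        {}
    func releaseToNetwork(packet []byte)            {}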
And so the idea is that if the client could have seen the reply, then necessarily the backup must have seen the request and at least buffered it. And so we no longer get this weird situation in which a client can see a reply, but then there's a failure and a cutover and the replica didn't know anything about that reply. There's also the situation where maybe this log entry was lost and then the primary crashes. Well, since the log entry hadn't been delivered, the backup hadn't sent the acknowledgment, and that means that if the log entry was dropped and the primary crashed, it must have crashed before the virtual machine monitor released the output packet, and therefore the client couldn't have gotten the reply, and so it was not in a position to spot any irregularities. Everybody happy with the output rule? Yes. The virtual machine monitor maintains its own data structures and so on; I'm wondering how low or high level the code in there is, and what language it's usually implemented in, is it written in C? No, I don't know; the paper doesn't mention how the virtual machine monitor is implemented. I mean, it's pretty low-level stuff, because it's sitting there allocating memory and configuring page tables and talking to device drivers and intercepting instructions and understanding what instructions the guest was executing. So we're talking about low-level stuff. What language it's written in, traditionally C or C++, but I don't actually know. Okay, so the primary has to delay at this point, waiting for the backup to say that it's up to date. This is a real performance thorn in the side of just about every replication scheme, this sort of synchronous wait. We can't let the primary get too far ahead of the backup, because if the primary failed while it was ahead, that would leave the backup lagging behind what clients have seen, right? So just about every replication system has this problem, that at some point the primary has to stall waiting for the backup, and it's a real limit on performance. Even if the machines are side by side in adjacent racks, we're still talking about a half a millisecond or something to send messages back and forth, during which the primary is stalled. And if we wanna withstand earthquakes or city-wide power failures, the primary and the backup have to be in different cities, and that's probably five milliseconds apart. So if we replicate with the two replicas in different cities, every packet the primary produces as output has to first wait the five milliseconds or whatever for the last log entry to get to the backup and the acknowledgement to come back, and only then can we release the packet. For low-intensity services, that's not a problem. But if we're building a database server that, if it weren't for this, could process millions of requests per second, then that's just unbelievably damaging for performance. And this is a big reason why people, if they possibly can, use a replication scheme that operates at a higher level and understands the semantics of the operations, so it doesn't have to stall on every packet. You know, it could stall on every high-level operation instead, or even notice that, well, read-only operations don't have to stall at all.
It's only the writes that have to stall, or something. But it has to be an application-level replication scheme to realize that. Yeah, you're absolutely right. So the observation is that you don't have to stall the execution of the primary, you only have to hold the output. And so yeah, maybe that's not as bad as it could be. But nevertheless, in a service that could otherwise have responded to the client in a couple of microseconds, if we have to first update the replica in the next city, we turn a 10 microsecond interaction into possibly a 10 millisecond interaction. If you have vast numbers of clients submitting concurrent requests, then you may be able to maintain high throughput even with high latency, but you have to be lucky, or a very clever designer, to get that. The question is, what if the primary emitted an output and then died before the backup had consumed all the messages in the log, couldn't we keep a log of that output somewhere? That's a great idea. But if you log it in the memory of the primary, that log will disappear when the primary crashes. That's the usual semantics of a server failing: you lose everything inside the box, like the contents of memory. And even if you didn't, if the failure is that somebody accidentally unplugged the power cable from the primary, even if the primary has battery-backed RAM or I don't know what, you can't get at it; the backup can't get at it. So in fact this system does log the output, and the place it logs it is in the memory of the backup. And in order to reliably log it there, you have to observe the output rule and wait for the acknowledgement. So it's an entirely correct idea; you just can't use the primary's memory for it. Yes. Say that again. That's a clever idea. So the question is, maybe input should go to the primary but output should come from the backup. I completely haven't thought this through. That might work, but I don't know. That's interesting. Yeah, maybe that would work. Okay. One possibility this does expose, though, is this situation: maybe the primary crashes just after its output is released, so the client does receive the reply, but the backup's copy of the input is still sitting in the event buffer in the virtual machine monitor of the backup; it hasn't been delivered to the actual replicated service yet. When the backup goes live after the crash of the primary, the backup first has to consume all of the log records that are lying around that it hasn't consumed yet, because it has to catch up to the primary; otherwise it won't take over with the same state. So before the backup can go live, it actually has to consume all these entries. The last entry, presumably, is the request from the client. So the backup will be live after the interrupt that delivers the request from the client, and that means that the backup will increment its counter to 11 and then generate an output packet. And since it's live at this point, it will actually send the output packet, and the client will get two 11 replies, which, if that really happened, would also be anomalous, possibly not something that could happen if there was only one server. The good news is that almost certainly the client is talking to this service using TCP, and the request and the response go back and forth on a TCP connection. When the backup takes over, since its state is identical to the primary's, it knows all about that TCP connection and what all the sequence numbers are and whatnot.
And when it generates this packet, it will generate it with the same TCP sequence number as the original packet, and the TCP stack on the client will say, oh wait a minute, that's a duplicate packet, and will discard the duplicate at the TCP level, and the user-level software will just never see this duplicate. You can view this as a kind of accidental or clever trick, but the fact is, for any replication system where cutover can happen, which is to say pretty much any replication system, it's essentially impossible to design it in a way that it's guaranteed not to generate duplicate output. Basically, you can err on either side: you can either not generate the output at all, which would be terrible, or you can generate the output twice on a cutover, but there's basically no way to be guaranteed to generate it exactly once. Everybody errs on the side of possibly generating duplicate output, and that means that at some level, the client side of all replication schemes needs some sort of duplicate detection scheme. Here we get to use TCP's; if we didn't have TCP, there would have to be something else, maybe application-level sequence numbers, or I don't know what. And you'll see versions of essentially everything I've talked about, like the output rule, for example, in labs two and three, where you design your own replicated state machine. Yes. Yes to the first part. So the scenario is, the primary sends a reply, and then either the primary sends a close packet or the client closes the TCP connection after it receives the primary's reply. So now there's no connection on the client side, but there is a connection on the backup side. And so now the backup consumes the very last log entry, the one that has the input, and it is now live. So we're not responsible for replicating anything at this point, right? Because the backup's now live, there's no other replica; the primary died. So if the backup fails to execute in lockstep with the primary, that's fine actually, because the primary is dead and we do not wanna execute in lockstep with it. Okay, so the backup is now live. It generates an output on this TCP connection, which isn't closed yet from the backup's point of view. This packet arrives at the client on a TCP connection that doesn't exist anymore from the client's point of view. No big whoop on the client, right? It's just gonna throw away the packet as if nothing happened; the application won't know. The client may send a reset, some kind of TCP error packet or whatever, back to the backup, and the backup does something or other with it, but it doesn't matter, because we're not diverging from anything; there's no primary to diverge from. The backup can just handle a stray reset however it likes, and what it'll in fact do is basically ignore it. Now that the backup has gone live, we don't owe anybody anything as far as replication goes, yeah. So you're wondering how the backup effectively looks like the primary to the client? Well, you can bet, since the backup's memory image is identical to the primary's image, that they're sending packets with the very same TCP sequence numbers and the very same everything; they're sending bit-for-bit identical packets.
At this level the physical servers don't really have IP addresses; for our purposes it's the virtual machines, the primary and the backup virtual machines, that have IP addresses, while the physical computer and the VMM are transparent to the network. That's not entirely true, but it's basically the case that the virtual machine monitor and the physical machine don't have an identity of their own on the network, or at least you can configure them that way. Instead the virtual machine, with its own operating system and its own TCP stack, has an IP address and an Ethernet address and all this other stuff, which is identical between the primary and the backup, and when it sends a packet it sends it with the virtual machine's IP address and Ethernet address, and those bits, at least in my mental model, are simply passed through onto the local area network. Which is exactly what we want, because that lets the backup generate exactly the same packets the primary would have generated. There's maybe a little bit of trickery: the physical machines may be plugged into different ports of an Ethernet switch, and we'd like the switch to change its mind about which of the two machines it delivers packets addressed to the replicated service's Ethernet address to, so there's a little bit of funny business there. But for the most part they're just generating identical packets and we just send them out. Okay, so another little detail I've been glossing over is that I've been assuming the primary just fails or the backup just fails, that it's fail-stop, right? But that's not the only option. Another very common situation that has to be dealt with is that the two machines are still up and running and executing, but something funny happened on the network that causes them not to be able to talk to each other while still being able to talk to some clients. If that happened, if the primary and backup couldn't talk to each other but they could still talk to clients, they would both think, oh, the other replica's dead, I'd better take over and go live. And so now we have two machines going live as this service, and they're no longer sending each other log events or anything; they're just diverging. Maybe they're accepting different client inputs and changing their state in different ways. So we have a split-brain disaster if we let both the primary and the backup go live, because it was the network that had some kind of failure instead of the machines. And the way this paper solves it is by appealing to an outside authority to make the decision about which of the primary or the backup is allowed to be live. It turns out that their storage is actually not on local disk. This almost doesn't matter, but their storage is on some external disk server, and, as a totally separate service that has nothing to do with disks, their disk server happens to export a test-and-set service over the network: you can send a test-and-set request to it, there's some flag that it keeps in memory, and it'll set the flag and return what the old value was. So both primary and backup have to acquire this test-and-set flag, which is a little bit like a lock, in order to go live. They both maybe send test-and-set requests at the same time to this test-and-set server. The first one gets back a reply that says, oh, the flag used to be zero, now it's one.
For the second request to arrive, the response from the test-and-set server is, oh, actually the flag was already one when your request arrived, so basically you're not allowed to be primary. And so this test-and-set server, and we can think of it as a single machine, is the arbitrator that decides which of the two should go live if they both think the other one's dead due to a network partition. Any questions about this mechanism? You're busted, yeah, if the test-and-set server happens to be dead at the critical moment. And actually, even if there's not a network partition, under all circumstances in which one or the other of these wants to go live because it thinks the other's dead, even when the other one really is dead, the one that wants to go live still has to acquire the test-and-set lock, because one of the deep rules of the 6.824 game is that you cannot tell whether another computer is dead or not. All you know is that you stopped receiving packets from it, and you don't know whether that's because the other computer is dead or because something has gone wrong with the network between you and the other computer. So all the backup sees is, well, I've stopped getting packets. Maybe the primary's dead, maybe it's alive. The primary probably sees the same thing. So if there's a network partition, they certainly have to ask the test-and-set server, but since they don't know whether it's a network partition, they have to ask the test-and-set server regardless of whether it's a partition or not. So any time either wants to go live, the test-and-set server also has to be alive, because they always have to acquire this test-and-set lock. So the test-and-set server sounds like a single point of failure: we're trying to build a replicated, fault-tolerant whatever thing, but in the end we can't fail over unless this one server is alive. So that's a bit of a bummer. I'm making a strong guess, though, that the test-and-set server is actually itself a replicated service and is fault tolerant, right? Because almost certainly, I mean, these people are VMware; they're happy to sell you a million-dollar highly available storage system that uses enormous amounts of replication internally. Since the test-and-set thing is on their disk server, I'm guessing it's replicated too. And the stuff you'll be doing in lab two and lab three is more than powerful enough for you to build your own fault-tolerant test-and-set server. So this problem can easily be eliminated.
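For concreteness, here is a rough Go sketch of the kind of test-and-set arbiter just described, plus the decision a replica has to make before going live. The shape of the interface and the names (TASServer, tryGoLive) are assumptions for illustration, not the paper's actual API; in the real system the flag lives on the shared disk server and is reached over the network.

```go
package ft

import (
	"errors"
	"sync"
)

// TASServer sketches the external test-and-set service that arbitrates
// which replica may go live. Names and shape are assumptions.
type TASServer struct {
	mu   sync.Mutex
	flag bool
}

// TestAndSet atomically sets the flag and returns its previous value.
// The first caller sees false and wins; every later caller sees true.
func (s *TASServer) TestAndSet() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	old := s.flag
	s.flag = true
	return old
}

// tryGoLive is what a replica does when it stops hearing from its peer.
// It cannot tell a dead peer from a network partition, so it must win
// the test-and-set either way before serving clients.
func tryGoLive(testAndSet func() (old bool, err error)) error {
	old, err := testAndSet()
	if err != nil {
		// The arbiter is unreachable: going live anyway could cause
		// split brain, so the replica has to stay passive.
		return err
	}
	if old {
		return errors.New("another replica already went live")
	}
	return nil // we won the race; safe to go live
}
```

The error path is the single-point-of-failure worry from the lecture: if the arbiter can't be reached, neither replica may safely go live, which is why the test-and-set service itself needs to be replicated.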