I'd like to talk about FaRM today, and about optimistic concurrency control, which is the main interesting technique it uses. The reason we're talking about FaRM is that it's the last paper in our series about transactions, replication, and sharding. This is still an open research area: people are not at all satisfied with the performance, or the performance-versus-consistency trade-offs, that are available, and they're still trying to do better. And this particular paper is motivated by the huge performance potential of new RDMA NICs.

You may be wondering, since we just read about Spanner, how FaRM differs from Spanner. Both of them replicate, and both use two-phase commit for transactions, so at that level they seem pretty similar. Spanner is a deployed system; it's been used a lot, for a long time. Its main focus is geographic replication: keeping copies on, say, the east and west coasts, in different data centers, while still running reasonably efficient transactions that involve data in lots of different places. Its most innovative feature, aimed at the problem of how long two-phase commit takes over long distances, is a special optimization path for read-only transactions using synchronized time. And the performance you get out of Spanner, if you remember, is that a read-write transaction takes 10 to 100 milliseconds, depending on how far apart the data centers are.

FaRM makes a very different set of design decisions and targets a different kind of workload. First of all, it's a research prototype, not by any means a finished product; the goal is to explore the potential of this new RDMA high-speed networking hardware. So it's really still an exploratory system. It assumes that all replicas are in the same data center. It absolutely wouldn't make sense if the replicas were in different data centers, let alone on the east coast versus the west coast. So it's not trying to solve Spanner's problem of "what happens if an entire data center goes down, can I still get at my data?" To the extent that FaRM has fault tolerance, it's for individual crashes, or for recovering after a whole data center loses power and has it restored. It uses this RDMA technique, which I'll talk about, but RDMA turns out to seriously restrict the design options, and because of that, FaRM is forced to use optimistic concurrency control. On the other hand, the performance they get is far, far higher than Spanner's: FaRM can do a simple transaction in 58 microseconds (this is from figure 7 in section 6.3). That's 58 microseconds versus the roughly 10 milliseconds Spanner takes, so more than 100 times faster. So that's maybe the main difference: FaRM has much higher performance, but it is not aimed at geographic replication. FaRM's performance is extremely impressive, much faster than anything else. Another way to look at it is that Spanner and FaRM target different bottlenecks. In Spanner, the main bottleneck people worry about is the speed of light and network delays between data centers. In FaRM, the main bottleneck the design worries about is CPU time on the servers, because it wishes away the speed of light and network delays by putting all the replicas in the same data center. All right, so that's the background of how this fits into the 6.824 sequence.
The setup in FaRM is that it all runs in one data center. There's a configuration manager, which we've seen before, in charge of deciding which servers should be the primary and the backup for each shard of data. If you read carefully, you'll see that they use ZooKeeper to help implement this configuration manager, but it's not the focus of the paper at all. Instead, the interesting thing is that the data is sharded, split up by key, across a bunch of primary/backup pairs. Maybe one shard goes on primary one and backup one, another shard on primary two and backup two, and so forth. That means any time you update data, you need to update it on both the primary and the backup. These replicas are not maintained by Paxos or anything like it; instead, all the replicas of the data are updated whenever there's a change, and if you read, you always read from the primary. (There's a small sketch of this routing below.) The reason for the replication, of course, is fault tolerance, and the kind of fault tolerance they get is that as long as one replica of a given shard is available, that shard will be available. They only require one living replica, not a majority. And the system as a whole, if there's, say, a data-center-wide power failure, can recover as long as there's at least one replica of every shard. Another way of putting that is that with F+1 replicas, they can tolerate up to F failures for that shard.

In addition to the primary/backup copies of each shard, there's transaction code that runs. It's most convenient to think of the transaction code as running at separate clients. In their experiments they actually run the transaction code on the same machines as the FaRM storage servers, but I'll mostly think of the clients as a separate set of machines. The clients run transactions, and the transactions need to read and write data objects stored on the sharded servers. In addition, each client not only runs the transactions but also acts as the transaction coordinator for two-phase commit.

OK, so that's the basic setup. How do they get performance? Because this is really a paper all about how you can get high performance and still have transactions. These are the ingredients. In a sense, the main one is sharding. In their experiments they shard the data 90 ways, over 90 servers (or maybe it's 45). As long as operations on different shards are more or less independent of each other, that gets you an automatic 90x speedup, because you can run whatever you're running in parallel on 90 servers. So that's a huge win from sharding. Another trick they play to get good performance is that the data all has to fit in the RAM of the servers. They don't really store the data on disk; it all has to fit in RAM, which means, of course, that you can get at it very quickly. But they also need to tolerate power failures, and RAM loses its contents on a power failure, so they have a clever non-volatile RAM scheme for making the contents of RAM survive power failures. This is in contrast to storing the data persistently on disk, and it's much faster than disk.
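To pin down that setup, here's a tiny sketch of the kind of region-to-replica routing the configuration manager's table implies. This is a minimal sketch for illustration; all the names and types here are invented, not FaRM's real code.

```go
package main

import "fmt"

// Replicas describes one shard's servers: reads go to the primary; an update
// must reach the primary and every backup before it counts as done.
type Replicas struct {
	Primary string
	Backups []string
}

// regionMap stands in for the configuration manager's table. With one backup
// per shard (F+1 = 2 copies), each shard tolerates F = 1 failure.
var regionMap = map[uint32]Replicas{
	0: {Primary: "P1", Backups: []string{"B1"}},
	1: {Primary: "P2", Backups: []string{"B2"}},
}

func main() {
	r := regionMap[1]
	fmt.Println("read x from:", r.Primary)
	fmt.Println("write x to:", append([]string{r.Primary}, r.Backups...))
}
```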
Another trick they play is this RDMA technique: clever network interface cards that accept packets instructing the NIC to directly read and write the server's memory, without interrupting the server. And another trick is what's often called kernel bypass, which means the application-level code can directly access the network interface card without getting the kernel involved. OK, so these are the clever tricks to look out for that they use to get high performance. We've already talked about sharding a lot, but I'll talk about the rest in this lecture.

First, non-volatile RAM. This is a topic that doesn't really affect the rest of the design directly. As I said, all the data in FaRM is stored in RAM. When a client transaction updates a piece of data, what that really means is that it reaches out to the relevant servers that store the data and causes those servers to modify the object right in RAM. That's as far as the writes go; they don't go to disk. This is in contrast to your Raft implementations, for example, which spend a lot of time persisting data to disk. There's no persisting in FaRM, and writing to RAM is a big win: a write to RAM takes about 200 nanoseconds, whereas a write even to a solid-state drive, which is pretty fast, takes about 100 microseconds, and a write to a hard drive takes about 10 milliseconds. So being able to write to RAM is worth many orders of magnitude in speed for transactions that modify things. But of course RAM loses its contents on a power failure, so it's not persistent by itself.

As an aside, you might think that writing modifications to the RAM of multiple replica servers might be persistent enough, since after all, with F+1 replicas you can tolerate up to F failures. The reason simply writing to RAM on multiple servers is not good enough is that a site-wide power failure will take down all of your servers at once, violating the assumption that failures on different servers are independent. So we need a scheme that works even if power fails to the entire data center.

What FaRM does is put a big battery in every rack and run the power supply through the batteries. The batteries automatically take over if there's a power failure and keep the machines running, at least until the batteries run down. But of course a battery is not very big; it may only be able to run the machines for, say, ten minutes. So the battery by itself is not enough to make the system withstand a lengthy power failure. Instead, when the battery system sees that main power has failed, it keeps the servers running but also alerts all the servers, with some kind of interrupt or message, telling them: look, the power's just failed, you've only got ten minutes before the batteries fail too. At that point the software on FaRM's servers first stops all FaRM processing, and then each server copies all of its RAM to a solid-state drive attached to that server, which could take a couple of minutes.
And once all the RAM is copied to the solid-state drive, the machine shuts itself down. So if all goes well in a site-wide power failure, all the machines save their RAM to disk, and when power comes back to the data center, each machine on reboot reads the memory image that was saved on disk and restores it into RAM. There's some recovery that has to go on, but basically they won't have lost any of their persistent state due to the power failure. What that really means is that FaRM uses conventional RAM but has essentially made it non-volatile, able to survive power failures, with this trick of a battery that alerts the server, and the server saving the RAM contents to a solid-state drive. (There's a small sketch of this sequence below.) Any questions about the NVRAM scheme?

All right, this is a useful trick, but it's worth keeping in mind that it only helps with power failures. The whole sequence of events only gets set in train when the battery notices that main power has failed. If the server fails for some other reason, like a hardware fault, or a bug in the software that causes a crash, the non-volatile RAM system has nothing to do with those crashes. Those crashes cause the machine to reboot and lose the contents of its RAM, and it won't be able to recover them. So this NVRAM scheme is good for power failures, but not other crashes, and that's why, in addition to NVRAM, FaRM also has multiple replicas of each shard. This NVRAM scheme essentially eliminates persistence writes as a bottleneck in the performance of the system, leaving only the network and the CPU as performance bottlenecks, which is what we'll talk about next.

OK, there's a question: if the data center power fails and FaRM moves everything to solid-state drives, would it be possible to carry all the drives to a different data center and continue operation there? In principle, absolutely. In practice, I think it would almost certainly be easier to restore power to the data center than to move the drives. The problem is there's no power in the old data center, so you'd have to physically move the drives, and maybe the computers, to the new data center. It might be possible, but it's not what the FaRM designers had in mind; they assumed power would be restored.

OK, so that's NVRAM, and at this point we can ignore it for the rest of the design. It doesn't really interact with the rest of the design, except that we nowhere have to worry about writing data to disk. As I mentioned, once you eliminate writing data to disk for persistence, the remaining bottlenecks have to do with the CPU and the network. In fact, in FaRM, and indeed in a lot of the systems I've been involved with, a huge bottleneck has been the CPU time required to deal with network interactions, so network and CPU are kind of joint bottlenecks here. FaRM doesn't have any speed-of-light network problems; it spends its effort eliminating bottlenecks in getting network data into and out of the computers.
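Before moving on to networking, here's a minimal sketch of that power-failure sequence, just to pin down the order of events. All the function names are invented, and a local file stands in for the SSD; this is the shape of the scheme as described in lecture, not FaRM's actual code.

```go
package main

import (
	"fmt"
	"os"
)

// onPowerFailure models what a FaRM server does when the rack battery signals
// that main power is gone: stop FaRM processing so the image is consistent,
// dump RAM to the local SSD, then power off before the battery runs out.
func onPowerFailure(ram []byte) error {
	stopFarmProcessing()
	if err := os.WriteFile("ram.img", ram, 0o600); err != nil { // "SSD" stand-in
		return err
	}
	return shutdown()
}

// recoverAfterReboot reloads the saved image when power comes back.
func recoverAfterReboot() ([]byte, error) {
	return os.ReadFile("ram.img")
}

func stopFarmProcessing() { fmt.Println("quiescing FaRM") }
func shutdown() error     { fmt.Println("powering off"); return nil }

func main() {
	if err := onPowerFailure([]byte("all of this server's objects and logs")); err != nil {
		fmt.Println("save failed:", err)
		return
	}
	img, _ := recoverAfterReboot()
	fmt.Printf("recovered %d bytes after reboot\n", len(img))
}
```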
First, as background, I want to lay out the conventional architecture for getting things like remote-procedure-call packets between applications on different computers, just so we have an idea of why FaRM's approach is more efficient. Typically, on a computer that wants to send an RPC message, you have an application running in user space, with a user/kernel boundary. The application makes system calls into the kernel, which are not particularly cheap, in order to send data. Then there's a whole stack of software inside the kernel involved in sending data over the network. There's usually a socket layer that does buffering, which involves copying the data, which takes time. There's typically a complex TCP protocol stack that knows all about retransmission, sequence numbers, checksums, and flow control; there's quite a bit of processing there. At the bottom there's a piece of hardware, the network interface card (NIC), which has a bunch of registers the kernel can talk to in order to configure it, plus the hardware required to send bits out over the cable onto the network. So there's a NIC driver in the kernel. And all self-respecting network interface cards use direct memory access to move packets into and out of host memory, so there are queues of incoming packets that the NIC has DMA'd into memory waiting for the kernel to read, and outgoing queues of packets that the kernel would like the NIC to send as soon as convenient.

So when you want to send a message like an RPC request, it goes down from the application through this stack, and the NIC sends the bits out on a cable. Then there's the reverse stack on the other side: the NIC hardware might interrupt the kernel, the kernel runs driver code, which hands packets to the TCP protocol, which writes them into buffers waiting for the application to read them. At some point the application gets around to reading them, makes system calls into the kernel, and copies the data out of those buffers into user space.

This is a lot of software, a lot of processing, and a lot of fairly expensive CPU operations: system calls, interrupts, and copying data. As a result, classical network communication is relatively slow. It's quite hard to build an RPC system with the traditional architecture that can deliver more than, say, a few hundred thousand RPC messages per second. That might seem like a lot, but it's orders of magnitude too few for the kind of performance FaRM is targeting. And in general, a couple hundred thousand RPCs per second is far, far less than what the actual network hardware, the wire and the NIC, is capable of. Typically these cables run at something like 10 gigabits per second, and it's very, very hard to write RPC software in this style that can generate or absorb anything like 10 gigabits per second of the small messages that databases often need to use; that would be millions, maybe tens of millions, of messages per second. OK, so this is the plan FaRM doesn't use, and FaRM's design is in a sense a reaction to it. Instead, FaRM uses two ideas to reduce the cost of pushing packets around.
The first one I'll call kernel bypass. The idea here is that instead of the application sending all its data down through a complex stack of kernel code, the kernel configures the protection machinery of the computer to allow the application direct access to the network interface card. The application can actually reach out and touch the NIC's registers and tell it what to do. In addition, in this kernel-bypass scheme, the NIC DMAs directly into application memory, where the application can see the bytes arriving without kernel intervention, and when the application needs to send data, it can create queues that the NIC directly reads with DMA and sends out over the wire. So we've completely eliminated all the kernel code involved in networking: the kernel is not involved, there are no system calls, there are no interrupts. The application just directly reads and writes memory that the NIC also sees. The same thing happens on the other side, of course. This is an idea that wasn't possible years ago, but most modern serious network interface cards can be set up to do this. It does, however, mean that all the things TCP was doing for you, like checksums and retransmission, the application is now in charge of. You can actually do kernel bypass yourself, using a toolkit you can find on the web called DPDK; it's relatively easy to use, and it allows people to write extremely high-performance networking applications. So FaRM does use this: its applications talk directly to the NIC, and the NIC DMAs things right into application memory.

We have a student question: does this mean that FaRM machines run a modified operating system? Actually, I don't know the answer to that. I believe FaRM runs on some form of Windows; whether they had to modify Windows, I do not know. In the Linux world, where there's already full support for this, it does require kernel involvement, because ordinarily application code cannot do anything directly with devices, so Linux had to be modified to allow the kernel to delegate hardware access to applications. Those modifications are already in Linux, and maybe already in Windows too. In addition, this depends on fairly intelligent NICs, because of course you're going to have multiple applications that want to play this game with the network interface card, and modern NICs know about multiple distinct sets of queues, so each application can have its own set and the NIC knows about them. So it has required modification of a lot of things. OK, so that's step one, the kernel-bypass idea.
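To make the kernel-bypass idea concrete, here's a toy sketch of an application busy-polling a receive ring that the NIC DMAs packets into. This is just the shape of the idea, with invented names; real kernel-bypass frameworks like DPDK differ considerably in the details.

```go
package main

import "fmt"

// rxRing models a receive ring in application memory: the NIC writes packets
// into slots and marks them ready; the application polls, with no syscalls
// and no interrupts.
type rxRing struct {
	slots [256][]byte // packet buffers the NIC DMAs into
	ready [256]bool   // set by the NIC (simulated here) when a slot is full
	head  int
}

// poll checks the next slot without any kernel involvement.
func (r *rxRing) poll() ([]byte, bool) {
	if !r.ready[r.head] {
		return nil, false // nothing arrived yet; caller just polls again
	}
	pkt := r.slots[r.head]
	r.ready[r.head] = false
	r.head = (r.head + 1) % len(r.slots)
	return pkt, true
}

func main() {
	var ring rxRing
	ring.slots[0], ring.ready[0] = []byte("hello"), true // pretend the NIC DMA'd a packet
	if pkt, ok := ring.poll(); ok {
		fmt.Printf("got %q with no syscall or interrupt\n", pkt)
	}
}
```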
Step two involves even cleverer NICs, and now we're getting into hardware that's not in wide use at the moment; you can buy it commercially, but it's not the default. This is the RDMA scheme: remote direct memory access. These are special network interface cards that support RDMA, and both sides have to have them. I'm drawing these connected by a cable; in fact, there's always a switch that connects many different servers and lets any server talk to any other.

So we have these RDMA NICs, and again we have the applications and their memory. Now, though, an application on the source host can send a special message through the RDMA system that tells the destination host's NIC to directly read or write some bytes of memory, probably a cache line, in the target application's address space. Hardware and firmware on the network interface card do the read or write of the target application's memory directly, and then send the result back to an incoming queue on the source application. And the cool thing about this is that the destination computer's CPU, the target application, didn't know anything about the read or write. It's executed completely in firmware on the network interface card. There are no interrupts here; the application didn't have to think about the request or about replying to it. The NIC just reads or writes memory and sends the result back to the source application. This is a much, much lower-overhead way to read or write stuff in the RAM of the target application; it's much faster than sending an RPC, even with magic kernel-bypass networking.

Does RDMA always require kernel bypass to work at all? I don't know the answer to that. I've only ever heard of it used in conjunction with kernel bypass, because the people interested in any of this are interested only in tremendous performance, and I'm guessing you'd throw away a lot of the performance if you had to send the requests through the kernel.

Another question notes that TCP supports in-order delivery, duplicate detection, and a lot of other excellent properties that you actually need, and it would be extremely awkward if this setup sacrificed reliable or in-order delivery. The answer is that these RDMA NICs run their own reliable, sequenced protocol between the NICs; it's like TCP, although it isn't TCP. When you ask your RDMA NIC to do a read or write, it keeps retransmitting if the request is lost, until it gets a response, and it tells the originating software whether the request succeeded, so you do finally get an acknowledgment back. So you don't in fact have to sacrifice most of TCP's good properties. Now, this stuff only works over a local network; I don't believe RDMA would be satisfactory between distant data centers. It's all tuned for very short, low-latency access. When one application uses RDMA to read or write the memory of another, that's called one-sided RDMA.
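Here's a toy model of what one-sided RDMA gives you. The function names are invented stand-ins for what the NIC firmware does; the point to take away is that no code runs on the target's CPU.

```go
package main

import "fmt"

// server holds a region of a server's (non-volatile) RAM.
type server struct {
	mem []byte
}

// rdmaRead models what the target's NIC does in firmware: copy bytes out of
// the target's memory and return them to the requester. No server software
// runs, and the server's CPU never hears about it.
func rdmaRead(s *server, addr, n int) []byte {
	out := make([]byte, n)
	copy(out, s.mem[addr:addr+n])
	return out
}

// rdmaWrite models a one-sided write; as we'll see, FaRM uses these to append
// messages to per-sender queues and logs rather than to overwrite objects.
func rdmaWrite(s *server, addr int, data []byte) {
	copy(s.mem[addr:], data)
}

func main() {
	s := &server{mem: make([]byte, 64)}
	rdmaWrite(s, 0, []byte("object-v1"))
	fmt.Printf("%s\n", rdmaRead(s, 0, 9)) // about 5 microseconds in the real system
}
```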
Now in fact, FaRM sometimes directly reads data with one-sided RDMA, but at other times it uses RDMA to send messages in an RPC-like protocol: it uses an RDMA write to append a message to an incoming message queue in the target's memory. With writes, that's actually always what FaRM does: it uses RDMA to append a new message to an incoming queue in the target. Since there are no interrupts here, the way the destination of such a message learns about it is that it periodically polls these queues in its memory to see: oh, have I gotten a recent message from anyone? So one-sided RDMA just reads or writes, but RDMA is also used to send messages, by appending either to a message queue or to a log; FaRM appends log entries to its logs on other servers with RDMA as well. And the memory being written into is all non-volatile: the message queues and logs all get written to disk if there's a power failure.

As for performance, figure 2 shows that you can get 10 million small RDMA reads and writes per second, which is fantastic, far faster than you can send messages like RPCs using TCP. And the latency of using RDMA to do a simple read or write is about 5 microseconds. So, very short: 5 microseconds is slower than accessing your own local memory, but faster than just about anything else people do over networks.

OK, so this is the promise. This fabulous RDMA technology came out a while ago, and the FaRM people wanted to exploit it. The coolest possible thing you could imagine doing with it is to use one-sided RDMA reads and writes to directly do all the reads and writes of records stored in database servers' memory. Wouldn't it be fantastic if we never had to talk to the database server's CPU or software at all, and could just get at the data we need, in 5 microseconds a pop, with direct one-sided RDMA reads and writes? In a sense, this paper is about: you start there, and what do you have to do to actually build something useful? An interesting question, by the way, is whether you could implement transactions using only one-sided RDMA; that is, any time you want to read or write data on the server, you use only RDMA and never send messages that have to be interpreted by the server software. It's worth thinking about. In a sense, FaRM answers that question with a no, because that's not really how FaRM works. But it's absolutely worth thinking about why pure one-sided RDMA couldn't be made to work.

All right, so the challenge is how to combine RDMA with transactions, sharding, and replication, because you need sharding, replication, and transactions to have a seriously useful database system. It turns out that all the protocols we've seen so far for transactions and replication require active participation by the server software: in all of them, the server is actively involved in helping clients read or write the data. For example, in the two-phase commit schemes we've seen, the server has to do things like decide whether a record is locked, and if it's not locked, set the lock on it. It's not clear how you could do that with RDMA.
In Spanner, there are all these versions, and it was the server that figured out how to find the right version. Similarly, with transactions and two-phase commit, the data on the server isn't just data: there's committed data, and there's data that's been written but hasn't committed yet. Traditionally it's the server that sorts out whether recently updated data is committed yet, and protects clients from seeing data that's locked or not yet known to be committed. What that means is that, without some clever thought, pure one-sided use of RDMA doesn't seem to be immediately compatible with transactions and replication. And indeed, while FaRM does use one-sided reads to get directly at data in the database, it is not able to use one-sided writes to modify the data.

OK, so this leads us to optimistic concurrency control. The main trick FaRM uses to allow it to both use RDMA and get transactions is optimistic concurrency control. If you remember, I mentioned earlier that concurrency control schemes divide into two broad categories: pessimistic and optimistic. Pessimistic schemes use locks. The idea is that before a transaction can read or write some data, or look at it at all, it must acquire a lock, and it may have to wait for the lock. You read about two-phase locking, for example, in that reading from 6.033. Before you use data, you have to lock it; you hold the lock for the entire duration of the transaction; and only when the transaction commits or aborts do you release the lock. If there are conflicts, because two transactions want to write the same data at the same time, or one wants to read and one wants to write, they can't do it at the same time: one of them has to block and wait for the lock to be released. And the fact that the data has to be locked, and that somebody has to keep track of who owns the lock and when it's released, is exactly what makes it unclear how you could do writes, or even reads, using RDMA in a locking scheme: somebody has to enforce the locks. I'm being a little tentative about this, because I suspect that with cleverer RDMA NICs that supported a wider range of operations, like atomic test-and-set, you might someday be able to do a locking scheme with pure one-sided RDMA. But FaRM doesn't do it.

OK, so what FaRM actually uses is an optimistic scheme. In an optimistic scheme, you can at least read without locking: you just read the data. You don't know yet whether you were allowed to read it, or whether somebody else is in the middle of modifying it; you just read the data, and the transaction uses whatever it happens to read. You also don't directly write the data in an optimistic scheme; instead, you buffer the writes locally in the client until the transaction finally ends. Then, when the transaction finishes and you want to try to commit it, there's what's called a validation stage, in which the transaction processing system tries to figure out whether the reads and writes you did were consistent with serializability. That is, it tries to figure out: was somebody writing the data while I was reading it?
And if they were, boy, we can't commit this transaction, because it computed with garbage instead of consistent read values. If the validation succeeds, you commit; if it doesn't succeed, if you detect that somebody else was messing with the data while you were trying to use it, you abort. So if there are conflicts, if you're reading or writing data and some other transaction is modifying it at the same time, optimistic schemes abort at that point, because by the commit point the computation is already incorrect: you already read the damaged data you weren't supposed to read. There's no way to, for example, block until things are okay; the transaction is already kind of poisoned and just has to abort, and possibly retry. FaRM uses an optimistic scheme because it wants to be able to use one-sided RDMA to just read whatever's there, very quickly; this design was really forced by the use of RDMA. This is often abbreviated OCC, for optimistic concurrency control. The interesting thing in OCC protocols is how validation works: how do you actually detect that somebody else was writing the data while you were trying to use it? That's mainly what I'll talk about in the rest of this lecture. Just to tie this back to the top level of the design: what OCC is doing for FaRM is letting the reads use one-sided RDMA, and therefore be extremely fast, because we're going to check later whether the reads were okay.

All right. FaRM is a research prototype; it doesn't support things like SQL, and it supports a fairly simple API for transactions. Here's the API, just to give you a taste of what transaction code might actually look like. If you have a transaction, you've got to declare the start of the transaction, because we need to say that this particular set of reads and writes needs to occur as a complete transaction; the code declares a new transaction by calling txCreate. (This is all laid out, by the way, in a slightly earlier paper by the same authors, I think from 2014.) You create a new transaction, and then you explicitly call read functions to read objects, supplying an object identifier, an OID, indicating which object you want to read. You get back an object, a copy of what txRead fetched from the server, and you can read and modify that copy in local memory; you might increment some field in the object. Then, when you want to update an object, you call txWrite, again giving it the object ID and the new object contents. Finally, when you're through with all of this, you've got to tell the system to commit the transaction, to actually do the validation and, if it succeeds, cause the writes to really take effect and become visible; for that, you call the commit routine. The commit routine runs a whole bunch of stuff (figure 4, which we'll talk about), and it returns an OK value telling the application whether the commit succeeded or aborted; we need this OK return value to correctly indicate whether the transaction succeeded.
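So a transaction might look roughly like this. This is a hedged sketch in Go of the API as just described; the real API is C++ and these signatures are invented. Only the create/read/write/commit shape comes from the lecture and the paper.

```go
package main

import "fmt"

type OID uint64
type Object struct{ Counter int }

// Tx buffers writes on the client until commit, as OCC requires.
type Tx struct{ writes map[OID]Object }

func txCreate() *Tx { return &Tx{writes: map[OID]Object{}} }

// Read is a stub returning a zero object; real FaRM fetches the object (and
// its version number) from the primary with a one-sided RDMA read.
func (t *Tx) Read(id OID) Object { return Object{} }

// Write only buffers the new contents locally; nothing is sent until commit.
func (t *Tx) Write(id OID, o Object) { t.writes[id] = o }

// Commit is a stub; real FaRM runs the figure-4 protocol here and reports
// whether the transaction committed or aborted.
func (t *Tx) Commit() bool { return true }

func main() {
	tx := txCreate()
	o := tx.Read(42) // read the object
	o.Counter++      // modify a private copy in local memory
	tx.Write(42, o)  // buffer the write
	if !tx.Commit() {
		fmt.Println("aborted; a conflicting transaction got in first")
	} else {
		fmt.Println("committed")
	}
}
```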
Okay, there are some questions. One is: since OCC aborts when there's contention, do retries involve exponential backoff? Because otherwise it seems like if you instantly retried, and a lot of transactions were all trying to update the same value at the same time, they'd all abort, all retry, and waste a lot of time. I don't know the answer to that; I don't remember the paper mentioning exponential backoff, but it would make a huge amount of sense to delay between retries, and to increase the delay, to give somebody a chance of succeeding. This is much like the randomization of the Raft election timers. Another question: is the FaRM API closer in spirit to a NoSQL database? Yes, that's one way of viewing it. It really doesn't have any of the fancy query machinery, like joins, that SQL has; it's a very low-level read/write interface plus transaction support. So you can view it as a NoSQL database with transactions.

All right, that's what a transaction looks like, and these are library calls: create, read, write, commit. Commit is a complex library call that actually runs the transaction coordinator code for the variant of two-phase commit described in figure 4. To repeat: the read call goes off and actually reads the relevant server; the write call just locally buffers the modified object; and it's only at commit that the written objects are sent to the servers.

These object IDs are actually compound identifiers, and they contain two parts. One part identifies a region: all the memory of all the servers is split up into regions, and the configuration manager tracks which servers replicate which region number. So there's a region number in the OID, and the client can look up, in a table, the current primary and backups for a given region number. The other part is an address, a straight memory address within that region. The client uses the region number to pick the primary and backup to talk to, and then hands the address to the RDMA NIC, telling it: look, please read this address, in order to fetch this object.

Another piece of detail we have to get out of the way is the server memory layout. In any one server, there's a bunch of stuff in memory. One part is that if the server replicates one or more regions, it holds the actual regions, and what a region contains is a whole lot of objects. Each object has a header containing a version number; these are versioned objects, but each object only has one version at a time. And the high bit of each version number is a lock flag. So in the header of an object there's a lock flag in the high bit, the version number in the low bits, and then the actual data of the object. Every time the system modifies an object, it increments the version number, and we'll see how the lock bits are used in a couple of minutes. In addition to the regions, the server's memory holds pairs of message queues and logs, one pair for every other computer in the system.
That means that if there are, say, four other computers in the system that can run transactions, there are going to be four logs sitting in this server's memory, one for each of them, that can be appended to with RDMA. So the transaction code running on, say, computer two, when it wants to talk to this server, appends to computer two's log in this server's memory. There's a total of N-squared of these queues floating around in the system. And it seems like there's actually one set of logs, which are meant to be non-volatile, and then also a separate set of message queues used for more RPC-like communication; again, one incoming message queue per other server, written with RDMA writes. All right.

The next thing to talk about is figure 4 in the paper, which explains the OCC commit protocol that FaRM uses. I'm going to go through most of these steps one by one, and to begin with, I'm going to focus only on the concurrency control part. These steps also do replication as well as implementing serializable transactions, but we'll talk about the replication for fault tolerance a little later.

OK, the first thing that happens is the execute phase. This is the txReads and txWrites, the reads and writes the client transaction is doing. The transaction runs on client machine C, and whenever it needs to read something, it uses one-sided RDMA to simply read it out of the relevant primary's memory. So what we've got here is a primary and backup for each of three different shards, and we're imagining that our transaction reads one object from each of these shards using one-sided RDMA reads, which means these reads are blindingly fast, at 5 microseconds each. So the client reads everything it needs to read for the transaction. Also, anything it's going to write, it first reads; it has to do the read because it needs the object's initial version number.

All right, that's the execute phase. Then, when the transaction calls txCommit to indicate that it's totally done, the library, on the txCommit call in the client, acts as transaction coordinator and runs this whole protocol, which is a kind of elaborate variant of two-phase commit. It's described in terms of rounds of messages: the transaction coordinator sends a bunch of LOCK messages and waits for the replies, then VALIDATE messages and waits for all the replies, and so on.

The first phase in the commit protocol is the LOCK phase. In this phase, for each object the client has written, it sends the updated object to the relevant primary, as a new entry in that primary's log for this client; the client is really using RDMA to append to the primary's log. And what it's appending is the object ID of the object it wants to write, the version number the client initially read when it read the object, and the new value.
So it appends the object ID, version number, and new value to the client's log at the primary of each shard it wrote an object in. In this example, the transaction wrote two different objects, one on primary one and the other on primary two. Now these new log records are sitting in the logs at the primaries. The primary, though, has to actively process these log entries, because it needs to do a number of checks involved with validation, to see if its part of the transaction can be allowed to commit. So at this point we have to wait for each primary to poll this client's log in the primary's memory, see that there's a new log entry, process that new log entry, and then send back a yes or no vote saying whether it is or is not willing to do its part of the transaction.

All right, so what does the primary do when its polling loop sees an incoming LOCK log entry from a client? First of all, if the object with that object ID is currently locked, the primary rejects this LOCK message and sends back a message to the client, using RDMA, saying no, this transaction cannot be allowed to proceed. It's voting no in the two-phase commit, and that will cause the transaction coordinator to abort the transaction. If the object's not locked, the next thing the primary does is check the version number: it checks that the version number the client sent it, that is, the version the client originally read, is unchanged. If the version number has changed, that means somebody else wrote the object between when our transaction read it and when it tried to write it, so again the primary will respond no and forbid the transaction from continuing. But if the version number is the same and the lock is not set, then the primary sets the lock and returns a positive response to the client.

Because the primary is multi-threaded, running on multiple CPUs, there may be other CPUs reading incoming log queues from other clients at the same time on the same primary, so there may be races between the LOCK-record processing of different transactions trying to modify the same object. The primary therefore uses an atomic instruction, a compare-and-swap, to check the version number and lock, and to set the lock bit, as a single atomic operation. And this is the reason the lock bit has to be in the high bit of the version number word: so that a single instruction can do a compare-and-swap over the version number and the lock bit together. OK, one thing to note is that if the object's already locked, there's no blocking, no waiting for the lock to be released; the primary simply sends back a no if some other transaction has it locked. All right, any questions about the LOCK phase of committing?
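Here's a minimal sketch of that header trick: the lock flag lives in the top bit of the same 64-bit word as the version number, so one compare-and-swap can check "unlocked AND version unchanged" and take the lock in a single atomic step. The bit layout is as described in lecture; the helper names are invented.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

const lockBit = uint64(1) << 63 // lock flag in the high bit; version below it

// tryLock is what the primary does when it processes a LOCK log record: CAS
// from (unlocked, expected version) to (locked, expected version). Failure
// means the object is locked or the version moved on, so the primary votes no.
func tryLock(hdr *uint64, readVersion uint64) bool {
	return atomic.CompareAndSwapUint64(hdr, readVersion, readVersion|lockBit)
}

// commitPrimary installs the new value (elided here), bumps the version
// number, and clears the lock bit.
func commitPrimary(hdr *uint64) {
	v := atomic.LoadUint64(hdr) &^ lockBit
	atomic.StoreUint64(hdr, v+1)
}

func main() {
	hdr := uint64(7)              // version 7, unlocked
	fmt.Println(tryLock(&hdr, 7)) // true: this transaction holds the lock
	fmt.Println(tryLock(&hdr, 7)) // false: already locked, primary votes no
	commitPrimary(&hdr)           // now version 8, unlocked
	fmt.Println(tryLock(&hdr, 7)) // false: stale version, primary votes no
}
```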
All right, back in the client, which is acting as transaction coordinator: it waits for responses from the primaries of the shards for every object the transaction modified. If any of them say no, if any of them reject the transaction, then the transaction coordinator aborts the whole transaction, and it actually sends out messages to all the primaries saying: oh, I've changed my mind, I don't want to commit this transaction after all. But if all the primaries answered yes, then the transaction coordinator decides that the transaction can actually commit. The primaries, of course, don't know whether they all voted yes, so the transaction coordinator has to notify all the primaries: yes, indeed, everybody voted yes, so please do actually commit this. The way the client does this is by appending another record, a COMMIT-PRIMARY record, to the log at each primary with a modified object. (I'm skipping over VALIDATE and COMMIT-BACKUP for now; I'll talk about those later, so just ignore them for the moment.) So the transaction coordinator appends COMMIT-PRIMARY to each primary's log, and it only has to wait for the hardware RDMA acknowledgments; it doesn't have to wait for the primaries to actually process the log record. In fact, as soon as the coordinator gets a single acknowledgment from any of the primaries, it can return yes, OK equals true, to the transaction, signifying that the transaction succeeded. And there's another stage later on where, once the transaction coordinator knows that every primary knows the transaction committed, it can tell all the primaries to discard the log entries for this transaction.

OK, now there's one last thing that has to happen. The primaries, which are polling their logs, will at some point notice a COMMIT-PRIMARY record. A primary that receives the COMMIT-PRIMARY log entry knows that it locked that object previously and that the object must still be locked. So the primary updates the object in its memory with the new contents that were previously sent in the LOCK message, increments the version number associated with that object, and finally clears the lock bit on that object. What that means is that as soon as a primary receives and processes a COMMIT-PRIMARY log message, since it clears the lock bit and updates the data, it may well expose the new data to other transactions: other transactions after this point are free to use the object, with its new value and new version number.

All right, I'm going to do an example. Any questions about the machinery before I start on an example? Feel free to ask questions any time.

All right, so how about an example? Suppose we have two transactions, T1 and T2, both trying to do the same thing: they both just want to increment x, where x is an object sitting off in some server's memory. Before we look at what actually happens, let's remind ourselves what the valid outcomes are. That's all about serializability: FaRM guarantees serializability, which means that whatever FaRM actually does has to be equivalent to some one-at-a-time execution of these two transactions.
So we're allowed to see the results you'd get if T1 ran and then, strictly afterwards, T2 ran; or the results that could ensue if T2 ran and then T1. Those are the only possibilities. Now, in fact, FaRM is entitled to abort a transaction, so we also have to consider the possibility that one of the two transactions aborted, or indeed that both aborted. Since they're both doing the same thing, there's a certain amount of symmetry here. One possibility is that both committed, which means two increments happened, so one legal outcome is x equals 2; and since txCommit has to agree with whether things aborted or committed, both transactions must see txCommit return true in that case. Another possibility is that only one of the transactions committed and the other aborted: one true and the other false. And another possibility is that both aborted; I don't think that can necessarily happen here, but it's legal, so x is unchanged and we want both to get false back from txCommit. So we had better not see anything other than these three options.

All right. Of course, what happens depends on the timing, so I'm going to write out various ways the commit protocol could interleave, and for convenience I have a handy reminder of the actual commit protocol here. One possibility is that they run exactly in lockstep: they both send all their messages at the same time, and they both read at the same time. Assume x starts out at zero. If they both read at the same time, they're both going to see zero. Now I'm going to assume they both send out LOCK messages at the same time, and indeed they accompany their LOCK messages with the value 1, since they're adding one to the zero they read; that's the value they'd install if they both committed.

So if this is the scenario, what's going to happen, and why? Anyone like to raise a hand and hazard a guess? Well, certainly both will be able to read, since a one-sided read can't possibly fail. They're both going to send essentially identical LOCK messages to whichever primary holds object x, with the same version number, whatever version they read, and the same value. So the primary is going to see two LOCK messages in two different incoming logs, assuming the transactions are running on different clients. Exactly what happens now is left slightly up to our imagination by the paper, but I think the two incoming log messages could be processed in parallel on different cores of the primary. The critical instruction on the primary, though, is the atomic compare-and-swap. Exactly: somebody's just volunteered the answer that one of them will get to the compare-and-swap instruction first. Whichever core gets there first will set the lock bit on that object's version and observe that the lock bit wasn't previously set; whichever one executes the atomic compare-and-swap second will observe that the lock bit is already set. That means one of the two will return yes, and the other will fail the lock, observe the lock is already set, and return no. For symmetry, I'll just imagine it's T2 to which the primary sends back a no. So T2's client code will abort. T1 got the lock, got a yes back, and will actually commit.
When it commits, when the primary actually gets the COMMIT-PRIMARY message, it installs the updated object, incremented to one, clears the lock bit, increments the version, and T1's txCommit returns true. T2's txCommit is going to return false, because the primary sent back a no, and the final value will be x equals 1. That was one of our allowed outcomes. Of course, it's not the only interleaving. Any questions about how this played out, or why it executed the way it did?

OK, so there are other possible interleavings. How about this one? Let's imagine that T2 does its read first (it doesn't really matter whether the reads are concurrent or not), then T1 reads, and then T1 is a little faster: it gets its LOCK message in, gets a yes reply, and gets its commit in. Then, afterwards, T2 gets going again and sends its LOCK message to see if it can commit. So what happens this time? T1's LOCK is going to succeed, because there's no reason for the lock bit to be set, and the second LOCK message hasn't even been sent yet. T1's LOCK message sets the lock; the COMMIT-PRIMARY message then clears the lock bit. So the lock bit will be clear by the time T2 inserts its LOCK entry in the primary's log, and the primary won't see the lock bit set at that point. Yes: somebody's volunteered that the LOCK message contains the version number that T2 originally read, and since COMMIT-PRIMARY increments the version number, the primary is going to see that the version number is wrong; the version number on the real object is now higher. So it's going to send back a no response to the coordinator, and the coordinator is going to abort the transaction. Again we get x equals 1, with one transaction returning true and the other false, which is the same final outcome as before, and it is allowed. Any questions about how this played out? (A slightly different scenario I was going to consider, in which the commit happened after T2's lock, is essentially the same as the first scenario: one transaction got the lock set and the other observed that the lock was set.)

All right, one last scenario. Suppose we see this: what's going to happen this time? Yes, somebody has the right answer. Of course the first transaction will go through, because there's no contention on it at all. The second transaction, when it goes to read x, will actually see the new version number, as incremented by the COMMIT-PRIMARY processing on the primary, and the lock bit won't be set. So then, when it goes to send its LOCK log entry to the primary, the lock-processing code on the primary will see that the lock's not set and the version is the same, the latest version, and it'll allow the commit. So for this one, the outcome we're going to see is x equals 2, because the second read not only saw the new version number, it actually read the new value, which was one. And both calls to txCommit will return true. That's right: both succeeded, with x equals 2. All right. So, you know, this happened to work out in all these cases.
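Putting the coordinator's side of what we've seen so far together, here's a hedged outline of the flow: LOCK records, then COMMIT-PRIMARY, with abort on any no vote. VALIDATE and COMMIT-BACKUP, which I've been skipping, come up next; all the function names here are invented stand-ins for RDMA log appends.

```go
package main

import "fmt"

// commitTx sketches the coordinator side of the simplified figure-4 protocol:
// append a LOCK record at each written object's primary, abort on any no,
// otherwise append COMMIT-PRIMARY records.
func commitTx(written []string) bool {
	for _, oid := range written {
		if !appendLockRecord(oid) { // primary's CAS failed: locked or stale version
			appendAbortRecords(written)
			return false // txCommit reports false; the caller may retry
		}
	}
	for _, oid := range written {
		appendCommitPrimary(oid) // primary installs the value, bumps version, unlocks
	}
	return true // reportable as soon as one hardware RDMA ack comes back
}

// Stubs standing in for RDMA appends to per-client logs at the primaries.
func appendLockRecord(string) bool   { return true }
func appendAbortRecords([]string)    {}
func appendCommitPrimary(oid string) { fmt.Println("commit-primary for", oid) }

func main() {
	fmt.Println("committed:", commitTx([]string{"x"}))
}
```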
The intuition behind why this optimistic concurrency control provides serializability, why it checks that the execution that did happen is equivalent to a one-at-a-time execution, is essentially this: if there was no conflicting transaction, then the version numbers and the lock bits won't have changed. If nobody else is messing with these objects, we'll see the same version numbers at the end of the transaction as when we first read the objects. Whereas if there is a conflicting transaction, one that modified something between when we read an object and when we try to commit a change, then, if it actually started to commit, we will see a new version number or a set lock bit. So the comparison of version numbers and lock bits, between when you first read the object and when you finally commit, tells you whether some other commit to the objects snuck in while you were using them. Right. And the cool thing to remember here is that this optimistic scheme, in which we don't actually check the locks when we first use the data, is what allowed us to use extremely fast one-sided RDMA reads to read the data and get high performance.

OK. The way I've explained it so far, without VALIDATE and without COMMIT-BACKUP, is the way the system works. But as I said, VALIDATE is an optimization for objects that are just read and not written, and COMMIT-BACKUP is part of the scheme for fault tolerance. In the few minutes we have left, I'm going to talk about VALIDATE.

The VALIDATE stage is an optimization for objects that were only read by the transaction and not written, and it's going to be particularly interesting for a straight read-only transaction that modified nothing. The optimization is that the transaction coordinator can execute the VALIDATE with a one-sided read, which is extremely fast, rather than having to append something to a log and wait for the primary to see our log entry and think about it. This VALIDATE one-sided read is going to be much, much faster; it essentially replaces LOCK for objects that were only read. What VALIDATE does is: the transaction coordinator re-fetches the object's header. Say it read an object in the execute phase; when it's committing, instead of sending a LOCK message, it re-fetches the object's header and checks that the version number now is the same as the version number when it first read the object, and that the lock bit is clear. So instead of sending a LOCK message, it sends this VALIDATE, which should be much faster for a read-only use of an object.

So let me put up another transaction example and run through how this works. Suppose x and y are initially zero, and we have two transactions. T1 says: if x is zero, set y to one. T2 says: if y is zero, set x to one. This, by the way, is an absolutely classic test for strong consistency. If the execution is serializable, it's got to be either T1-then-T2 or T2-then-T1; any correct implementation has to get the same results as running them one at a time. If you run T1 and then T2, you get y equals 1 and x equals 0, because by the time of the second transaction's if statement, y is already one, so the second transaction won't do anything.
And symmetrically, running T2 and then T1 gives you x = 1 and y = 0. It also turns out that if both transactions abort, you can get x = 0, y = 0. But what you're absolutely not allowed to get is x = 1, y = 1. Okay, so we're going to use this as a test of what happens with VALIDATE. Again, suppose the two transactions execute; the most obvious case is that they execute at exactly the same time, and that's the hardest case. So we have T1's read of x and T2's read of y. T1 is going to LOCK y, because it wrote y, and T2 is going to LOCK x. But since we're now using the read-only VALIDATE optimization, T1 has to VALIDATE x and T2 has to VALIDATE y, because each read that object but didn't write it, and VALIDATE is much quicker. And then maybe each of them commits. So the question is: if we use VALIDATE as I described it, just checking that the version number hasn't changed and the lock bit isn't set, do we get a correct answer? And we do: the validation is going to fail for both, because when those LOCK messages were processed by the relevant primaries, they set the lock bits. Initially, presumably, the reads saw a clear lock bit. But when we come to VALIDATE, even though the client is just doing a one-sided read of the object header for x or y, it's going to see the lock bit that was set by the processing of those LOCK requests. So both transactions see the lock bit set on the object they merely read, both abort, and neither x nor y is modified. That was one of the legal outcomes, and that's right, somebody noticed this: indeed, both VALIDATEs fail.

Of course, sometimes a transaction can go through, and here's a scenario in which it works out. This time transaction one is a little faster. Its VALIDATE of x succeeds, because nothing happened to x between when transaction one read it and when it validated: the one-sided read reveals an unchanged version number and a clear lock bit. Its LOCK of y presumably also goes through without trouble, because nobody has modified y either, so the primary answers yes. So transaction one can commit, and it will have set y to one. By this point, when the primary processes T2's LOCK of x, that also goes through with no problem, because nobody has modified x. But when the client running transaction two refetches the version number and lock bit for y, what it sees depends on whether T1's commit has happened yet. If the commit hasn't happened yet, the VALIDATE will see that the lock bit is set, because it was set by T1's LOCK. If the commit has already happened, the lock bit will be clear, but the VALIDATE's one-sided read will see a different version number than T2 originally saw. Either way T2's VALIDATE fails, and indeed somebody suggested this answer: transaction one will commit and transaction two will abort. And although I don't have time to go into it here, if you have a straight read-only transaction, there doesn't need to be a locking phase or a commit phase at all; there's a sketch of that whole client-side path below.
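Here is a minimal sketch, reusing the toy Object header from the earlier sketch, of the client-side VALIDATE check and of a read-only transaction built from nothing but reads and validates. In the real system both the initial read and the refetch of the header are one-sided RDMA reads; an ordinary struct access stands in for them here, and again all the names are mine, not Farm's.

```go
package main

import "fmt"

// Object is the same toy header as before: lock bit, version, value.
type Object struct {
	locked  bool
	version uint64
	value   int
}

// validate models the coordinator refetching an object's header and
// comparing it with what the transaction saw during execution. In Farm
// that refetch is a one-sided RDMA read, so no server CPU is involved.
func validate(o *Object, readVersion uint64) bool {
	return !o.locked && o.version == readVersion
}

// readOnlyTx sketches a pure read-only transaction: read every object,
// then validate every object. No lock phase, no commit phase.
func readOnlyTx(objs []*Object) ([]int, bool) {
	readVersions := make([]uint64, len(objs))
	values := make([]int, len(objs))
	for i, o := range objs {
		readVersions[i] = o.version // execute phase: one-sided read
		values[i] = o.value
	}
	for i, o := range objs {
		if !validate(o, readVersions[i]) {
			return nil, false // a conflicting commit snuck in: abort
		}
	}
	return values, true
}

func main() {
	x := &Object{version: 3, value: 1}
	y := &Object{version: 5, value: 2}

	vals, ok := readOnlyTx([]*Object{x, y})
	fmt.Println(vals, ok) // [1 2] true

	// If the lock bit is set when we validate (say a writer's LOCK
	// got in), the read-only transaction aborts.
	x.locked = true
	_, ok = readOnlyTx([]*Object{x, y})
	fmt.Println(ok) // false
}
```

A real client would simply retry the transaction after an abort.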
Read-only transactions, then, can be done with just reads: one-sided RDMA reads for the reads, and one-sided RDMA reads for the VALIDATEs. So read-only transactions are extremely fast and don't require any work, any attention at all, from the server CPUs. These one-sided reads are at the heart of Farm's performance, and indeed everything about Farm is very streamlined, partly due to RDMA. And it uses OCC because it's basically forced to, in order to be able to do reads without checking locks.

There are a few downsides, though. It turns out optimistic concurrency control works best when there are relatively few conflicts; if there are conflicts all the time, transactions will constantly have to abort. And there are the other restrictions on Farm I already mentioned: the data must all fit in RAM, and all the computers must be in the same data center. Nevertheless, this was viewed at the time, and still is, as a surprisingly high-speed implementation of distributed transactions, much faster than any system in production use. It's true that the hardware involved is a little bit exotic: it really depends on the non-volatile RAM scheme and on these special RDMA NICs, and those are not particularly pervasive now, although you can get them. With performance like this, though, it seems likely that both NVRAM and RDMA will eventually be pretty pervasive in data centers, so that people can play these kinds of games.

And that's all I have to say about Farm. Happy to take any questions if anybody has some. If not, I'll see you next week with Spark, which, you may be happy to know, is absolutely not about transactions. All right, everyone. Bye-bye.