Today, the paper I'm going to discuss is Frangipani. This is a fairly old distributed file system paper. The reason we're reading it is that it has a lot of interesting and good design having to do with cache coherence, distributed transactions, and distributed crash recovery, as well as the interactions between them. Those are really the ideas we're going to try to tease out. Cache coherence is the idea that even if I have something cached, and you modify it, something will happen so that I can see your modifications despite my cache. We also have distributed transactions, which are needed internally in file systems to be able to make complex updates to the file system data structures. And because the file system is essentially split up among a bunch of servers, it's critical to be able to recover from crashes of those servers. As for the overall design of Frangipani, it's a network file system. It's intended to look like an ordinary file system to existing applications, ordinary UNIX programs running on people's workstations, much like Athena's AFS lets you get at your Athena home directory and various project directories from any Athena workstation. So the overall picture is that you have a bunch of users. Each user in the paper's world is sitting in front of a workstation, which is not a laptop in those days, but a computer with a keyboard and display and a mouse and a window system. I'm going to call these workstations workstation one, workstation two. Each workstation runs an instance of the Frangipani server, and almost all of the stuff that happens in this paper goes on in the Frangipani software in each workstation. A user might be running ordinary programs like a text editor that's reading and writing files, and maybe when they're finished editing a source file they run it through the compiler, which reads that source file. When these ordinary programs make file system calls, inside the kernel there's a Frangipani module that implements the file system; every workstation has its own copy. And then the real storage of the file system data structures, certainly file contents but also inodes, directories, the list of files in each directory, and the information about which inodes and blocks are free, all of that is stored in a shared virtual disk service called Petal. Petal runs on a separate set of machines that are probably server machines in a machine room rather than workstations on people's desks. Petal, among many other things, replicates data, so you can think of Petal servers as coming in pairs: if one crashes, we can still get at our data. And when Frangipani needs to read or write, say read a directory, it sends a remote procedure call off to the correct Petal server saying, here's the block that I need, please read it and return it to me. For the most part, Petal is acting like a disk drive. You can think of it as a kind of shared disk drive that all of these Frangipanis talk to, which is why it's called a virtual disk. From our point of view, for most of this discussion, we're just going to imagine Petal as a disk drive that's used over the network by all these Frangipanis.
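Concretely, here's a minimal Go sketch of the kind of block-level interface Frangipani sees when it talks to Petal. The names (PetalClient, ReadBlock, WriteBlock, blockSize) and the in-memory map are illustrative assumptions, not Petal's real API; the point is only that storage is addressed by block number, like an ordinary disk, and replication happens behind this interface.

```go
// A minimal sketch of a Petal-like virtual disk interface: read and write by
// block number. All names are assumptions made for illustration.
package main

import "fmt"

const blockSize = 512 // illustrative block size

// PetalClient stands in for the RPC stub a workstation would use to talk to
// the Petal servers; a map keyed by block number is enough to show the shape.
type PetalClient struct {
	blocks map[uint64][]byte
}

func NewPetalClient() *PetalClient {
	return &PetalClient{blocks: make(map[uint64][]byte)}
}

// ReadBlock returns the contents of one virtual-disk block.
func (p *PetalClient) ReadBlock(blockNum uint64) []byte {
	if b, ok := p.blocks[blockNum]; ok {
		return b
	}
	return make([]byte, blockSize) // unwritten blocks read as zeros
}

// WriteBlock overwrites one virtual-disk block.
func (p *PetalClient) WriteBlock(blockNum uint64, data []byte) error {
	if len(data) != blockSize {
		return fmt.Errorf("block must be %d bytes", blockSize)
	}
	cp := make([]byte, blockSize)
	copy(cp, data)
	p.blocks[blockNum] = cp
	return nil
}

func main() {
	petal := NewPetalClient()
	buf := make([]byte, blockSize)
	copy(buf, "root directory contents")
	if err := petal.WriteBlock(7, buf); err != nil {
		panic(err)
	}
	fmt.Println(string(petal.ReadBlock(7)[:23])) // "root directory contents"
}
```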
You read and write it by giving it a block number, an address on the disk, just like an ordinary hard drive: read me this block, write this block. OK, so the intended use, the use the authors intended, is actually a reasonably important driver of the design. What they wanted was to support their own activities. They were members of a research lab of maybe 50 people, and they were used to shared infrastructure, things like time-sharing machines, or workstations using previous network file systems to share files among cooperating groups of researchers. They wanted a file system they could use to store their own home directories in, as well as shared project files. That meant that if I edited a file, I'd really like the other people I work with to be able to read the file I just edited. So we want that kind of sharing. In addition, it's great if I can sit down at any workstation, my workstation, your workstation, a public workstation in the library, and still get at all of the files in my home directory, everything I need in my environment. So they were really interested in a shared file system for human users in a relatively small organization, small enough that everybody was trusted: all the people, all the computers. So the design has essentially nothing to say about security, and indeed arguably would not work in an environment like Athena, where you can't really trust the users or the workstations. It's very much designed for their environment. Their environment was also important for performance. It turns out that the way most people use the workstations they sit in front of is that they mostly read and write their own files. They may read some shared files, programs, or project files, but most of the time I'm reading and writing my files, and you're reading and writing your files on your workstation, and it's really the exception that we're actively sharing files. So it makes a huge amount of sense, even though officially the real copies of files are stored in the shared disk, to have some kind of caching, so that after I log in and use my files for a while, they're locally cached and can be gotten at in microseconds instead of the milliseconds it takes to fetch them from the file servers. Frangipani supports this kind of caching. Furthermore, it supports write-back caching. So we have not only caching in each workstation, in each Frangipani server, we also have write-back caching, which means that if I modify a file, or even create a file in a directory, or delete a file, or do basically any other operation, as long as no other workstation needs to see it, Frangipani acts as a write-back cache. That means my writes stay only local in the cache, at least initially. If I create a file, the information about the newly created file, say a newly allocated inode with initialized contents and a new name added to my home directory, all those modifications are initially done only in the cache. Therefore things like creating a file can be done extremely rapidly; they just require modifying local memory in this machine's cache, and in general they're not written back to Petal until later.
So at least initially, we can do all kinds of modifications to the file system, at least to my own directories and files, completely locally. And that's enormously helpful for performance: it's like a factor of a thousand difference between modifying something in local memory and having to send remote procedure calls to some server. Now, one serious consequence of that, which is extremely determinative of the architecture here, is that the logic of the file system has to be in each workstation. In order for my workstation to implement things like creating a file just operating out of its local cache, all the logic, all the intelligence of the file system has to be sitting in my workstation. In their design, to a first approximation, the Petal shared storage system knows absolutely nothing about file systems or files or directories. Petal is, in a sense, a very straightforward, simple system, and all the complexity is in the Frangipani in each client. So it's a very decentralized scheme, and one reason is that that's what you need, or at least it was the design they could think of, to allow modifications to be done purely locally in each workstation. It does have the nice side effect, though, that since most of the complexity lives in the workstations and most of the CPU time is spent there, as you add workstations, as you add users to the system, you automatically get more CPU capacity to run those new users' file system operations. Because most file system operations happen locally in the workstations, the system has a certain degree of natural scalability as you add workstations: each new workstation is a bit more load from a new user, but it's also a bit more available CPU time to run that user's file system operations. Of course, at some point you're going to run out of gas in the central storage system, and then you may need to add more storage servers too. So we have a system that does serious caching, and furthermore does modifications in the cache. That leads immediately to some serious challenges in the design, and the design is mostly about solving the challenges I'm about to lay out. These are largely challenges that come from caching and from this decentralized architecture where most of the intelligence sits in the clients. The first challenge is this: suppose workstation 1 creates a file, say a new file /a. Initially, it just creates this in its local cache. It may first need to fetch the current contents of the / directory from Petal, but when it creates the file, it just modifies its cached copy and doesn't immediately send it back to Petal. Then there's an immediate problem. Suppose the user on workstation 2 asks for a directory listing of /. We'd really like this user to see the newly created file. That's what users are going to expect, and users will be very confused if the person down the hall from me creates a file and says, oh, I put all this interesting information in this new file /a, why don't you go read it, and then I try to read it and it's totally not there. So we absolutely want very strong consistency: if the person down the hall says they've done something in the file system, I should be able to see it.
And if I edit a file on one workstation and then compile it on a compute server on another computer, I want the compiler to see the modifications I just made to my file, which means the file system has to do something to ensure that readers see even the most recent writes. We've been calling this strong consistency or linearizability before, and that's basically what we want. In the context of caches, though, the issue is not really about the storage server; it's about the fact that there's a modification here that needs to be seen somewhere else. For historical reasons, that's usually called cache coherence. That is the property of a caching system that even if I have an old version of something cached, and someone else modifies it in their cache, then my cache will automatically reflect their modifications. So we want this cache coherence property. Another issue is that since all the files and directories are shared, we could easily have a situation where two different workstations are modifying the same directory at the same time. Suppose user one on their workstation wants to create a file /a, a new file in the root directory, and at the same time user two wants to create a new file called /b. At some level they're creating different files, a and b, but they both need to modify the root directory to add a new name to it. So the question is, even if they do this simultaneously, two creations of differently named files in the same directory from different workstations, will the system be able to sort out these concurrent modifications to the same directory and arrive at some sensible result? And of course the sensible result we want is that both a and b end up existing. We don't want a situation in which only one of them ends up existing because the second modification overwrote and superseded the first. This goes by a lot of different names, but we'll call it atomicity. We want operations such as creating a file or deleting a file to act as if they are instantaneous in time, and therefore never interfere with operations that occur at similar times on other workstations. We want things to happen at a point in time and not be spread out, even if they're complex operations that involve touching a lot of state; we want them to appear to occur instantaneously. And a final problem: suppose my workstation has modified a lot of stuff, and many of its modifications are done only in the local cache because of this write-back caching. If my workstation crashes after having modified some stuff in its local cache, and maybe has reflected some but not all of those modifications back to storage in Petal, other workstations are still executing and still need to be able to make sense of the file system. The fact that my workstation crashed while in the middle of something had better not wreck the entire file system for everybody else, or even any part of it. So what we need is crash recovery of individual servers. We want my workstation to be able to crash without disturbing the activity of anybody else using the same shared system. Even if they look at my directories and my files, they should see something sensible.
Maybe it won't include the very last things I did, but they should see a consistent file system and not a wrecked file system data structure. So we want crash recovery. And as always with distributed systems, it's made more complex because we can easily have a situation where only one of the servers crashes while the others are running. For all three of these challenges, the challenges in this discussion are about how Frangipani works, about the Frangipani software inside the workstations, so when I talk about a crash, I'm talking about a crash of a workstation and its Frangipani. The Petal virtual disk has many similar questions associated with it, but they're not really the focus today; Petal has a completely separate set of fault-tolerance machinery built into it, and it's actually a lot like the chain-replication kind of systems we talked about earlier. OK, so I'm going to talk about each of these challenges in turn. The first challenge is cache coherence. The game here is to get the benefits of both linearizability, that is, whenever I look at anything in the file system I always see fresh data, the very latest data, and caching, as good caching as we can get, for performance. Somehow we need the benefits of both. The way people implement cache coherence is with what are called cache coherence protocols. It turns out these protocols are used in many different situations, not just distributed file systems: the per-core caches in multicore processors also use cache coherence protocols, which are not unlike the protocols I'm going to describe for Frangipani. All right, so it turns out that Frangipani's cache coherence is driven by its use of locks. We'll see locks come up later for both atomicity and crash recovery, but the particular use of locks I'm going to talk about for now is the use of locks to drive cache coherence, to help workstations ensure that even though they're caching data, they're caching the latest data. So as well as the Frangipani workstations and Petal servers, there's a third kind of server in the Frangipani system: lock servers. I'm going to pretend there's one lock server, although you could shard the locks over multiple servers. The lock server is, logically at least, a separate computer, although I think they ran them on the same hardware as the Petal servers. It basically just has a table of named locks. We'll consider the locks to be named after file names, although in fact they're named after i-numbers. So for every file we potentially have a lock, and each lock is possibly owned by some owner. For this discussion I'm going to describe the locks as exclusive locks, although in fact Frangipani has a more complicated scheme that allows either one writer or multiple readers. So for example, maybe file X has recently been used by workstation 1, and workstation 1 has a lock on it; and maybe file Y has recently been used by workstation 2, and workstation 2 has a lock on it. The lock server remembers, for each file, who has the lock, if anyone; maybe nobody has a lock on a given file. And then each workstation keeps track of which locks it holds.
And this is tightly tied to keeping track of cached data as well. So in each workstation's Frangipani module there's also a lock table. It records which files the workstation holds locks for, what kind of lock it holds, and the cached contents of that file, which might be a whole bunch of data blocks, or directory contents, for example. So we also have content here. When a Frangipani server decides it needs to use the directory /, or look at file a, or look at an inode, it first asks the lock server for a lock on whatever it's about to use. Then it asks Petal for the data of whatever file or directory it needs to read. And then the workstation remembers: I have a copy of file x, and its cached content is such-and-such. It turns out that a workstation can hold a lock in at least two different modes. The workstation can be actively reading or writing the file or directory right now, that is, it's in the middle of a file creation operation, or a deletion, or a rename, or something; in that case I'll say that the lock is held by the workstation and is busy. The other mode is after a workstation has finished some operation like creating a file or reading a file: as soon as that system call is over, whatever system call it was, rename or read or write or create, the workstation gives up the lock, but only internally. It's not actively using that file anymore, but as far as the lock server is concerned, the workstation still holds the lock; the workstation just notes for its own use that it's not actively using it. We'll call that state idle: the lock is still held by the workstation, but the workstation isn't really using it. And that will be important in a moment. OK, so these two tables are set up consistently. If we assume this is workstation 1, the lock server knows that locks for x and y exist and are both held by workstation 1, and workstation 1 has the equivalent information in its own table: it knows it's holding these two locks, and furthermore it remembers the cached content for the files or directories that the two locks cover. There are a number of rules that Frangipani follows that cause it to use locks in a way that provides cache coherence and ensures nobody ever reads stale data from their cache. These are rules used in conjunction with the locks and cached data. The overriding invariant is that no workstation is allowed to hold any cached data unless it also holds the lock associated with that data. So basically: no cached data without the lock that protects it. Operationally, what this means is that before a workstation uses data, it first acquires the lock on the data from the lock server, and only after the workstation has the lock does it read the data from Petal and put it in its cache. So the sequence is: acquire the lock, then read from Petal. Until you have the lock you weren't caching the data; if you want to cache the data, you first have to get the lock, and only strictly afterwards read from Petal. And if you ever release a lock, the rule is that before releasing it, if you've modified the locked data in your cache, you first have to write the modified data back to Petal, and only when Petal says, yes, I got the data, are you allowed to release the lock, that is, give it back to the lock server.
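Here's a small Go sketch of those rules: the shape of a workstation's lock table, and the two orderings just described, acquire the lock before reading from Petal, and write dirty data back to Petal before releasing. The type names and the helper functions (acquireFromLockServer, readPetal, writePetal, releaseToLockServer) are made-up stand-ins for illustration, not Frangipani's real interfaces.

```go
// Sketch of a per-workstation lock table under the invariant
// "no cached data without the lock that protects it".
package main

type lockState int

const (
	idle lockState = iota // lock held, but no system call is using it right now
	busy                  // lock held and an operation is actively using it
)

// lockEntry is one row of the workstation's lock table: the state of a lock it
// holds and the cached file or directory contents that the lock covers.
type lockEntry struct {
	state  lockState
	dirty  bool   // cached copy modified but not yet written back to Petal
	cached []byte // cached contents covered by this lock
}

type workstation struct {
	locks map[string]*lockEntry // lock name (really an i-number) -> entry
}

// use acquires the lock first (if not already held) and only then reads from
// Petal, so there is never cached data without the protecting lock.
func (ws *workstation) use(name string) *lockEntry {
	e, ok := ws.locks[name]
	if !ok {
		acquireFromLockServer(name)             // 1. get the lock first...
		e = &lockEntry{cached: readPetal(name)} // 2. ...then read and cache
		ws.locks[name] = e
	}
	e.state = busy
	return e
}

// release writes any dirty data back to Petal before giving the lock back.
func (ws *workstation) release(name string) {
	e := ws.locks[name]
	if e.dirty {
		writePetal(name, e.cached) // 1. write modified data back first...
	}
	delete(ws.locks, name)
	releaseToLockServer(name) // 2. ...and only then release the lock
}

// Stubs standing in for the lock-server and Petal RPCs.
func acquireFromLockServer(name string)   {}
func releaseToLockServer(name string)     {}
func readPetal(name string) []byte        { return nil }
func writePetal(name string, data []byte) {}

func main() {
	ws := &workstation{locks: map[string]*lockEntry{}}
	e := ws.use("inode-42")       // acquire, then read from Petal
	e.cached = []byte("new data") // modify only in the local cache
	e.dirty = true
	e.state = idle          // system call done; keep the lock lazily
	ws.release("inode-42")  // write back to Petal, then give up the lock
}
```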
So the sequence is always: first write the cached data back to Petal, the storage system, then release the lock and erase the entry and the cached data from that workstation's lock table. The protocol between the workstations and the lock server consists of four different kinds of messages; this is the coherence protocol. You can think of them essentially as one-way network messages. There's a request message from a workstation to the lock server, which says, hey lock server, I'd like to get this lock. When the lock server is willing to give you the lock (and of course, if somebody else holds it, it can't immediately give it to you), when the lock becomes free the lock server responds with a grant message back to the workstation, in response to the earlier request. Well, if you request a lock and someone else holds it right now, that other workstation has to first give up the lock; we can't have two owners of the same lock. So how are we going to get that workstation to give up the lock? As I said, when a workstation is actually using a lock, actively reading or writing something, it has the lock and has marked it busy. But workstations don't ordinarily give up their locks when they're done using them. If I create a file and the create system call finishes, my workstation will still own the lock for that file; it'll just be in state idle instead of busy, and as far as the lock server is concerned, my workstation still has the lock. The reason for this, the reason to be lazy about handing locks back to the lock server, is that if I create a file called Y on my workstation, I'm almost certainly about to use Y for other purposes, like writing some data to it or reading from it. So it's extremely advantageous for the workstation to accumulate locks for all of its recently used files and not give them back unless it really has to. In the common case, in which I use a bunch of files in my home directory and nobody on any other workstation ever looks at them, my workstation ends up accumulating dozens or hundreds of locks in the idle state for my files. But if somebody else does look at one of my files, they need to get the lock first, and I have to give it up. The way that works is that if the lock server receives a lock request and sees in its table that the lock is currently owned by workstation one, the lock server sends a revoke message to the workstation that currently owns the lock, saying, look, somebody else wants it, please give it up. When a workstation receives a revoke request, if the lock is idle, then if the cached data is dirty, the workstation will first write the dirty, modified data from its cache back to Petal, because the rule that says never cache data without a lock means we have to write the modified data back to Petal before releasing.
So if the lock is idle, the workstation first writes back the data if it's modified, and only then sends a message back to the lock server saying, OK, we give up this lock. The response to a revoke sent to a workstation is that the workstation sends a release. Of course, if the workstation gets a revoke while it's actively using a lock, while it's in the middle of a delete or rename or something that affects the locked file, the workstation will not give up the lock until it's done using it, until it's finished that file system operation. Then the workstation's lock state transitions to idle, and at that point it can pay attention to the revoke request and, after writing to Petal if need be, release the lock. All right, so this is the coherence protocol, or rather a simplification of the coherence protocol, that Frangipani uses. As I mentioned before, what's missing from all this is the fact that locks can be either exclusive for writers or shared for read-only access. And just as Petal is a block server and doesn't understand anything about file systems, the lock server doesn't either: it just has a table of opaque lock identifiers and who owns them. It's Frangipani that knows that the lock it associates with a given file has such-and-such an identifier, and as it happens Frangipani uses Unix-style i-numbers, the numbers associated with files, rather than names, to identify locks. So just to make this coherence protocol concrete, and to illustrate the relationship between Petal operations and lock server operations, let me run through what happens if one workstation modifies some file system data and then another workstation needs to look at it. We have two workstations and the lock server. Say workstation one wants to read and then modify file Z. Before it can even read anything about Z from Petal, it must first acquire the lock for Z, so it sends a request to the lock server. Maybe nobody holds the lock, or the lock server has never heard of it, so the lock server makes a new entry for Z in its table and returns a grant: you now own the lock for Z. At this point the workstation, since it has the lock on file Z, is entitled to read information about Z from Petal, so it reads Z from Petal, and indeed workstation one can modify it locally in its cache. At some later point, maybe the human being sitting in front of workstation two also wants to read file Z. Workstation two doesn't have the lock for file Z, so the very first thing it does is send a message to the lock server saying, I'd like the lock for file Z. The lock server knows it can't reply with a grant yet because somebody else has the lock, namely workstation one, so the lock server sends a revoke to workstation one. Workstation one is not allowed to give up the lock until it writes any modified data back to Petal, so it now writes any modified content, the actual contents of the file if it was modified, back to Petal. Only then is workstation one allowed to send a release back to the lock server. The whole exchange, including the grant that comes next, is laid out in the sketch below.
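Here is a toy Go sketch of that exchange: the four coherence-protocol messages (request, grant, revoke, release) playing out just as in the example. Channels stand in for the network, and all the names are illustrative assumptions; the only thing it's meant to show is the ordering, where a request for a held lock triggers a revoke to the current owner, the owner writes its dirty data back to Petal and releases, and only then does the lock server grant.

```go
// Toy model of the request/grant/revoke/release message flow for lock Z.
package main

import "fmt"

type msg struct {
	kind string // "request", "grant", "revoke", or "release"
	lock string
	from string
}

func main() {
	toServer := make(chan msg, 10)
	toWS1 := make(chan msg, 10)
	toWS2 := make(chan msg, 10)

	owner := map[string]string{"Z": "WS1"} // WS1 already holds the lock on Z

	// Workstation 2 asks the lock server for the lock on Z.
	toServer <- msg{kind: "request", lock: "Z", from: "WS2"}

	// Lock server: Z is owned by WS1, so it must be revoked first.
	req := <-toServer
	fmt.Printf("server: %s for %s from %s; owner is %s, send revoke\n",
		req.kind, req.lock, req.from, owner[req.lock])
	toWS1 <- msg{kind: "revoke", lock: req.lock}

	// Workstation 1: write modified data for Z back to Petal, then release.
	rev := <-toWS1
	fmt.Println("WS1: got", rev.kind, "-> write dirty data for", rev.lock, "to Petal")
	toServer <- msg{kind: "release", lock: rev.lock, from: "WS1"}

	// Lock server: now it can record the new owner and grant the lock.
	rel := <-toServer
	owner[rel.lock] = "WS2"
	toWS2 <- msg{kind: "grant", lock: rel.lock}

	grant := <-toWS2
	fmt.Println("WS2: got", grant.kind, "for", grant.lock, "-> now read Z from Petal")
}
```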
The lock server must have kept a record in its tables saying somebody is waiting for lock Z, and as soon as its current holder releases it, we need to reply. So receipt of this release causes the lock server to update its tables and finally send the grant to workstation two. At this point workstation two can finally read file Z from Petal. So this is how the cache coherence protocol plays out: it ensures that nobody reads the data until anybody who might have modified it privately in their cache has first written the data back to Petal. The locking machinery forces reads to see the latest write; that's what's going on. There are a number of optimizations possible in these kinds of cache coherence protocols, and I've actually already described one: this idle state, the fact that workstations hold on to locks they're not using right now instead of immediately releasing them, is already an optimization over the simplest protocol you could think of. The other main optimization Frangipani has is the notion of shared read locks versus exclusive write locks. If lots and lots of workstations need to read the same file but nobody's writing it, they can all have a read lock on that file. If somebody does come along and try to write this widely cached file, they first need to revoke everybody's read locks so that everybody gives up their cached copy, and only then is the writer allowed to write the file. But that's OK, because nobody has a cached copy anymore, so nobody could be reading stale data while it's being written. All right, so that's the cache coherence story, driven by the locking protocol. Next up in our list of challenges... yes, a question: do workstations write data back if the lock hasn't been used for a long time? Yes, that's a good question. In fact, there's a risk in the scheme I described that if I modify a file on my workstation and nobody else ever reads it, the only copy of the modified file, which maybe has some precious information in it, is in the cache in RAM on my workstation. If my workstation were to crash and we hadn't done anything special, it would have crashed with the only copy of the data, and the data would be lost. So to forestall this, no matter what, all these workstations write back any modified stuff in their caches every 30 seconds. So if my workstation crashes unexpectedly, I may lose the last 30 seconds of work, but no more. This actually just mimics the way ordinary Linux or Unix works. Indeed, a lot of the story here is about trying, in the context of a distributed file system, to mimic the properties that ordinary Unix-style workstations have, so that users won't be surprised by Frangipani; it just works much the same way they're already used to. All right, so our next challenge is how to get atomicity. That is, how to make it so that even though an operation like creating a file is complex (it involves marking a new inode as allocated, initializing the inode, which is a little piece of data that describes each file, maybe allocating space for the file, and adding a new name in the directory for my new file, so there are many steps and many things that have to be updated), we don't want anybody to see any of the intermediate steps.
We want other workstations to see the file either not exist or completely exist, but nothing in between. We want atomic multi-step operations. All right, so in order to make multi-step operations like file create or rename or delete atomic as far as other workstations are concerned, Frangipani implements a notion of transactions, a complete database-style transaction system inside it, again driven by the locks. Furthermore, this is actually a distributed transaction system, and we'll hear more about distributed transaction systems later in the course; they're a very common requirement in distributed systems. The basic story is that Frangipani makes it so that other workstations can't see my modifications until I'm completely done with an operation, by first acquiring the locks on all the data that I'm going to read or write during the operation, and not releasing any of those locks until it's finished with the complete operation and, following the coherence rule, has written all of the modified data back to Petal. So before I do an operation like a rename, moving a file from one directory to another, which after all modifies both directories, and during which I don't want anybody to see the file in neither directory or in some other in-between state, Frangipani first acquires all the locks for the operation, then does all the updates and writes to Petal, and then releases. Since we already have the lock server anyway to drive the cache coherence protocol, just by making sure we hold all the locks for the entire duration of an operation we get these indivisible, atomic transactions almost for free. And that's basically all there is to say about making operations atomic in Frangipani: hold all the locks. An interesting thing about this use of locks is that Frangipani is using locks for two almost opposite purposes. For cache coherence, Frangipani uses the locks to make sure that writes are visible immediately to anybody who wants to read them; that's all about using locks to make sure people can see writes. This use of locks is all about making sure people don't see the writes until I'm finished with an operation, because I hold all the locks until all the writes have been done. So they're playing an interesting trick here by reusing the locks they would have needed anyway for transactions to also drive cache coherence. All right, so the next interesting thing is crash recovery. The most interesting possibility we need to cope with is that a workstation crashes while holding locks and while in the middle of some complex set of updates. That is, a workstation acquired a bunch of locks and is writing a whole lot of data, maybe to create or delete files. It has possibly written some of those modifications back to Petal, because maybe it was about to release locks or had been asked by the lock server to release locks, so it has done some of the writes to Petal for its complex operations but not all of them, and then it crashes before giving up the locks. That's the interesting situation for crash recovery. There are a number of approaches that don't work very well for a workstation crashing while holding locks.
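Before getting to those, here's a quick Go sketch of the hold-all-the-locks discipline just described for atomicity, using a rename between two directories as the example. The helpers (acquire, updateCached, writeBackToPetal, release) are made-up stand-ins, not Frangipani's real interfaces; the point is only the ordering: acquire every lock the operation touches, do all the updates, write the modified data back to Petal (with the log entry first, per the logging scheme described below), and only then release any lock. In the real system the locks would usually just drop to the idle state and the write-back would be deferred until a revoke or the periodic flush; this shows the order that must hold before any lock is actually given back.

```go
// Sketch of an atomic multi-step operation: hold every lock for the whole
// operation, write back before releasing. All helper names are illustrative.
package main

import "fmt"

func acquire(lock string)          { fmt.Println("acquire lock", lock) }
func release(lock string)          { fmt.Println("release lock", lock) }
func updateCached(what string)     { fmt.Println("update in cache:", what) }
func writeBackToPetal(what string) { fmt.Println("write back to Petal:", what) }

// rename moves file f from directory src to directory dst so that other
// workstations see either the old state or the new state, never the middle.
func rename(f, src, dst string) {
	locks := []string{src, dst, f} // every piece of data the operation touches

	for _, l := range locks { // 1. acquire all the locks up front
		acquire(l)
	}

	// 2. do every update locally in the cache while holding all the locks
	updateCached("remove " + f + " from " + src)
	updateCached("add " + f + " to " + dst)

	// 3. write the modified data back to Petal before any lock is given up
	writeBackToPetal(src)
	writeBackToPetal(dst)

	for _, l := range locks { // 4. only now release, so nobody sees a half-done rename
		release(l)
	}
}

func main() {
	rename("paper.txt", "dir-A", "dir-B")
}
```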
Coming back to crashes: one thing that doesn't work very well is to just observe that the workstation crashed and release all its locks, because suppose it had done something like create a new file and had written the file's directory entry, its name, back to Petal, but hadn't yet written the initialized inode that describes the file. The inode may still be filled with garbage or a previous file's information in Petal, and yet we've already written the directory entry. So it's not okay to simply release a crashed workstation's locks. Another thing that's not okay is to never release the crashed workstation's locks. That would be correct in a sense: if it crashed while in the middle of writing out some of its modifications, the fact that it hadn't written out all of them means it can't have released its locks yet, so simply not releasing them would hide the partial update from any readers, and nobody would ever be confused by seeing partially updated data structures in Petal. On the other hand, anybody who needed to use those files would then have to wait forever for the locks. So we absolutely have to give up the locks so that other workstations can use those same files and directories, but we have to do something about the fact that the workstation might have done some of the writes for its operations but not all of them. So Frangipani, like almost every other system that needs to implement crash-recoverable transactions, uses write-ahead logging. This is something we've seen at least one instance of already: in the last lecture, Aurora was also using write-ahead logging. The idea is that if a workstation needs to do a complex operation that involves updating many pieces of data in Petal, in the file system, the workstation will first, before it makes any writes to Petal, append a log entry to its log in Petal describing the full set of updates it's about to do. Only when that log entry describing the full set of updates is safely in Petal, where anybody else can see it, will the workstation start to send the writes for the operation out to Petal. So if a workstation has revealed even one of its writes for an operation to Petal, it must already have put the log entry describing the whole operation, all of the updates, into Petal. This is very standard; it's just a description of write-ahead logging. But there are a couple of odd aspects of how Frangipani implements it. The first is that in most transaction systems there's just one log, and all the transactions in the system sit in that one log in one place. So if there's a crash and more than one operation affected the same piece of data, all of those operations, for that piece of data and everything else, are right there in a single log sequence, and we know, for example, which is the most recent update to a given piece of data. Frangipani doesn't do that. It has per-workstation logs: one log per workstation, and they're separate logs. The other very interesting thing about Frangipani's logging is that the workstation logs are stored in Petal, not on a local disk. In almost every system that uses logging, the log is tightly associated with whatever computer is running the transactions; it's almost always kept on a local disk.
But for extremely good reasons, Frangipani workstations store their logs in Petal, in the shared storage. Each workstation has its own semi-private log, but it's stored in Petal, where if the workstation crashes, its log can be gotten at by other workstations. So the logs are in Petal: separate logs per workstation, stored in public, shared storage. It's a very interesting and unusual arrangement. All right, so we need to know roughly what's in a log entry. Unfortunately, the paper is not super explicit about the format of a log entry, but we can piece it together. The paper does say that each workstation's log sits in a known place, a known range of block numbers, in Petal, and furthermore that each workstation uses its log space in Petal in a circular way: it writes log entries starting from the beginning, and when it hits the end, it goes back and reuses its log space from the beginning of its log area. Of course, that means workstations need to be able to clean their logs, to ensure a log entry is no longer needed before its space is reused; I'll talk about that in a bit. Each log consists of a sequence of log entries, and each log entry has a log sequence number, just an increasing number; each workstation numbers its log entries one, two, three, four, five. The immediate reason for this, maybe the only reason the paper mentions, is that the way Frangipani detects the end of a workstation's log, if the workstation crashes, is by scanning forward in its log in Petal until the sequence stops increasing; it then knows the log entry with the highest log sequence number must be the very last entry. So it needs to be able to detect the end of the log, hence the log sequence number. And then I believe each log entry has an array of descriptions of the modifications, all the different modifications that were involved in a particular operation, where an operation is a file system system call. Each element of the array has a block number, a block number in Petal, a version number, which we'll get to in a bit, and the data to be written. There can be a bunch of these, since an operation might touch more than one piece of data in the file system. One thing to notice is that the log only contains information about changes to metadata, that is, to directories and inodes and allocation bitmaps in the file system. The log doesn't contain the data written to the contents of files; it doesn't contain the user's data. It just contains enough information to make the file system structures recoverable after a crash. So for example, if I create a file called f in a directory, that results in a new log entry with two little descriptions of modifications in it: one describing how to initialize the new file's inode, and another describing the new name to be placed in the new file's directory. One thing I didn't mention: of course, the log is really a sequence of these log entries, and initially, in order to do modifications as fast as possible, a Frangipani workstation's log is stored only in the workstation's own memory, and won't be written to Petal until it has to be.
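Based on that description, here's a guess at what one log entry might look like, as a small Go sketch. The paper doesn't spell out the exact layout, so these struct and field names are assumptions; the shape is what matters: a per-workstation log sequence number plus an array of (Petal block number, version number, new bytes) updates, one entry per file-system operation, covering metadata only.

```go
// Sketch of a Frangipani-style log entry; field names are assumptions.
package main

import "fmt"

// blockUpdate describes one metadata modification within an operation.
type blockUpdate struct {
	blockNum uint64 // which Petal block to write
	version  uint64 // new version number for that piece of metadata
	data     []byte // the bytes to write (metadata only, never file contents)
}

// logEntry describes one complete file-system operation (one system call).
type logEntry struct {
	lsn     uint64        // log sequence number, increasing per workstation
	updates []blockUpdate // all the writes that make up this operation
}

func main() {
	// Creating file "f" touches two pieces of metadata: the new inode, and the
	// directory that now names it, so the entry carries two updates.
	create := logEntry{
		lsn: 17,
		updates: []blockUpdate{
			{blockNum: 1001, version: 1, data: []byte("initialized inode for f")},
			{blockNum: 2002, version: 4, data: []byte("directory entry: f -> inode 1001")},
		},
	}
	fmt.Printf("log entry %d with %d updates\n", create.lsn, len(create.updates))
}
```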
The reason the log stays in memory as long as possible is that writing anything to Petal, including log entries, takes a long time, so we want to avoid writing log entries to Petal, as well as dirty data or modified blocks, for as long as we can. So here's the real, full story for what happens when a workstation gets a revoke message from the lock server saying it has to give up a certain lock; this is the same revoke message from the coherence protocol. If the workstation gets a revoke, the series of steps it must take is: first, write any parts of its log that are only in memory and haven't yet been written to Petal; it has to make sure its log is complete in Petal as the first step. So it writes its log, and only then does it write any updated blocks that are covered by the lock being revoked. So: write the log, write the modified blocks just for that revoked lock, then send a release message. The reason for this strict ordering is that these are modifications to the file system data structures, and if we were to crash midway through writing these blocks, we want some other workstation to have enough information to complete the set of modifications this workstation had started, even though it crashed and maybe didn't finish doing these writes. Writing the log first is what allows us to accomplish that: the log records are a complete description of what these modifications are going to be. So first we write the complete log to Petal, and then the workstation can start writing its modified blocks. Maybe it crashes, maybe it doesn't, hopefully not. If it finishes writing its modified blocks, it can send a release back to the lock server. So if my workstation has modified a bunch of files and then some other workstation wants to read one of those files, this is the sequence: the lock server asks me for my locks, my workstation writes back its log, then writes the dirty modified blocks to Petal, and only then releases. Then the other workstation can acquire the lock and read those blocks. That's the sequence if a crash doesn't happen, but of course it's only interesting if a crash does happen. Yes, a question: if we have a transaction that involves two locks and only one of them is revoked, do we have to write the whole log, and do we write back only the items that go with that one lock? OK, so for the log, you're absolutely right: the workstation writes its entire log. But because it's only giving up the lock for Z, it only needs to write back the data covered by Z. So you write the whole log, but just the data covered by the lock we needed to give up, and then we can release that lock. Maybe writing the whole log is overkill; here's an optimization that you might or might not care about.
If the last modification for file Z, for the lock we're giving up, is this one, but subsequent entries in my log didn't modify that file, then I could write just this prefix of my in-memory log back to Petal and be lazy about writing the rest, and that might save me some time. It's actually not clear how much time that saves, since we have to write the log back at some point anyway; I think Frangipani just writes the whole thing. Okay, so now we can talk about what happens when a workstation crashes while holding locks. It needed to modify something, rename a file, create a file, whatever; it acquired all the locks it needed and modified some stuff in its own cache to reflect these operations; maybe it has written some stuff back to Petal; and then it crashed, possibly midway through writing. There are a number of points at which it could crash. But because this is always the sequence, because before writing modified blocks from the cache back, Frangipani will always have written its log to Petal first, that means that if a crash happens, it's either while the workstation is writing its log to Petal but before it has written any modified file or directory blocks back, or while it's writing the modified blocks back, and therefore definitely after it has written its entire log, or maybe the crash happened after it had completely finished all of this. So because of the sequencing, there's only a limited number of scenarios we have to worry about for the crash. OK, so the workstation has crashed, and to make it interesting, it crashed while holding locks. The first thing that happens is the lock server sends it a revoke request and gets no response; that's what triggers anything at all. If nobody ever asks for the lock, basically nobody is ever going to notice that the workstation crashed. So let's assume somebody else wanted one of the locks that the workstation held when it crashed. The lock server sends a revoke and never gets a release back from the workstation. After a certain amount of time has passed (it turns out Frangipani locks use leases, for a number of reasons), after this lease time has expired, the lock server decides that the workstation must have crashed, and it initiates recovery. What that really means is that the lock server tells some other live workstation: look, workstation one seems to have crashed, please go read its log, replay all of its recent operations to make sure they're complete, and tell me when you're done. Only then will the lock server release the locks. And this is the point at which it's critical that the logs are in Petal, because some other workstation is going to inspect the crashed workstation's log in Petal. All right, so what are the possibilities? One is that the workstation crashed before it ever wrote anything back. In that case the workstation doing recovery will look at the crashed workstation's log, see that there's nothing in it at all, do nothing, and then release the locks the crashed workstation held. Now, the crashed workstation may have modified all kinds of things in its cache, but if it didn't write anything to its log area, then it couldn't possibly have written any of the blocks that it modified during those operations.
And so while we will have lost the last few operations that the workstation did, the file system will be consistent with a point in time before the crashed workstation started to modify anything, because apparently the workstation never even got to the point of writing log entries. The next possibility is that the workstation wrote some log entries to its log area. In that case, the recovering workstation will scan forward from the beginning of the log until it stops seeing the log sequence numbers increasing, because that's the point at which the log must end. The recovering workstation will look at each of these descriptions of a change and basically play that change back into Petal. It will say, there's a certain block number in Petal that needs to have certain data written to it, which is just the same modification that the crashed workstation did in its own local cache. So the recovering workstation will consider each of these and replay each of the crashed workstation's log entries back into Petal. When it has done that all the way to the end of the crashed workstation's log as it exists in Petal, it tells the lock server, and the lock server releases the crashed workstation's locks. That brings Petal up to date with some prefix of the operations the crashed workstation had done before crashing, maybe not all of them, because maybe it didn't write out all of its log. The recovering workstation won't replay anything in a log entry unless it has the complete log entry in Petal, and implicitly that means there must be some sort of checksum arrangement so that the recovering workstation can tell that a log entry is complete and not partially written. That's quite important, because the whole point of this is to make sure that only complete operations are visible in Petal, never a partial operation. It's also important that all the writes for a given operation are grouped together in a single log entry, so that on recovery the recovering workstation can do all of the writes for an operation or none of them, but never half of them. OK, so that's what happens if the crash occurs while the log is being written back to Petal. Another interesting possibility is that the workstation crashed after writing its log and also after writing some of the blocks back itself. Then, skimming over some extremely important details which I'll get to in a moment, what will happen is again that the recovering workstation, which of course doesn't really know at what point the crashed workstation died (all it sees is some log entries), will replay the log in the same way. More or less what's going on is that even for modifications that were already done in Petal, replaying them just writes the same data to the same place again, presumably not really changing the value for the writes that had already been completed; but for the writes the crashed workstation hadn't done, and we're not sure which those are, the replay will actually change the data and complete the operations. That's not actually the full story, as it turns out. Today's question sets up a particular scenario for which a little bit of added complexity is necessary.
In particular, the possibility is that the crashed workstation had actually gotten through this entire sequence before crashing, and in fact released some of its locks, so that it wasn't the last workstation to modify a particular piece of data. An example of this: suppose some workstation executes a delete, deleting say a file F in directory D, and then there's some other workstation which, after this delete, creates a new file with the same name, though of course it's a different file now. So workstation one deletes, workstation two later creates a file with the same name, and after that workstation one crashes, so we're going to need to do recovery on workstation one's log. At this point maybe there's a third workstation doing the recovery, so workstation three is running recovery on workstation one's log. Well, it could be that this delete is still in workstation one's log. Workstation one crashed, workstation three is going to look at its log and replay all the updates in it, and the entry for this delete may still be in workstation one's log. So unless we do something clever, workstation three is going to replay the delete, erase the relevant entry from the directory, and thus delete the file, even though it's a different file that workstation two created afterwards. That's completely wrong. The outcome we want is: workstation one deleted a file, so that file should be deleted, but a new file that happens to have the same name should not be deleted just because there was a crash and a recovery, because the create happened after the delete. So we cannot just replay workstation one's log without further thought, because a log entry in workstation one's log may essentially be out of date by the time it's replayed during recovery; some other workstation may have modified the same data in some other way subsequently. We can't blindly replay the log entries, and this is today's question. The way Frangipani solves this is by associating version numbers with every piece of data in the file system as stored in Petal, and also associating version numbers with every update described in the log. First, in Petal: every piece of metadata, every inode, every block that holds, for example, the contents of a directory, every block of metadata stored in Petal has a version number. When a workstation needs to modify a piece of metadata, it first reads that metadata from Petal into its memory and looks at the existing version number; then, when it's creating the log entry describing its modification, it puts the existing version number plus one into the log entry. And if it does get a chance to write the data back, it writes the data back with the new, increased version number. So if a workstation did manage to write some data back before it crashed, then the version number stored in Petal for the affected metadata will be at least as high as the version number stored in the log entry, and it will be higher if some other workstation subsequently modified it.
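Here's a Go sketch of the replay rule that follows from those version numbers: an update from the crashed workstation's log is applied only if its version is strictly newer than what Petal already stores for that block. The blockUpdate and logEntry types follow the earlier log-entry sketch, and petalVersion / petalApply are illustrative stand-ins, not Frangipani's real interface. In the delete-then-create scenario above, the directory block would already carry the higher version written by workstation two, so the stale delete in workstation one's log would be skipped.

```go
// Sketch of version-checked log replay during recovery.
package main

import "fmt"

type blockUpdate struct {
	blockNum uint64
	version  uint64
	data     []byte
}

type logEntry struct {
	lsn     uint64
	updates []blockUpdate
}

// versions stands in for the version numbers stored in Petal's metadata blocks.
var versions = map[uint64]uint64{}

func petalVersion(blockNum uint64) uint64 { return versions[blockNum] }
func petalApply(u blockUpdate)            { versions[u.blockNum] = u.version }

// replay applies a crashed workstation's log to Petal, skipping any update
// that Petal has already seen, either because the crashed workstation wrote
// it back before dying, or because another workstation later modified it.
func replay(log []logEntry) {
	for _, e := range log {
		for _, u := range e.updates {
			if petalVersion(u.blockNum) >= u.version {
				// Petal already holds this update or something newer; the
				// lock must have been released after write-back, so skip it.
				fmt.Printf("skip block %d (log v%d, Petal v%d)\n",
					u.blockNum, u.version, petalVersion(u.blockNum))
				continue
			}
			fmt.Printf("apply block %d at v%d\n", u.blockNum, u.version)
			petalApply(u)
		}
	}
}

func main() {
	// The directory block (2002) was later rewritten at version 4 by another
	// workstation, so the crashed workstation's version-3 update is skipped,
	// while its version-1 inode update, never written back, is applied.
	versions[2002] = 4
	replay([]logEntry{{lsn: 17, updates: []blockUpdate{
		{blockNum: 1001, version: 1, data: []byte("inode")},
		{blockNum: 2002, version: 3, data: []byte("stale directory entry")},
	}}})
}
```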
So in the delete and create scenario, what will actually happen is that workstation three will see that the log entry for workstation one's delete operation has a particular version number stored in it, associated with the modification to the directory; let's say the log entry says the new version number created by this update to the directory is version three. In order for workstation two to subsequently change the directory, that is, to add a file F, workstation one must have given up the lock on the directory before it crashed, and that's probably also why the log entry even exists in Petal. So workstation one must have given up the lock; apparently workstation two got the lock, read the current metadata for the directory, and saw that the version number was now three. And when workstation two writes this data, it sets the version number for the directory in Petal to four. OK, so that means the log entry for the delete operation has version number three in it, while the directory in Petal is at version four. Now, when the recovery software on workstation three replays workstation one's log, it looks at the version numbers first. It looks at the version number in the log entry, reads the block from Petal, and looks at the version number in the block. If the version number in the block in Petal is greater than or equal to the version number in the log entry, the recovery software simply ignores that update in the log entry and doesn't apply it, because clearly the block had already been written back by the crashed workstation and then maybe subsequently modified by other workstations. So the replay is selective, based on this version number: recovery only replays a write from the log if the write in the log entry is newer than the data already stored in Petal. One somewhat irritating question here is that workstation three is running this recovery software while other workstations are still actively reading and writing the file system, holding locks, and talking to Petal. The replay is going to go on while workstation two, which doesn't know anything about the recovery, is still active, and indeed workstation two may hold the lock for this directory while recovery is going on. So recovery may be scanning the log and need to read or write this directory's data in Petal while workstation two still has the lock on that data. The question is, how do we sort this out? One possibility, which actually turns out not to work, is for the recovery software to first acquire the lock on anything it needs to look at in Petal while it's replaying the log. One good reason that doesn't work is that we could be running recovery after a system-wide power failure, for example, in which all knowledge of who had which locks is lost, and therefore we can't write the recovery software to participate in the locking protocol, because all knowledge of what's locked and what isn't may have been lost in the power failure. But luckily, it turns out the recovery software can just go ahead and read or write data in Petal without worrying at all about locks.
The reason is this: suppose the recovery software wants to replay this log entry and possibly modify the data associated with this directory; it just goes ahead and reads whatever is there for the directory out of Petal right now. There are really only two cases: either the crashed workstation one had given up its lock or it hadn't. If it hadn't given up its lock, then nobody else can have the directory locked, and so there's no problem. If it had given up its lock, then before it gave up the lock it must have written its data for the directory back to Petal, and that means the version number stored in Petal must be at least as high as the version number in the crashed workstation's log entry. Therefore, when the recovery software compares the log entry's version number with the version number of the data in Petal, it will see that the log entry's version number is not higher and won't replay the log entry. So yes, the recovery software will have read the block without holding the lock, but it's not going to modify it, because if the lock was released, the version number will be high enough to show that the log entry had already been processed into Petal before the crashed workstation crashed. So there's no locking issue. All right. That's the main guts of what Frangipani is up to: its cache coherence, its distributed transactions, and its distributed crash recovery. The other thing to think about is that the paper talks a bit about performance. It's actually very hard, after more than 20 years, to interpret the performance numbers, because they ran them on very different hardware in a very different environment from anything you see today. Roughly speaking, the numbers they show are that as you add more and more Frangipani workstations, the system basically doesn't get slower. That is, each new workstation, even if it's actively doing file system operations, doesn't slow down the existing workstations. So in that sense, at least for the applications they look at, the system was giving them reasonable scalability: they could add more workstations without slowing existing users down. Looking back, although Frangipani is full of very interesting techniques that are worth remembering, it didn't have much influence on the evolution of storage systems. Part of the reason is that the environment it's aimed at, small work groups of people sitting in front of workstations on their desks and sharing files, while it still exists in some places, isn't really where the action is in distributed storage. The real action has moved into big data centers, big websites, and big-data computations. In that world, first of all, the file system interface just isn't very useful compared to databases. People really like transactions in the big-website world, but they need them for very small items of data, the kind of data you would store in a database rather than the kind you would naturally store in a file system. So you can see echoes of some of this technology in modern systems, but it usually takes the form of a database. The other big kind of storage out there is storage for big files, as needed for big-data computations like MapReduce. And indeed GFS, which to some extent looks like a file system, is the kind of storage system you want for MapReduce.
But for GFS and for big-data computations, Frangipani's focus on local caching in workstations, and its very close attention to cache coherence and locking, is just not very useful. For bulk data reads and writes, caching is typically not useful at all; if you're reading through 10 terabytes of data, it's almost counterproductive to cache it. So a lot of Frangipani's focus has been passed by, a little bit, by time. It's still useful in some situations, but it's not what people are really thinking about when designing new systems. All right, that is it.