All right, welcome back everybody. We're going to continue where we left off: we were talking about file systems and transactional support to get better behavior. But before we go there, I wanted to remind you a little bit of a discussion that we were having. We were talking about RAID as a very simple option to account for disk failures. And then I talked about a more general option called Reed-Solomon codes, which are very much like polynomial fitting. The polynomials are done over Galois fields, but it's just like the polynomial fitting you remember from your original algebra, whenever you first got that. The point is that with Reed-Solomon codes, we can choose an arbitrary amount of redundancy. In fact, we can set up something that's almost like a hologram, where we send out n little fragments and as long as we get any m back, we can reconstruct the data, which can make it extremely hard to destroy. And I gave some examples. For instance, if the symbols in the Galois field are 16 bits and we have six disks and we wanna make essentially a RAID 6 out of this, tolerating two failures, we just split the data into four chunks, encode it 16 bits at a time, generating six points on the polynomial. As a result, for every four chunks of data, we get six chunks that we record on separate disks, and as long as we get back four of the six disks, we win. More interesting, though, is the extreme version of this. For instance, we could split data into four chunks and produce 16 total chunks. That means that any four of the 16 is going to recover our data. And what's nice about this technique is that it's not copying. If we copied the data to 16 places, we would have 16 times the overhead. This scheme has four times the overhead, okay? And I showed this brief graph here, which is kind of inspirational.
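To make the "any m of n" property concrete, here's a hedged sketch. Real Reed-Solomon codes work over a Galois field like GF(2^16); this toy version works modulo a large prime instead, which gives the same any-m-of-n recovery property with much simpler arithmetic. All names here are illustrative, not from any real RAID implementation.

```python
# Toy "any m of n" erasure coding via polynomial interpolation mod a prime.
# Real Reed-Solomon uses GF(2^16); a prime field is a simplification.
P = 2**31 - 1  # a Mersenne prime, comfortably larger than 16-bit symbols

def encode(chunks, n):
    """Treat the m data chunks as polynomial coefficients; emit n points."""
    return [(x, sum(c * pow(x, i, P) for i, c in enumerate(chunks)) % P)
            for x in range(1, n + 1)]

def decode(points, m):
    """Lagrange-interpolate the degree-(m-1) polynomial from any m points."""
    points = points[:m]
    coeffs = [0] * m
    for j, (xj, yj) in enumerate(points):
        # Build the j-th Lagrange basis polynomial, scaled by yj.
        basis, denom = [1], 1
        for k, (xk, _) in enumerate(points):
            if k == j:
                continue
            new = [0] * (len(basis) + 1)      # multiply basis by (x - xk)
            for i, b in enumerate(basis):
                new[i] = (new[i] - b * xk) % P
                new[i + 1] = (new[i + 1] + b) % P
            basis = new
            denom = denom * (xj - xk) % P
        scale = yj * pow(denom, P - 2, P) % P  # divide via Fermat inverse
        for i, b in enumerate(basis):
            coeffs[i] = (coeffs[i] + b * scale) % P
    return coeffs

data = [12, 345, 6789, 1011]            # four 16-bit data chunks
shares = encode(data, 6)                 # six "disks": RAID-6-like, m=4, n=6
assert decode(shares[2:], 4) == data     # any four of the six recover the data
```

The overhead is n/m: six shares for four chunks is 1.5x, and the extreme 16-of-64 scheme from the graph is 4x, versus 16x for plain replication.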
The idea here is that you put your data out and spread it widely around the world. What these graphs are showing you is, for instance, that this top one is four copies of the data, which is a factor of four overhead. And this six says that every six months we look at our data out there on the net, and if we don't have four copies of it still, we make up for the lost ones, and we just keep refreshing over and over again. And if we do that, then under certain assumptions we'll lose about 3% of our data per year. Now, don't worry about those assumptions; what's more interesting from this graph is that if instead we do this erasure coding that I told you about with Reed-Solomon codes, where we send out 64 fragments, any 16 of which are sufficient, and we reconstruct every six months, now we're down to 10 to the minus 35 blocks of data lost per year. That extreme difference is a very interesting one, and it's purely based on this type of encoding. All right, now there are some strong assumptions here about independence of failures and so on, but we could talk about those offline. Now, the other thing, just to remind you about transactions: the idea of starting a transaction, doing some things, and ending the transaction is really going after some properties that we want in general from transactions. These are usually called the ACID properties. One is atomicity: either everything in the transaction happens or nothing happens. Consistency means that when the transactions do happen, the data keeps its integrity. Isolation basically says that one transaction appears to have happened by itself, regardless of the fact that there are many of them going on simultaneously. And durability says that if a transaction commits, then its effects persist despite crashes.
Regular databases, if you were to take a database class, would talk about how to get all of these, and actually in 262 we talked about that as well. But what's pretty interesting for us is that atomicity and durability become very important for file systems. We would like to make sure that when a machine crashes, the file system itself stays in a clean state. And so that was kind of where we were ending up last time. We had talked about how the fast file system tries to order writes and updates in a way that's likely to be recoverable. But we can do better, and we start with a log. This log here, for instance, shows you a series of things like grabbing $10 from an account, $7 from another one, et cetera. These actions need to either all happen or not happen. And they need to do that even though all this other stuff's going on, so you might have other transactions going on simultaneously. The way we do that is pretty simple. We put a mark in this log saying start, we write everything that we wanted of our transaction, and then we write end. Until we say end or commit, it doesn't matter that this stuff's in the log; it's gonna get ignored under crashes. And the moment we do that single write, all of a sudden this now becomes valid, and it will have happened atomically. If we can use the idea of a log to augment our file systems, then maybe we can get this atomicity property, where now, for instance, instead of adding and subtracting from an account, you can imagine that these are operations like allocating new inodes, adding something to a directory, writing some data. And the commit says, oh, all of those things now happened; or, if that doesn't get there and we crash, then none of those things happened. That's gonna be our goal here: to add atomicity to our updates. And journaling file systems are exactly that.
So instead of modifying all the data structures on the disk directly, like we've been talking about with the fast file system and EXT2 and so on, what we do is write our intention to make changes in the log first. We put a set of those intentions in there and we append them to the log. So it's an append-only record, and we push that log to disk, and then we write a commit, okay? And as soon as that commit is pushed to disk, our updates have occurred atomically. Now notice what I didn't say: we haven't actually updated the file system. What we've done is put something in the log, put an atomic commit, and now the file system needs to know to look into the journal for things that have been committed but not yet put in the file system, so that we get the semantics as if this was an atomic commit. And of course that's really easy as long as the system doesn't crash, because we have all the in-RAM state from the buffer cache. So when we make our changes, we put them in the buffer cache, and when we say commit, the thing that's definitely on disk is the log, but the cache has what we need in it. The reads and writes can occur as if this actually happened to the file system. So basically, once the changes are in the log, it's safe to apply the changes to the data structures on the disk. And we can do that slowly in the background if we want, because the ground-truth state of the system is in the log itself. Then, once the changes are copied, we can truncate the log and remove the transactions we've already applied. And if the last atomic action is not done and we don't put a commit in it, then if we crash and restart, none of those changes in the log will have been applied to the file system, and we can just ignore them. That gives us the second half of the "either it all happens or none of it happens." And that's how we do that.
The basic assumption that we're gonna use here is that updates to sectors are atomic and ordered. That's not necessarily true unless you're very careful about how you use the disk itself. Now, there's a question here: so the actual writes or changes are happening in the cache and only flushed if the log is correct? That's right. You only make the changes in the buffer cache, and then you only allow them to actually change the on-disk state when they've successfully been applied in the log and you have a commit record. Okay, so you have to be a little bit careful about how you label the buffer cache entries and so on, so they don't accidentally go out. Now, what happens if the system crashes while we're writing the log to disk is another question, and the answer is: the right thing. If the system crashes and the log doesn't go all the way to disk, then when we restart, that last commit record will be missing, and therefore none of the things in that transaction will have been put into the file system, and the file system will be consistent, okay? And similarly, I see a lot of questions about the log here, so think of it this way. We have this structure on disk, and what we do is we say: I'm about to start a transaction. So what happened was a user made a system call. They said, I wanna create a new file and write some blocks to it. We start the transaction. We do all the updates as records in the log. We say: do this, do this, modify these bytes, do that, change the free list. And then we say commit, and only after we have pushed this log piece entirely to disk successfully do we return to the user and say okay. So as far as the user is concerned, nothing has happened until we return to them. So if we crash at any time up to this, the user doesn't care, because we never returned to them.
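The write path just described can be sketched in a few lines. This is a minimal toy, not a real file system API: `Journal`, `create_file`, and the operation names are all illustrative. The key ordering it shows is that changes land in the buffer cache immediately, but the journal, including the commit record, is flushed to "disk" before we return to the user.

```python
# Sketch of the journaling write path: buffer cache updated right away,
# journal (with COMMIT) forced to disk before the system call returns.

class Journal:
    def __init__(self):
        self.disk_log = []   # stands in for contiguous journal tracks on disk
        self.pending = []    # records accumulated for the current transaction

    def start(self):
        self.pending = [("START",)]

    def add(self, op, *args):
        self.pending.append((op,) + args)

    def commit(self):
        self.pending.append(("COMMIT",))
        self.disk_log.extend(self.pending)  # append-only write of the records
        self.flush()                        # forced out before we say "done"
        self.pending = []

    def flush(self):
        pass  # a real system would flush/fsync the device here

buffer_cache = {}   # in-RAM state: reads see the new file immediately

def create_file(journal, name, inum, block):
    journal.start()
    journal.add("alloc_block", block)       # new free-bitmap state
    journal.add("alloc_inode", inum)        # new inode table entry
    journal.add("dir_entry", name, inum)    # new directory data
    buffer_cache[name] = inum               # visible via the cache right away
    journal.commit()                        # only now do we return to the user

j = Journal()
create_file(j, "notes.txt", 7, 42)
assert j.disk_log[0] == ("START",) and j.disk_log[-1] == ("COMMIT",)
```

If we crash anywhere before `commit()` finishes, the on-disk log has no COMMIT record, so recovery ignores the whole transaction, and the user never got an "okay" anyway.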
And if we don't get everything up to and including the commit record in there, then the data will not be committed to the file system on disk. On the other hand, if we do get the commit record in there and crash afterwards, then we can keep restarting as much as we want, and we'll reapply these items until we get all the way through to the commit record, and then we can discard them from the log. Okay, so let me give you an example here. Suppose we're creating a file. In the old days, say for the fast file system, we found a free block. So if you notice here, we find a bit in the free list and we find a free inode entry, okay? We mark a block as not free, we grab something in the inode table, we find an insertion point in our directory, we mark things as used, we write the inode entry to point to the blocks, and then we write the directory entry to point to the inode table. And when we're done, we've basically created a new file. Now, last time we talked about how the fast file system and similar things like EXT2 try to do this in an order where, if you don't make it all the way to the very last stage, that FSCK operation will go through, find the dangling pointers, and be able to recover them properly; it's only if you make it all the way through that the file is fully created. The problem is that that requires a lot of thinking, and you gotta be very careful to get it right. And there are some things that are essentially impossible to get right. So what we can do instead, which I was just talking about, is let's have the journal down here. The journal is basically a chunk of contiguous tracks on disk, or in some systems it can even be a file. And what we have is a tail and a head. The head is the point at which we're gonna write new entries, and the tail is the earliest point in the log that we haven't applied yet to the file system.
Okay, so now when we go to find a free data block, yes, we find it, and we find an inode entry and we find a directory position. But what we do is we actually put a start record. Notice these are kind of shaded because we haven't actually done anything on the disk. We put a start record in there, we keep track of the new free bitmap, we keep track of the new inode table entry, we keep track of the new directory data. And then at the very end, we write a commit. It's only after we write that commit that this file has been created. And notice, I say that: I write the commit, it's forced to disk, and the system is gonna return to the user as if that file's been committed and written, even though it's only in the log. So that requires some cleverness on the part of the file system. Not too much cleverness though, because all we really have to do is make sure that the buffer cache is updated properly without updating the file system until after commit. So it's really that commit point that represents the point at which we have this view as if the file was committed. And if we crash after this commit without having applied it to the file system, that's okay, because recovery starts up and it starts scanning. It knows where the tail is; that's kept somewhere safe. It scans through for all of the fully committed transactions that haven't yet been applied, applies them, and then eventually moves the tail onward and clears out the journal. So here, for instance, after we have the commit, all that has to happen in the background or on reboot is we start redoing these changes by actually applying them to the physical disk itself, okay?
And so each log entry gets applied, and eventually all of the blocks on the disk are updated, and at that point the tail is moved and this commit record is thrown out, okay? So this journal is really a type of log that's only used to give us good transactional semantics, and the data in it is only kept there until we've applied it to the real file system, okay? Questions. So the question here is: when is the log or journal itself flushed to disk? Well, you need to make sure that the journal is flushed to disk before you tell the user that things have been committed. So you basically flush it out, and then you return to the user and say, okay, you're ready. Now the question about, do you have to write the journal for things like allocating the free map? Yes. The journal is the place where you are journaling pretty much any change to the file system itself: any reallocation of blocks, any data writes (so far; I'll modify that in a moment), any changes in inodes, et cetera. All of those have to be put into the journal, all right? We'll talk about the expense of that in a moment, because those of you that are on top of things realize we're talking about writing everything twice, right? Now, during recovery, what happens here is we start scanning the log, and notice that we detect a transaction start without a commit record. In that case, we're just gonna do nothing, okay? So this transaction, which was started but never completed, never affects the state, and we don't have to worry about the file system being in an inconsistent state, okay? We discard partially entered transactions, and the disk remains unchanged. Now, here's an example where we had a commit and then we crashed, and the file system hasn't been updated. Now as we're scanning through, we actually find a complete transaction.
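The recovery scan just described can be sketched as follows. This is a hedged toy, with illustrative record formats rather than any real journal layout: walk the log from the tail, redo only transactions that reached their COMMIT record, and silently drop anything that was started but never committed.

```python
# Sketch of journal recovery: replay complete transactions, discard partial ones.

def recover(log, disk):
    """Apply every committed transaction in `log` to the `disk` dict."""
    txn = []
    for record in log:
        if record == ("START",):
            txn = []                        # begin collecting a new transaction
        elif record == ("COMMIT",):
            for _op, key, value in txn:     # redo each logged update
                disk[key] = value
            txn = []
        else:
            txn.append(record)
    # Anything left in txn never saw a COMMIT: ignore it; the disk stays
    # consistent, which is exactly the "none of it happened" half of atomicity.

disk = {}
log = [("START",), ("write", "inode_7", "new inode"), ("COMMIT",),
       ("START",), ("write", "block_42", "lost")]   # crashed mid-transaction
recover(log, disk)
assert "inode_7" in disk          # committed transaction was redone
assert "block_42" not in disk     # partial transaction was discarded
```

Replaying is idempotent here: if we crash during recovery and run `recover` again from the tail, the same committed updates just get written again, which is why we can "keep restarting as much as we want."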
So we just go ahead and redo it, applying it to the disk itself, and then eventually throw all this out. And what you'll realize is that the action of recovering and applying these redo records to the disk is essentially the same thing that's happening in the background when it's not crashing, which is where we're freeing up space in the journal by actually applying things to the disk itself, okay? So why go through all this trouble? Well, updates are now atomic even if we crash. Updates either get fully applied or they're discarded, and all the physical operations that we're gonna do are treated as a logical unit. What's nice about this is, yes, we could try to do the ad hoc thing I mentioned with the fast file system, which is you go through and you try to apply things in some careful order so that you can always detect that it was only partially done and undo it; but not only is that tiring and hard to get right, it won't be right, okay? It doesn't matter how good you think you are, you're gonna get it wrong, and this is a nice clean way to get atomicity out of your file system. Isn't this expensive? Well, it is: we're now writing all the data twice, once to the log and once to the actual data blocks in the target file. Now, one thing we can do, if we're really paranoid like this and we want to have all our data written twice as well, is to have a separate disk for the log. Notice that the log is being written totally sequentially, so it's operating at the highest bandwidth the disk can support, which is very good. The log can be pushed out potentially a lot faster than you can actually write the data to the file system, and so this expense of writing twice, especially if you have two separate disks, one for the journal and one for the file system, can actually be fully overlapped, and perhaps you don't even notice it from a performance standpoint.
Alternatively, modern file systems offer the option to journal only the metadata: things like changes in the free list, changes in the inodes, linking into directories. Metadata is pretty much everything but the data itself, and there is a mode, in many cases the default Linux mode, where all you're doing is journaling the metadata, and the file data instead goes directly to the blocks that are getting written. The upside of this is you're not writing all your data twice. The downside is that when you crash, it's possible that the data that you wrote hasn't made it to the disk yet, okay? So it's important to keep that in mind. We're basically making a decision that it's okay to have the data corrupted but not the metadata. And that decision is pretty much one that says it's far worse to lose the actual structure of your file system, the inodes in it and the pointers to the blocks, than it is to lose the data. Perhaps you buy that, perhaps you don't. Now, the question here is: is this why it's safe to pull out USBs without ejecting them? Pretty much the reason it's okay to pull things out without ejecting them is that modern systems are trying to do their best to push stuff out to disk. I would still not be comfortable just yanking out a USB drive without trying to eject it. The fact that you can pull it out without losing data is probably because you've waited long enough that the file system decided to push it out, but I think you're biding your time till the disaster starts. Okay, so that's a journal. Now, there's something slightly different from a journaling file system called a log-structured file system, and I wanted to make sure we talk about this. The log-structured file system is one where there is no file system of the normal structure; there's only the log, okay? The log is what's recorded on disk.
There's no fast-file-system-style attempt to have block groups with data linked in a way that reads well and so on. Every time you make a change to any inode or any other part of the file system, you just record it in the log, and at the same time you record all the altered inodes and metadata, and they just point into the log, okay? So inodes and directories are written into the log. The saving grace here is that we assume the buffer cache is large. So the fact that our file becomes fragmented all over (I'll show you a picture of this in a second so you can get a better idea of it) is okay, because most of the reads are handled by the cache, and therefore it's fast. And we're essentially doing everything in bulk. The log is a collection of really large segments that are all sequentially written on the disk. What we get out of this is that all writes, no matter where you're writing in the file system, or where in a file, and whether you're writing randomly or not, are recorded as one long sequential run in the log, which is very fast, because we're getting full sequential access out of writes. On reads, we're gonna get much worse access, but we're gonna assume the cache helps. Every segment contains a summary of all the operations in the segment, and the segment is where we're gonna do our garbage collection later, to see what parts of the log can be thrown out without changing the file system. So free space is basically gonna be recovered by a cleaning process. Now, I wanted to give you a picture to get a better idea.
And by the way, I have the log-structured file system paper in the resources; you can read the original one that was part of the Sprite system here at Berkeley. It's what I show you here on the left, and the regular fast file system is on the right. The idea here is, let's suppose that I create file one in directory one and file two in directory two. If you notice, in the case of the log-structured file system, what I'm gonna do is write the first block of file one, then write a new inode for file one in the log, then write an updated block in the directory, then change the directory's inode to point to that new block, and then I'm gonna do file two and directory two and so on. And if you notice, all of these things that would be spread all over the fast file system are actually all sequential, and so I get really fast write access. In the fast file system, we have directory inodes kind of on the outside of a block group, and the file inodes are somewhat close to them, but then the file blocks are inside the block group, both for the file and the directory itself. And if the file has a bunch of blocks in it, they'll all be sequential, hopefully, if the fast file system does a good job of allocating. As a result, what we see here is that this layout is optimized for fast access later, after the writes happen. So if I go to my directory and I do an ls -l on all the data, then within a block group things are gonna be very rapid: I can read all the data in the file and in the directory, and maybe file two is in a directory that's in a different block group, but they're all optimized for locality of reads. The log-structured file system is optimized for locality of writes.
And so for the fast file system, you can see what really happened: when I opened directory one, I created the new inode, and I had to seek, possibly, to get a data block for the directory. Then I create a new inode for file one itself, having put it in the directory, and then I find a new block for file one and I do the writes, and then similarly I do the accesses for the directories and file two. So I'm seeking back and forth. Now, it's within a block group, so it's not too expensive, but it's still seeking back and forth, whereas the log-structured file system is one quick write. Reads are the same in either file system, essentially pointer following. It's just that the inodes don't have a constant position in this file system. The only thing that has a constant position is the top-level root inode. For everything else, I have to go to an inode map to find out where the inode currently resides, and then I can start following pointers within the log to get the data. In the case of the fast file system, I'm still pointer following, but the inode itself stays in a fixed place. And the buffer cache is likely to hold information in both cases, so hopefully the reads are fast; but if you happen to have to go out of cache, you have very different behavior: you're gonna get nice, hopefully sequential behavior out of the regular fast file system, and maybe very random behavior out of the log-structured file system. Now, why do it this way? Well, the answer in the original paper is that this way we get really fast write access. So if you have a database that's doing a lot of writes to random items, you're gonna get really high performance out of the log-structured file system, and in cases where you really did need good read behavior, you're gonna assume there's hopefully enough DRAM to get fast cache access out of your buffer cache.
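Here's a miniature sketch of the log-structured idea, with illustrative names (a real LFS like Sprite's writes large segments and keeps much more metadata). Everything, data blocks and inodes alike, is appended to one sequential log, and an inode map records where each inode's latest version currently lives.

```python
# Miniature log-structured file system: one append-only log plus an inode map.

log = []          # the entire "disk": an append-only sequence of blocks
inode_map = {}    # inode number -> log offset of that inode's latest version

def append(block):
    """Write a block at the head of the log and return its log address."""
    log.append(block)
    return len(log) - 1

def write_file(inum, data):
    data_addr = append(("data", data))               # data goes into the log
    inode_addr = append(("inode", inum, data_addr))  # so does the new inode
    inode_map[inum] = inode_addr                     # old version is now garbage

def read_file(inum):
    """Reads are pointer-following, starting from the inode map."""
    _, _, data_addr = log[inode_map[inum]]
    return log[data_addr][1]

write_file(1, "v1")
write_file(1, "v2")               # a rewrite appends; nothing is overwritten
assert read_file(1) == "v2"
assert log[1] == ("inode", 1, 0)  # the stale v1 inode awaits segment cleaning
```

Notice that both writes were purely sequential appends, while the stale blocks left behind (`log[0]` and `log[1]`) are exactly what the segment cleaner has to garbage collect later.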
In the context of what we've just been talking about, you can build your journal out of the log itself, and so you can get transactions by putting transactions in this log, so that new changes only take effect after you hit a commit. So you can combine your block management and your transaction management all in one log. So, questions. Probably a good question that you ought to be asking is: does anybody use the log-structured file system? You know, it's not common in spinning storage. I'm gonna tell you a place where it is very common in a second, but you can actually find a log-structured file system module for things like Linux, for instance. Now, can anybody guess where these ideas might really show up? Yeah, so there's a question about Borg. So there is log-structured storage used in many places, and this can work for archival storage as well as things in the cloud. So yes, it can be used for internal logging. The question is tape storage. Well, yeah, I suppose, but nobody uses tapes anymore. Could it be used for databases? Perhaps, although the database papers that we read... we read a key paper called ARIES in 262 that shows how this block management can be done in a really clean way without having to use the log this way. And so in databases, the log is usually just used for transaction management, not for storage. 186, that's a good guess. However, here's the place where this might make sense. It's when you wanna have this idea of writing over all of your storage before you write over any of it again, and where reading randomly isn't gonna be a problem; and this gets pretty random when you think about the fact that part of your file may have been written weeks ago and is therefore far back in the log. And that's flash. So I did promise a little bit of flash storage to you guys.
And so F2FS is an example of a flash file system. I put a paper up on this one too. This was originally designed by Samsung and has been incorporated in things like the Pixel 3 from Google and so on. It's an active file system, supports block encryption for security, and has been mainstream in Linux for, I don't know, five years at least; you can look it up. And basically it's assuming what we talked about before with SSDs, which is a built-in flash translation layer in the controller itself, and that random reads are fast. Okay, they're as fast as sequential reads, because we don't have any seek or rotational delay. But what's interesting, and this is a point that they make in this paper, is that random writes are actually bad for flash storage. Let me explain. If you remember when I talked about flash storage a couple of lectures ago, what I told you was that when you write, you can never overwrite. If you write a block over again, you actually have to copy its current contents, with your changes, to a new clean page, leaving the old page to be garbage collected at some point in the future, when the flash translation layer can take a bunch of pages that are together and erase them on a whole-block basis. So if you're randomly writing all over the place, you're invoking this continuous copying process, which wears your flash out a little faster than it might otherwise and also degrades your write performance a bit, because the flash has to keep erasing blocks. So this is an interesting file system that's taking advantage of properties that are very different from disks: random writes being hard on the storage, and random reads being fast. And so basically what it does (the paper's up on the resources page; you should go take a look) is minimize writes and updates wherever it can.
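To see why overwrites are costly, here's a hedged sketch of the copy-on-write behavior inside a flash translation layer. The class and names are illustrative toys, not a real FTL: flash pages cannot be rewritten in place, so every logical overwrite is programmed into a fresh physical page, and the old page becomes garbage waiting for a block erase.

```python
# Toy flash translation layer: logical overwrites always go to a clean page.

class FTL:
    def __init__(self, num_pages):
        self.flash = [None] * num_pages  # physical pages: write-once until erased
        self.map = {}                    # logical page -> physical page
        self.next_free = 0               # next clean page to program
        self.stale = 0                   # dead pages awaiting a block erase

    def write(self, logical, data):
        if logical in self.map:
            self.stale += 1              # old version can't be overwritten in place
        phys = self.next_free            # program the next clean page instead
        self.flash[phys] = data
        self.map[logical] = phys         # remap the logical page
        self.next_free += 1

    def read(self, logical):
        return self.flash[self.map[logical]]

ftl = FTL(16)
for i in range(4):
    ftl.write(0, "version %d" % i)       # four overwrites of one logical page
assert ftl.read(0) == "version 3"
assert ftl.stale == 3                    # three garbage pages to clean up later
```

Random small overwrites scattered across many logical pages generate garbage in many different erase blocks at once, which is exactly the copy-and-erase churn that wears flash out; sequential, log-style writes concentrate the garbage instead.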
So it's trying to avoid wearing out the flash, and otherwise tries to keep the writes as sequential as possible. If you think about a log-structured file system, what we're doing is taking a bunch of changes and just walking them through and writing them to the log, not necessarily in block sizes. So the question is: does this mean that all, or virtually all, SSDs have this interface? Do you mean the flash file system, or the interface that I mentioned up here? That was a question on the chat. So basically, SSDs all have this behavior: they all have this flash translation layer, fast random reads, and random writes are bad for them. Flash file systems don't all take advantage of this in the same way as F2FS. There are several different flash file systems; this is just one, and I thought I'd point it out because it's interesting. It basically starts with the log-structured file system, exactly what I showed you just a moment ago. And if you notice, if I write my changes as diffs in the log, rather than necessarily per block, I can even keep the number of pages being written very small by writing sequentially in the log. On the other hand, if I wanna keep to the page boundaries, I still can: by writing sequentially, I'm gonna hit all the pages in the same erase block and then go on to the next set of pages, and as a result blocks tend to be used sequentially and can be erased when they get old. So it works well from the block standpoint too. But basically it starts with the log-structured file system, which has to do copy-on-write anyway because of its very nature, and keeps the writes sequential. And one of the key ideas that F2FS has is this node address table for logical-to-physical translation, which is independent of the FTL.
So the file system translates a logical storage name to a flash storage name, and then the FTL translates again under the covers, so there's two levels of translation. You should check out the reading section; it's kind of fun because it compares with some systems you're likely to run into, like EXT4 and Btrfs and a few others. But here's its layout, and I just wanted to give you this so you can see that file system design can get very interesting. What they do is they have a set of blocks in the file system which they assume are kind of hot and are gonna be written a lot, but there's not very many of them. And then they have a bunch of different logs, each of which is sorted into hot, warm, and cold regions, where a hot region is one that has a lot of data going to it quickly, and a cold region is one that sees very slow writes and doesn't get updated too often. They divide things into segments, just like we talked about with the log-structured file system a little bit earlier, for doing garbage collection, with 4K blocks, and blocks are typed to be node or data. And this node address table basically translates a name that the file system has for something like an inode into the flash name, before it talks to the flash controller, which then translates it to the underlying device. It writes data sorted by a predicted write frequency, and you can take a look at the paper to see what that's about. It has checkpoints to keep the file system status stable, so that if something crashes in the wrong place, we can go back to a checkpoint. And the segment information table is used to say which of these segments need to be garbage collected and so on. Now, I did wanna put one kind of thing out here. I talked about this node address table, and I wanted to say a little bit about why it's interesting. So if you look, here is an inode structure that you're used to; bring your brain back to the fast file system for a moment.
We kind of have the top level (they call it a checkpoint), and that top level points to the top level directory, and that has inodes for the directory and directory data. And at some point there are also inodes for regular files and file data. And those inodes point to blocks, possibly indirect blocks, which point to direct blocks ultimately, which point to the file data. And in the case of a log structured file system, here's the log. And what happens when you update this file data is you update it by writing to a different block, which means that you have to update the direct pointer to point to that new block, and the indirect node to point to that direct block, and the inode to point to the indirect block, and the inode map to point to that, and so on, and up top it has to point to the updates. That's basically what happens in a log structured file system, because each one of these inodes, when you modify it, changes its position, because it's now somewhere else in the log. And so this means that there's a lot of changes that are also causing data to be written in the flash, which is wearing out the flash. So this would be a normal log structured file system. But if you think about it for a second, we don't really need to do all that, because we put this translation table in here, and the names that we have for our pointers are logical names that we look up in the NAT to find out what the current physical node is. And so when we alter our file data, what we do is we alter a pointer (say we're writing a new block, we alter a pointer in the direct node to point to the file data), but now that node is being referenced by a logical name that we can look up in the NAT. So we only really have to update a couple of blocks in the flash and a little bit of data in the NAT for the very leaves of the tree. And so there are many fewer things being updated.
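Here's a tiny model of that idea, with hypothetical names and structures (a real F2FS layout is much richer). The inode refers to the direct node by a logical node ID, and only the NAT maps that ID to a log position, so rewriting a data block touches the new data block, a new copy of the direct node, and one NAT entry; nothing above it moves.

```python
# Toy node-address-table model: interior pointers are logical node IDs,
# so updating a leaf doesn't cascade up the tree (names hypothetical).
class LogFS:
    def __init__(self):
        self.log = []   # append-only log of (kind, payload) records
        self.nat = {}   # logical node ID -> position of that node in the log

    def append(self, kind, payload):
        self.log.append((kind, payload))
        return len(self.log) - 1

    def write_file_block(self, node_id, data):
        # New data block, then a new version of the direct node...
        data_pos = self.append("data", data)
        node_pos = self.append("node", {"data": data_pos})
        # ...and a NAT update. The inode above is untouched, because it
        # names the direct node by node_id, not by log position.
        self.nat[node_id] = node_pos

    def read_via(self, node_id):
        node = self.log[self.nat[node_id]][1]
        return self.log[node["data"]][1]
```

Each update appends exactly two log records plus a small NAT change, instead of rewriting the direct node, indirect node, inode, and inode map all the way up.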
And so this is an example of how carefully planning the way you lay out your file system can drastically reduce the number of writes to the file system that wear it out. And the other thing that they do is they have different logs, so they have logs for direct nodes and logs for data and logs for indirect nodes. And the reason for this is that each of these things has a different speed at which it's updated. So you can see that direct nodes might get updated frequently, but indirect nodes not so frequently. And so as a result, each of them has a different part of the flash where they're writing a log. And so again, we're kind of coalescing together things of the same speed so that we don't garbage collect quite so frequently. Only the ones that are getting updated frequently need to be garbage collected frequently, and that thereby saves flash wear as well. All right, that was a whirlwind discussion here and there's a lot more in the paper, but it gives you kind of an idea of the very different point of view that people have when they start thinking about flash: a point of view that deals with the fact that reads are fast no matter how random they are, that we have to be careful about overwriting data too frequently, and that we even want locality in our writes to try to help the flash controller do a better job of garbage collecting. All right. So moving on, to change the topic here now, I wanna talk about distributed storage and systems as a whole, and I'll start by mentioning, this is a very old slide of mine, but I always like to use it, which is to think about the fact that really, the world is not made up of a bunch of individual systems like your phone and in your car and whatever. What it is is one huge system, all interconnected.
And it's everything from MEMS, like these little MEMS cockroaches that were made by folks in Cory Hall many years ago that could walk and so on, sensors, et cetera, through to cars (modern cars have hundreds of CPUs in them), on up through smart refrigerators, into data centers and the cloud as a whole, and pretty much this whole thing is one big distributed system. And you should really treat it as such, because then we can start asking about where we put caching in this system. Do we try to keep our queries mostly at the local level but then occasionally go up to the cloud, and so on? And so this vast infrastructure really forms what I like to think of as one huge scalable parallel system. All right. And so that leads us to this interesting question, which is one of centralized versus distributed systems. So up till now in the class, we've been really talking about something like this, where you might have a centralized thing that keeps track of all of the data, and that's our file server, the thing we log in to to get our data, and everything's happening at this one server side and the client talks to that server. Versus a distributed system, where there are many parts of the system and they all work together to give you a coordinated response, but you have to figure out how to make sure that that vast array of different machines working together gives you something that makes sense. So this is the peer to peer model; this is the client server model. And the distinction is really whether there's one entity that can serialize or make decisions at the center of the system, and we'll call that a server for now. We have that in the client server model, not in the peer to peer model, okay? And the early model of this peer to peer system was multiple servers working together, probably in the same room. In today's systems these are potentially spread all over the globe, and they're contributing to the global system.
And we'll talk about some peer to peer storage systems; make sure that you have a good idea of how things like Chord work and so on to give you key value stores. So what's the motivation behind this? So this is distributing our systems, and in fact the distribution could potentially be really drastic. We could spread this all over the globe with thousands of nodes if we wanted to, and why would we do that? Well, part of the reason that you might wanna do this is it might be cheaper and easier to build lots of simple computers than a really big expensive one. And this was actually Google's claim to fame long ago, I'll say even 10, 15 years ago. They started building their own super cheap computers that they would then put into big machine rooms. And the reason they did that was so they didn't have to pay for really expensive machines that could each support many users at once. Instead they had many, many, many simple machines that perhaps were single threaded, or not really single threaded but not doing as many things at once. And it's very easy in this sense to add power incrementally. So if you don't have enough, you just add some more. Whereas with a big system, typically you buy a really big rack from some company and it's hard to just add stuff to it. And users may have complete control over part of the system; that may be a goal. And potentially it's much easier for collaboration through network resources if we started out by designing the system for the network in the first place. So the promise of distributed systems is really higher availability: there's more components up, so the likelihood that one is there running is high. Better durability: well, I've got things spread all over, so by using something like Reed-Solomon erasure codes I can spread my data all over the place and get great durability. And potentially more security, because each piece is simpler and maybe I can make it secure if it's doing something very simple.
Now, you can see I've got the word promise here. I italicized it in red because in reality it doesn't really work that way. So reality has been very disappointing. We get worse availability, because you depend in some cases on every machine being up, or particular machines being up. This is Leslie Lamport. He's somebody that you should be very familiar with, a well-known computer scientist. And he has this quote, which I love, which is: a distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable. I'm sure many of you have experienced that. The reliability might be much worse, because you could lose data if any of the machines crash, rather than just one. And of course the security is potentially worse, because anyone in the world can break into the system, because it's made out of so many components. And so you gotta be really careful when you go about building a distributed system. By the way, Leslie Lamport is famous for many things including Paxos, which is a consistency protocol. We'll talk briefly about Byzantine agreement. All of these things came directly from him. So we read a lot of papers from him in 262. The other thing is that coordination is much more difficult when you have many items, many computers, all working together. So you must, say, coordinate multiple copies of shared information so that you don't have inconsistency. Things that are simple to even think about and simple to do in a centralized system with a server are now suddenly a lot more difficult, and you have to think really hard about them. And of course you run into trust, security, privacy, denial of service problems. So many new variants of the problem of securing your information arise as a result of distributing that information. Can you trust the other members of a distributed application enough to even perform the protocol correctly?
The corollary of Lamport's quote: a distributed system is one where you can't do work because some computer you didn't even know existed is successfully coordinating an attack on your system. So welcome to the world of distributed systems. Now I started out with a pretty bleak view of this for all of you, but I wanna point out that distributed systems are the norm, and people have figured out how to get past many of these problems, and today distributed systems are what we do, but you have to be very careful about how you approach them. All right, questions. Now what are some goals of building a system like this? Well, one of the goals, it's often... oh, is torrenting a distributed system? Yes, the torrents were one of the distributed file sharing systems, BitTorrent being the original one. That was a distributed protocol that ran among a bunch of machines; people put their storage and CPU into the system and you got something that was more powerful than the sum of its parts. So that's a great example of a distributed system. We talk about a bunch of different ones in 262 as well. So what are some requirements of distributing things? Well, one requirement is that humans who are not computer scientists keen on how cool a distributed system is, which is basically pretty much everybody but us, right, don't like things to get more complicated, all right? And so basically one of the transparency goals here is really about trying to make the system appear as simple as possible, so that there's a simple interface even though there's many complicated systems out there. And so some possible versions of transparency might be: well, you don't really know where the resources are and you don't care, okay? So all of us have used cloud storage to store and back up our information, and that's an example of location transparency, because you don't really need to know where it is, unless of course you're worried about it getting compromised or you're worried about how fast it is.
You may not worry about migration: resources may move from one point to another to, say, give you better performance or better durability without you knowing about it. Or replication: maybe you don't wanna have to know how many replicas are out there, and you let the system you're paying for service make replication decisions to make sure your data is safe. Or concurrency: maybe you don't wanna know how many users there are out there, okay? So when I'm interacting with some portion of the cloud, I probably don't want to be impacted by how many other people are out there, okay? Parallelism: so maybe jobs are able to be split up automatically into small parallel pieces and made faster for me without me having to do anything. Fault tolerance: wouldn't it be great if, when things go wrong, I don't have to know about it? So rather than the Lamport problem, that I can't do any work because some other machine I don't know about is failing, wouldn't it be nice if the system could automatically adapt to that failure, pull a new system on automatically, and hide that from me? So transparency and collaboration basically require some way for different processors to communicate with one another, and do so in an organized way. And of course that's where the internet enters: it's basically the underlying communication mechanism, but sending packets, unreliable ones at that, between each other is really only the beginning of the game. What's more interesting is sort of what we exchange in terms of information to do a good job at that, okay? And so how do entities actually communicate? Well, they send messages, but more interestingly there's a protocol, okay? So a protocol is an agreement on how to communicate between endpoints, and it includes things like the syntax, which is how you specify the communication and format (an ordering of messages is included in that), and also the semantics: what does a communication actually mean?
Does this communication indicate that, for instance, I'm now ready to commit some actions that I told you about earlier, okay? And it might be actions that are supposed to be taken on transmission, or actions when a timer expires. Now, so the question is, why would a distributed system have worse security and reliability out there? And the answer is that many distributed systems have many points of entry for attack, okay? Yes, blockchains are harder to break into, but they also are very slow in many instances. And so blockchain is an instance of a distributed protocol that does pretty well for its domain, but in general distributed systems have many components to them, and part of that starts with this protocol. Are these protocols secured in a way that somebody can't break into them? And one way to look at this is to describe this whole thing formally by a state machine. You could imagine state machines at either end of a protocol exchange, and what happens is the protocol causes the state machine to change on either end in a way that happens consistently enough that I can have the two endpoints go through a communication consistently, okay? And so one version of a protocol is really to think of it as a distributed, replicated state machine, and a number of different folks have talked that way. By the way, I will point out that I'm a big fan of blockchains, so don't get me wrong on that. I think they have a definite place, very important. The other thing with this idea of replicated state machines on either side is really that we need to have stable storage to help make this replication clean, and I'll show you the simplest example of this, the two-phase commit protocol, in a second, where you need to have something like a disk or an NVRAM or whatever so that the state machine state is kept stable, so that if one element crashes, when it reboots, it kind of knows where it was and as a result can pick up where it left off.
And so some form of stable storage is important to have stability in the face of failures, okay? So this replication is pretty abstract so far, but it turns out that many of the interesting things that you might imagine in a distributed system are really versions of this replicated state machine. So, examples of protocols. Here's a telephone protocol, right? You pick up the phone, you listen for a dial tone, or you see that you have service somehow, you look at the number of bars you've got, you dial the call, you hear ringing, the other person picks up and says hello, and you say, hi, it's John, or hi, it's me. That's the strangest version of the English language I've ever thought of: hi, it's me. Well, who's that? So the caller says, hey, do you think blah, blah, blah, the other one says, yeah, blah, blah, blah, and the caller says bye and the other one says bye and you hang up. So really, this is a protocol, right? You dial the phone number, the callee says hello as a way to initiate a communication on the remote side, and then at the local side, you continue that protocol by saying hi, it's so and so. If you think about all the cold calls you get, I don't know about you guys, but I get a lot of bad calls on my phone, you pick it up and what's missing is maybe somebody on the other side; the other side here isn't actually talking to you, and you know right away that there's a bad protocol going on there. But okay, so then the caller says something and the other caller says yes, you say goodbye, bye, bye and you hang up. So this is actually a set of common protocol messages that you're used to; they're informal, but you tend to use them probably every day.
The problem in general is there are many such possible communication mechanisms, and many such things over which to communicate, and many applications, and you very rapidly get kind of an N squared issue for every application and everything you want to communicate on, and even worse, maybe N cubed for anybody you want to talk to at the other end. What do you do? And the answer is, well, you have many applications and many network styles, and you organize the mess how? Well, you certainly don't want to re-implement everything from scratch for every application and every technology, because then every time you added a new application you'd suddenly have more work to do, and we all know that isn't what happens. What happens instead is we do something else, which is we put intermediate layers in this system to help us out, okay? And these intermediate layers provide a set of abstractions for us over various network functionality and technologies. And so this narrow waist here, which turns out to be IP, is a way to basically get common communication channels on which to build distributed applications that are agnostic as to what's underneath the covers. They don't care, and they work no matter what's up above, okay? And that's really why the internet has taken off so well; it's really because these intermediate layers were carefully designed. And it's called the narrow waist, and you've probably seen this. And by the way, when I add a new application I just need to connect up top to the narrow waist and I'm good to go. You've all seen the internet hourglass, I'm sure, but there's lots of different applications and protocols at the top, lots of different communication technologies, including the actual packets themselves and the protocols on top of the packets, all doing IP, which is then served to a bunch of other folks above. This is the narrow waist of IP, and it's really the way that this all works.
And we'll talk a little bit more about this potentially in another lecture depending on how much time we have, but really it's this abstraction layer that helps. And the implications of this hourglass are really that there's a single internet layer module, IP. It allows arbitrary networks to talk to each other. It allows applications to function on all networks, so any application that can run on IP can use any network. And this is really why the internet of things has taken off the way it has: those devices implement the IP stack, and as a result they can tie in pretty much anywhere. It supports simultaneous innovations above and below, and so you can get vastly new networks without having to worry about it, although changing IP itself is a really painful and slow operation. So it's easy above and below, but that middle narrow waist is hard to change. If any of you have taken 168, you've probably heard the story of IPv6, which was the technology of the future for 20 years until it finally kind of caught on. And it still hasn't entirely caught on. So it's very hard to change the narrow waist, but the positive is you get a lot of power out of it instead. And the drawback of layering in this way is, yes, you've got the IP layer, but then typically you do a bunch of layers above that: on top of IP you put a transport layer like TCP, and then you put some application layers on top of that, and many, many layers. And the problem is every layer adds complexity and performance problems, okay? And headers can start to get really big if you have too many layers. So what I wanna talk about for the moment is something called the end-to-end argument, and I'm sure many of you have heard of this before, but there's this hugely influential paper, which I highly recommend you guys read even though it's from 1984; it's up on the reading list.
It's often called the sacred text of the internet, and it's caused endless disputes about what it means, so it's almost like philosophy: everybody cites it as supporting their position. I'll let you decide what that sounds like. And it's got a simple message: some types of network functionality can only be correctly implemented end-to-end, reliability, security, et cetera. So what does end-to-end mean? It means that if you have a path that goes through a complicated network and there's lots of things in between, there is functionality that can only be implemented at the application layer, at the source and the destination, no matter what you put in the middle, okay? That includes things like reliability, security, et cetera; many of those things can only happen end-to-end. And because of this, end hosts can already satisfy these requirements without the network's help: they can retry messages that are missing, they can encrypt the data themselves and decrypt it on the other end, they can do many different things, they can check that the data that they transmitted was correct by looking at a checksum. And so part of what you get out of the end-to-end philosophy is, well, the ends have to do it anyway, so maybe you shouldn't put it in the middle, okay? And therefore, don't go out of your way to implement these things in the network if you're gonna have to do them on the endpoints anyway. All right, now here's an example of reliable file transfer. So hosts A and B, and you're transmitting a file from one point to another. And what happens is host A reads off the disk, it goes to the application layer at that side, and now it gets transmitted to the operating system, which sends it across a connection; it comes up in the operating system on the other side to the transfer program at host B, and host B writes it to disk. And this seems like a fairly simple thing, right?
Except my favorite example in the end-to-end paper is when there's actually a router in the middle that looks a lot like one of these hosts, and the packets go into the router and then get routed out, and there's an application there that's the router application. And there was an infamous example at MIT where there was a problem in one of the internal routers. Everybody figured that there was some checksumming going on along the various links, and so they didn't worry that data wasn't getting transmitted properly. And what they didn't know is that inside the router, one byte out of every million got transposed with its neighbor going into the router and out of the router, and they didn't discover this until many, many, many copies of the BSD source code had gone back and forth. And as a result, they had corrupted source files, because they were relying on what they thought were reliable links, and the problem was actually in the operating system of one of the intermediate parts, okay? So other than being kind of highly embarrassing, that was highly problematic; they had to go to tape to pull stuff back in. So solution one is basically you make every step reliable and then you concatenate them together and you claim that you've got a reliable system. Solution two is basically that host A generates a checksum, host B checks it, and as a result you know for a fact whether it made it correctly, okay? So that's the end-to-end solution, as opposed to concatenating a bunch of reliable things. And if you think about it for a moment, solution two basically brings things in, checks the checksum and so on, brings it over to the other side. And then we check the checksum, we send the result back to the application saying what was actually received. We pull the file back off disk again, check its checksum. We compare the two: what is on disk at B, and what is still on disk at A?
If they match, we're successful; if not, we're not. And so that's fully end-to-end: we actually look at what's stored on the disk at A and what's stored on the disk at B, and we actually compare them and see whether we got the right answer. Now, if you think about it, you have to do that anyway, no matter what you do. So even if you go to a lot of trouble to make sure that no bytes are ever dropped along the road, or that they're always sent correctly, you still have to check the checksums. And so maybe you don't work so hard to make sure that bytes are sent perfectly; that's the philosophy. So the question here is, doesn't end-to-end worry about the whole path, rather than only the two endpoints? So end-to-end is, like I said, a philosophy. What it says is, there are some things that I can only do at the endpoints, like checking the final checksums. And if that's gotta be done anyway, then I gotta think very carefully about what I do in the middle. That isn't to say that I never do anything in the middle, because perhaps if the pathway in here is really dropping a lot of packets, what happens is if I send all the data all the way over and there was a failure, and then resend it and there's a failure, it might take a long time before I got something to go through correctly. So it might make sense from a performance standpoint to take a link that's very unreliable and spend some time making it more reliable, as it were. But I don't have to make it perfectly reliable. So the end-to-end argument is really about thinking. It's a philosophy of how to think about systems so that you can decide whether you ought to spend more work in the middle or not. So solution one's incomplete, because what happens if memory's corrupted? The receiver has to do the check anyway. Solution two is complete: full functionality can be entirely implemented at the application layer, with no need for reliability from the lower layers.
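The end-to-end check itself is tiny; here's a sketch (SHA-256 chosen arbitrarily as the checksum). The inputs are what A reads back from its disk and what B reads back from its disk; no amount of per-hop reliability below replaces this final comparison.

```python
import hashlib

def checksum(data: bytes) -> str:
    # Digest of the bytes as they actually sit on disk.
    return hashlib.sha256(data).hexdigest()

def transfer_ok(read_back_at_a: bytes, read_back_at_b: bytes) -> bool:
    # Compare what's actually stored at both ends: the end-to-end check.
    return checksum(read_back_at_a) == checksum(read_back_at_b)
```

Notice that the MIT router bug, which transposed neighboring bytes, would be caught here even though every link-level checksum passed.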
But it doesn't say that you never implement stuff in the middle. What it says is you might do it for efficiency. And implementing complex functionality in the network doesn't reduce the hosts' implementation complexity, because the hosts have to do it anyway. It does make the network more complicated. So that says do things carefully, all right? It probably imposes delay and overhead on everything, even if you don't need all the functionality. Now, the reason I like to say that this is a philosophy is, for instance, if you have a really lossy link, it makes a lot of sense to put in some per-link retry or forward error correction or whatever to try to give you better overall behavior. It's just that it doesn't have to be perfect. By the way, another great example of where it might make sense to do something in the middle is if you're trying to avoid denial of service: what you might wanna do is try to recognize packets that are part of an attack and refuse to let them through in the network, because in that case the endpoints are gonna be at a loss to figure that out, because they're gonna be overwhelmed. So the conservative interpretation of end-to-end is: don't implement a function at the lower levels unless it can be completely implemented there; unless you can relieve the burden from the hosts, don't bother. A modern interpretation is: think twice. If hosts can implement the functionality correctly, implement it in a lower layer only as a performance enhancement, and do so only if it doesn't impose a burden. And you might say, well, this is from 1984; is it still valid? Well, denial of service is an example of doing something in the middle that is greatly beneficial. What about privacy against intrusion? Again, that's a form of blocking packets that may not be properly signed.
So there may be things that are done in the network still, and basically the reason I bring up end-to-end is so that when you're designing a networked distributed application, you have that careful thought about whether you're putting in functionality that's overall beneficial or not, rather than let's do everything as perfectly as we can right off the bat in the network. So how do you actually program a distributed application? You need to synchronize multiple threads running on different machines, but there's no shared memory, so you can't use test and set. And so you have an abstraction of sending and receiving messages. It's already atomic: assuming that we put some sort of checksum on the message itself, it either makes it all the way or it doesn't. And so we can take advantage of that atomicity in our applications. The interface is something often called a mailbox, which is a temporary holding area for messages, including both the destination location and the queue it goes in. One way to think of a mailbox might be some combination of the IP address and port; we'll talk more about that a little bit in a future lecture here. Send says send this message to that mailbox, and receive says here's a buffer, wait until I get a message. Okay, so this is gonna sound a lot like what we've already talked about in terms of sockets, with a bit of a message header on it to sort of say what a complete message is. Now the question here is, why can't two receivers get the same message? So basically your mailbox is gonna uniquely designate what the receiver is. So if you remember, in the case of a socket we actually identified, for a given IP address, which is a physical node, the application that was supposed to receive it, by the port. Okay, and so there's gonna be something that makes it unique.
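A minimal sketch of that mailbox interface (naming mailboxes by an (address, port) pair is my assumption for illustration): send deposits a message into a named queue, and receive blocks until a message is there.

```python
import queue

mailboxes = {}  # mailbox name -> queue of waiting messages

def _mbox(name):
    # Create the mailbox on first use.
    return mailboxes.setdefault(name, queue.Queue())

def send(mbox, message):
    _mbox(mbox).put(message)   # returns once the message is buffered

def receive(mbox):
    return _mbox(mbox).get()   # blocks until a message arrives
```

Because a mailbox name like ("10.0.0.1", 80) uniquely designates one receiver's queue, two receivers can't consume the same message: get removes it from the queue.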
Okay, and so using send and receive: for instance, should sending a message to a mailbox return to the user when the receiver gets the message (so that would be its ack), or when the message is safely buffered at the destination, or right away? You can come up with many semantics here and they're all different. And there's really a couple of questions: when can the sender be sure the receiver will get it, and when can the sender reuse the message memory? All of these are questions of interest here, and it's also sort of how far up in the application stack of the receiver the sender needs to confirm that the message was received. So the mailbox really provides one-way communication, kind of from T1 to T2, and it's like a buffer in the network between the two of them. Very similar to producer-consumer, if you like to think of it that way, if you remember from the beginning of the class. However, you can't really tell in this instance whether the sender or receiver is local or not. So you can set up this mailbox so that it's on the local node or it's on a remote node, and other than a performance hit you might actually not be able to tell the difference between those two, and that sort of gives you a way of dealing with transparency, where the locality may not matter as much. So we can use send and receive for producer-consumer style. So while the producer keeps sending a bunch of messages, the consumer can execute a receive, wait until a message comes in, process it, and get the next one. So what I'm showing you here is very similar to the echo system that I showed you at, I don't know, lecture eight or something in there, where we were looking at socket communication, where there was a server that was constantly waiting for a query and processing it.
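Here's that producer-consumer pattern over messages, sketched with a thread and an in-process queue standing in for the network buffer (the sentinel-based shutdown is just one possible convention, not part of the lecture's interface):

```python
import queue, threading

mbox = queue.Queue()   # stands in for the network buffer between T1 and T2

def producer(n):
    for i in range(n):
        mbox.put(i)    # "send" a message
    mbox.put(None)     # sentinel: tell the consumer we're done

def consumer(out):
    while True:
        msg = mbox.get()      # "receive": wait until a message comes in
        if msg is None:
            break
        out.append(msg * 2)   # process it, then go get the next one

results = []
t = threading.Thread(target=consumer, args=(results,))
t.start()
producer(3)
t.join()
```

The consumer plays the same role as the echo server waiting on a socket: a loop of receive, process, repeat, with the queue absorbing any rate mismatch between the two sides.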
All right, and this is buffered, so you don't really have to keep track of how much buffering is in the network — that's going to be up to some protocol like TCP/IP, which we'll talk about in the future. So what about two-way communication? Well, you just have two mailboxes: a request and a response. Read a file on a remote machine, request a webpage, and so on — that's two directions of communication, also called client/server. And here's an example of a file service: you send off "I want to read the root file on vega," you get the response back, and now you've done the read. The server, on the other hand, waits until it gets the command for which file is wanted, decodes it, reads the file into an answer buffer, and sends that back — and all of a sudden we have a file server. Now, we're going to break that out quite a bit more next lecture, but you can see that messaging is the basic primitive: if we can get messages to something on the network and get responses back, we can start to build interesting things. If we have a single server that everybody's using, we can start talking about the consistency that server is guaranteeing; and if we have a big peer-to-peer system with thousands of nodes, we have to start thinking about what consistency means when I could get a response back from any of a thousand possible responders — and that's going to get more interesting, okay? But messages give us the basic possibility. So for the remainder of this lecture — and we're gonna pick this up next time — I want to talk about distributed consensus. The consensus problem arises when we have many nodes all trying to make a decision together: all nodes propose a value, some nodes might crash and stop responding, and eventually all the remaining nodes decide on the same value from the set of proposed values.
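The file-service exchange can be sketched with two mailboxes, one per direction. This is a single-process illustration under heavy assumptions: the "disk" is just a dict, and the request format, file names, and shutdown convention are all made up for the example.

```python
import queue
import threading

requests = queue.Queue()    # client -> server mailbox
responses = queue.Queue()   # server -> client mailbox
FILES = {"/vega": b"contents of vega"}   # stand-in for the server's disk

def file_server():
    while True:
        cmd, path = requests.get()               # wait for a command
        if cmd == "shutdown":
            break
        if cmd == "read":                        # decode it, read the file...
            responses.put(FILES.get(path, b""))  # ...and send the answer back

def read_remote(path):
    requests.put(("read", path))    # request: "read this file"
    return responses.get()          # block until the response arrives

t = threading.Thread(target=file_server)
t.start()
data = read_remote("/vega")
requests.put(("shutdown", None))
t.join()
print(data)  # → b'contents of vega'
```

Swapping the two queues for real sockets turns this into the remote file server the lecture is building toward, without changing the client's logic.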
This is distributed decision making, where as a group you're trying to choose between true and false, or between commit and abort. And an equally important but often forgotten part of this is that not only should everybody who's properly running come up with the same decision — true or false, commit or abort — but we want to make sure that decision can't be forgotten if nodes decide and then immediately crash. So we need the D, the durability part, of the ACID semantics. And in a global system, how do we get D? Well, we could use erasure codes, which I reminded you about at the beginning of the lecture, or massive replication. Blockchain is a great example, where thousands of copies of data are kept around the world, and as a result it's really hard to destroy the data. The problem with that massive a level of replication is it tends to make things extremely slow — and by the way, I'm very specifically talking about the type of blockchain that was in Bitcoin originally; there have been some less expensive versions coming online, quite a few of them in fact. So when we're thinking about this decision making, I want to give you something called the General's Paradox, which is stated as a war analogy. You have two generals on separate mountains. They can only communicate via messengers, and messengers can be captured. Now we want to coordinate an attack somehow — or, in the case of a distributed system, coordinate some action that's going to happen at a given time. If the generals attack at different times, it's a failure; if they attack at the same time, they win. So the decision has to be about what time the attack will happen, so they can both perform it. And this is named after Custer, who died at Little Bighorn because he arrived a couple of days too early. I will point out, by the way, that many of these analogies are in the context of wars or fighting or battle.
I guess that's just kind of the way it is; you'll have to ignore that aspect if you're not particularly fond of it. But let's see if we can solve this problem. Can messages over an unreliable network — which, by the way, is our messengers getting captured — guarantee that two entities do something simultaneously? And the answer is no, even if all the messages get through. Okay, so here we are with two sides of the mountain. The first one says "11 a.m.?" "Yep, 11 works." "So 11 it is." Yeah — but what if you don't get this ack, et cetera? The issue here is that you never know for a fact that both sides have agreed. The first side asks, "should we do 11?" The second side says yes. But at that point, the second side doesn't know that the first side has heard its confirmation, so it can't go ahead, because it doesn't know the other side will go. So now the second one is saying "so 11 it is," and so on, forever. There's basically no way to coordinate this attack perfectly — that's the General's Paradox. And in real life, you'd use something faster and more reliable than messengers, like a radio, for simultaneous communication. But in reality, with networking, we don't have that option. So clearly we need something other than simultaneity. And by the way, this is even worse — or really the same problem — if some of the messages get lost. So two-phase commit is an option here: we can't agree on a time, but we can agree to do something or not. So we change the problem, all right? We can't solve the General's Paradox, but we're going to solve a related problem, which we'll call a distributed transaction because it's quite closely related to, and used for, transactions: we have a set of machines, and they're all going to agree to do something or not do it, atomically, with no constraints on time.
So we can't guarantee when it happens, just that it will eventually happen and that everybody who's operating properly will do what everybody else agreed to, okay? And this was developed by Jim Gray, a Turing Award winner — he was the first Berkeley PhD in CS, and many important database breakthroughs came from Jim. Go Bears, yep. So, the two-phase commit protocol — if you give me just a few moments I want to lay it out, and we'll finish it up next time. A persistent, stable log on every machine is a requirement here, because it's possible for faulty machines to agree to do something and then immediately crash, and they need to remember that they agreed, okay? So when a machine crashes, the first thing it always does on recovery is look at the log on its local disk to see what it has promised to do. There are two phases. In the prepare phase, the global coordinator — there's going to be one machine that's the coordinator — requests that all participants promise to either commit or roll back the transaction; that's commit or abort. Participants make a decision, record that decision in their logs, and then acknowledge with that decision. If anyone votes to abort, the coordinator writes abort in its log and tells everybody to abort. Otherwise, if it gets a commit vote from everybody — that is, after all the participants respond that they are prepared — the coordinator writes commit to its log and asks all the nodes to commit; they respond with an ack, and after the coordinator receives the acks, it writes "got commit" to the log. Okay, so the log basically guarantees that machines either commit or don't, and once they've made a decision, they stick with it.
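The two phases just described can be sketched as follows. This is a deliberately simplified, single-process illustration: there's no real network or crash recovery, the per-node "persistent log" is just a list, and all the class and field names are invented for the example.

```python
class Node:
    def __init__(self, name, will_commit=True):
        self.name = name
        self.will_commit = will_commit   # this node's local vote
        self.log = []                    # stands in for the persistent log

def two_phase_commit(coordinator, participants):
    # Phase 1 (prepare): each participant decides, logs its decision,
    # and only then acknowledges with that decision.
    votes = []
    for p in participants:
        vote = "prepare-commit" if p.will_commit else "abort"
        p.log.append(vote)               # logged before the ack is sent
        votes.append(vote)
    # Phase 2: if anyone voted abort, the coordinator logs abort and tells
    # everyone to abort; otherwise it logs commit and asks everyone to commit.
    decision = "commit" if all(v == "prepare-commit" for v in votes) else "abort"
    coordinator.log.append(decision)
    for p in participants:
        p.log.append(decision)           # every node obeys the global decision
    coordinator.log.append("got-acks")   # all acks received
    return decision

coord = Node("coordinator")
workers = [Node("w1"), Node("w2"), Node("w3")]
print(two_phase_commit(coord, workers))  # → commit

workers[1].will_commit = False
print(two_phase_commit(coord, workers))  # → abort
```

The key invariant, even in this toy version, is that no node acts on a decision that isn't in a log first — that's what lets a real implementation recover after a crash by replaying its log.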
And so the algorithm basically has one coordinator and some workers, and the high-level description, just to finish up before we go, is: the coordinator asks all workers if they can commit; if all workers reply yes, it broadcasts a global commit, otherwise a global abort; and the workers obey the global message. Okay, and the persistent log is going to be key here, so that if a machine crashes, when it wakes up it checks its log to recover its state. And the cool thing about this — which is part of why Jim won a Turing Award, among other things — is that the system will either have everybody commit or everybody abort, and they will all do the right thing. There won't be half of them aborting and half of them committing, even if workers crash or the coordinator crashes; it doesn't matter, they'll all still do the same thing. And so this is going to be the first of our distributed protocols that is resilient to failure in the middle and that has well-defined semantics even though it has flaky components. All right, so in conclusion: we've been talking about protocols, which are agreements between two parties as to how information is transmitted. The end-to-end argument is really a philosophy. It encourages us to keep the internal network communication simple: if a higher layer can implement functionality correctly, it should, and functionality should only be implemented in a lower layer if it either improves performance significantly for that application or provides something, like denial-of-service resilience, that you couldn't do at the endpoints — and it doesn't impose a burden on applications that don't require it. So the reason I always talk about end-to-end is that I want you to think through carefully, when you're designing a distributed application and thinking about putting a bunch of things in the middle, whether they're really enhancing your application or not.
And so the end-to-end argument is a starting point for a thought process. We also started talking about two-phase commit, which is distributed decision making: we first make sure everybody guarantees that they'll commit if asked — that's the prepare phase — and then everyone commits properly, and we can do that in a way that works even if some of the components are flaky. All right, we're going to leave it there. I guess I'm a little behind where I thought I was going to be today; that's fine. We'll pick this up on Tuesday: we're going to finish up two-phase commit, talk about more interesting decision making including things like Byzantine agreement, talk about blockchains as well, and then move into some distributed storage applications. So you all have a great weekend and, as I see on the message there, go Bears. Bye now.