Okay. Welcome back, everybody. We're going to pick up where we left off last time, talking about distributed decision-making. Just to set the context: the consensus problem is basically one in which we have many nodes in the system. And in case you were wondering, a node here is a separate physical box that might be connected only via a network. Some nodes can crash and stop responding, and eventually all nodes decide on the same value from a set of proposed values. That's the consensus problem we're trying to solve. The key thing that makes this difficult is that nodes might crash, stop responding, and then come back, and we want to make sure that everybody still does the same thing. That's what we're going after here. A little later we'll also talk about what happens if nodes are actively malicious and trying to screw up the process, but we're not there yet. So distributed decision-making is, for instance, the notion that all of the nodes are going to choose between true and false, or commit and abort; these are equivalent ideas. And it's going to be atomic, in the sense that all of them will decide on true or all of them will decide on false, but we'll never get a mixed grouping. Equally important, but something that sometimes gets forgotten in this whole process, is making sure that once the decision is made, it's not forgotten. If you have a set of nodes that make a decision and then immediately crash and lose all their information, that's as if they never made the decision in the first place. This, just to remind you, is the durability, or D, portion of ACID. In a global-scale system D gets a little trickier, but we talked last time about erasure coding, massive replication, and even blockchains, which have a replication aspect to them, for getting our durability.
So we were at the very end of last lecture talking about two-phase commit, and really we came to two-phase commit because we couldn't solve the Generals' Paradox. If you remember, the Generals' Paradox is that two or more parties have to decide on a time at which to perform some action, like attacking, and the messages going back and forth are unreliable. What we showed is that this is impossible to do, so what we're going to do instead is solve a simpler problem: we're going to get the machines to agree to do something, or not do it, atomically, but we're not going to force them to all agree on a time. All right, so the two-phase commit protocol, not surprisingly, has two phases. The prepare phase is one in which a coordinator (and there is a single coordinator in this) requests that all participants make a promise to either commit or abort, that is, roll back, the transaction. Participants record their promise in a log, and then they acknowledge by saying whether they will commit or abort. The coordinator makes a decision to either commit or abort based on what it hears from the participants: if any of them say they want to abort, the coordinator will abort, and only if all of them say they're going to commit will it actually commit. Okay, so the commit phase is that point after we've heard from everybody and we make a decision: either everybody wants to commit, so we'll commit, or somebody doesn't want to commit, in which case we'll abort.
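To make the coordinator's decision rule concrete, here's a tiny sketch in Python. The function name and the dict-of-votes representation are my own for illustration, not from the lecture; the rule itself is exactly as stated above: unanimous commit votes are required, and a missing vote (a timeout) counts as an abort.

```python
# Sketch of the two-phase commit coordinator's decision rule.
# votes: dict mapping participant id -> "commit" or "abort".
# A participant with no entry in votes is one we timed out waiting for.

def coordinator_decision(votes, num_participants):
    # Missing votes are timeouts, which we treat as abort votes.
    if len(votes) < num_participants:
        return "abort"
    # Commit only on a unanimous vote to commit.
    if all(v == "commit" for v in votes.values()):
        return "commit"
    return "abort"
```

So, for example, two commit votes out of two participants yield a commit, but a single abort vote, or one participant never responding, forces a global abort.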
Okay, and a key aspect of this is the persistent log on every machine that's participating, both the coordinator and all of the participants. What is this log doing for us? It's basically helping those nodes remember what decision they've made, so that if they crash and come back up, they will continue to make the same decision. This is where two-phase commit gets interesting, because what we're trying to do is make this atomic decision, in which everybody makes the same choice and acts on that same choice, regardless of the fact that some of these nodes may crash and come back in the middle of it. Okay, so let's set this up a little bit more; any of you who have actually started experimenting with homework eight will understand what's going on now. We have a question in the chat here, basically saying: not yet convinced by this process; what if the coordinator fails to receive an abort? Well, if the coordinator fails to receive a vote, that's a timeout, and it will assume it's an abort. We can make the decision that any time we don't hear from somebody, we're going to assume they're aborting, and you'll see that that allows us to keep our atomicity in this process. So the coordinator initiates this protocol and asks every machine to vote on the transaction. The two possible votes that the participants can come up with are commit and abort, and the commit will only happen if it's unanimously approved by everybody. The coordinator waits until it receives votes from everybody, and if it times out, it assumes that somebody was going to abort and treats it as such. Now, in the prepare phase, if a machine has decided to agree to commit, then at the point that it's made that decision, it guarantees that it will accept the transaction. And what does it do? Well, first of all, after it's decided that it's going to accept the transaction, it makes a little mark in its log before it responds, so that it will remember the decision. Even if it crashes before it tells its decision to the coordinator, when it comes back up it looks in its log, sees what it decided to do, and keeps with that decision once it's made it. If it agrees to abort instead, we have the same idea: the machine guarantees that it will never accept the transaction, even if it crashes and comes back up again. This is recorded in the log, so the machine will remember that decision if it ever crashes and restarts. In the commit phase, or the finishing phase, the coordinator learns that all machines have agreed to commit, records its decision to commit in the log, applies its transaction, and tells all the voters to go ahead and commit, and we're good to go. Even if the coordinator crashes and comes back up after it's made the decision to commit, it will see that decision in its log. If it comes back up before it's made that decision, it assumes that it's missed some messages from the participants, and it just tells everybody to abort. The abort action is when the coordinator learns that at least one machine has voted to abort: it records its decision in the log and tells the voters to abort. And if you notice, because no machine can take back its decision, because of the log, we get atomicity out of this. Okay, now there's a question here: if a node crashes indefinitely, is a backup node pulled in and used in its place? The answer is no, and as this question identifies, one of the big issues with two-phase commit is that it can potentially block indefinitely in bad circumstances; we'll talk about that in a second. Okay, so two-phase commit by its nature does not have the ability to pull in backups.
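The "make a mark in the log before responding" idea above can be sketched in a few lines. This is a minimal write-ahead-log sketch, assuming a newline-delimited JSON log file; the file format and function names are made up for illustration, and a real system would batch records and handle partial writes, but the essential ordering is shown: the record is forced to stable storage before the participant is allowed to reply.

```python
import json
import os

def record_vote(log_path, txn_id, vote):
    """Durably record our promise BEFORE replying to the coordinator."""
    with open(log_path, "a") as f:
        f.write(json.dumps({"txn": txn_id, "vote": vote}) + "\n")
        f.flush()
        os.fsync(f.fileno())  # force the record onto stable storage

def recover_vote(log_path, txn_id):
    """After a crash, replay the log to find the vote we promised (if any)."""
    if not os.path.exists(log_path):
        return None
    with open(log_path) as f:
        for line in f:
            rec = json.loads(line)
            if rec["txn"] == txn_id:
                return rec["vote"]
    return None
```

The point of the `fsync` is exactly the lecture's durability argument: if the machine crashes right after replying, the promise survives the crash, so on restart the node makes the same decision it made before.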
Okay, it's a simple algorithm, and we'll go from there. So here we go with an example, just so that you know about it. The coordinator says: I'd like to know what you want. The workers wait for that, and if they're ready to commit they send back a commit; if they're ready to abort they send back an abort. The coordinator waits until it hears from everybody: if it gets a vote-commit from everyone, it sends a global-commit; otherwise it sends a global-abort. And then finally, the worker waits until it hears the status from the coordinator: if it's a global-commit it does a commit operation, and if it's an abort it does an abort operation. Notice, by the way, that regardless of what the worker decided to do during the first phase, if it hears that it's supposed to abort during the second phase, it will abort. So here are some examples. For instance, here's a failure-free example: the coordinator says vote-request, they all say commit, the coordinator says global-commit, and we're good to go; everybody commits. Now, this might be a good time to ask: what does it mean to commit? Remember for a moment that this algorithm is really just trying to make a decision, commit or abort, that's it, and it's an atomic decision. They all make that decision to do something, but what it is that they do is whatever you're applying it to. For instance, it could be an update to a database: if it's a key-value store, you're going to add this key with this value to the global data store, and the global-commit would be that everybody has agreed that should happen. So the commit and abort actions are basically the global decision on what to do with some action proposed prior to that. Now, you can also view this as a state machine on both the coordinator and the workers. The coordinator has a simple state machine with four states. It starts in Init, and when it's ready to start, it sends out a vote-request and then waits until it hears from everybody. If it hears a vote-commit from everybody, it goes forward; if it hears a vote-abort from anybody, it sends out a global-abort. Okay, the worker looks like this: the worker also has an Init state, and if it hears a vote-request and it wants to commit, it goes to the Ready state and waits; otherwise, if it wants to abort, it goes straight to an Abort state. Now, the question of what happens if a worker misses a global-commit: a good example of that would be if it crashes and comes back up. At that point it can't make any forward progress, because it doesn't know what the decision of the coordinator is, so the worker can start polling the coordinator to see whether there's a decision yet. All right, so how do we deal with worker failures, for instance? If you notice, and this is a good segue from the previous question, the state that I've colored in red is one where the coordinator is waiting to hear. Failure really only affects states in which the coordinator is waiting for messages, and the coordinator only waits for votes in the Wait state. If it doesn't receive all n votes, it times out and sends an abort. The way this protocol is set up, whatever failure cases might happen, we always keep the main constraint: either everybody agrees to commit or everybody agrees to abort, and we never get a fifty-fifty where some of them do one thing and some do the other. So here's an example of a worker failure: the coordinator sends out the requests, but only two of the votes come back and nothing else happens. Eventually there's a timeout, and at that point the decision is made to abort. Since that's recorded in the log, by the way, even if the coordinator crashes and comes back up, there will never be confusion about the outcome.
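The worker's state machine described above can be written down as a small transition table. This is my own reconstruction of the lecture's diagram, not code from the course: states are Init, Ready, Commit, Abort, and the important detail is that there is deliberately no timeout transition out of Ready, because a worker that has voted commit must block until it hears the coordinator's global decision.

```python
# Worker state machine for two-phase commit, reconstructed from the
# lecture's description. Unknown (state, event) pairs leave the state
# unchanged, which models "keep waiting".

WORKER_TRANSITIONS = {
    ("INIT", "vote_request_commit"): "READY",   # voted commit, now wait
    ("INIT", "vote_request_abort"): "ABORT",    # unilateral abort
    ("INIT", "timeout"): "ABORT",               # never heard a vote-request
    ("READY", "global_commit"): "COMMIT",
    ("READY", "global_abort"): "ABORT",
    # Note: no ("READY", "timeout") entry. Once a worker has said
    # vote-commit, it must block and poll the coordinator; this is the
    # blocking behavior discussed in the lecture.
}

def worker_step(state, event):
    return WORKER_TRANSITIONS.get((state, event), state)
```

For example, a worker in Ready that times out simply stays in Ready; only a global message from the coordinator moves it forward.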
And if worker three eventually reboots, it can ask the coordinator and find out that abort was what happened. So how do workers deal with coordinator failure? This is a little more interesting. We wait in one of a couple of places. One, we wait for the vote-request from the coordinator in Init, and in that instance, if nothing happens and we time out, we can just abort. If we want to commit, we go to the Ready state and wait to find out what the decision was; if we decide to abort, then we already know what the decision is going to be, which is abort. So a worker waits for the vote-request in Init, and there workers can time out and abort. A worker waits for the global-* message in Ready, and if the coordinator fails, the worker has to wait. The reason for this is that once the worker has said vote-commit, it has to find out what the decision was; it has no idea, and the only way it can move forward is by hearing from the coordinator. So in this instance, what happens is we just have to stall; there's really no other option, and this is part of why this is a blocking protocol, as you can see. Here's an example of the coordinator failing: maybe it sends out the vote-request but nothing else happens; we eventually time out, and all the workers abort in the Init state. Or here, it sends out the vote-request, we go forward with everybody saying commit, and then the coordinator crashes. At that point the workers can't do anything; they're all waiting in that Ready state. But when the coordinator comes back up, it figures out that it has missed some messages, and it just says abort. All right, so all nodes have to use stable storage to store the current state. Stable storage is basically non-volatile: it could be a disk, an SSD, NVRAM, whatever. On recovery, nodes can restore the state and resume; you know, the coordinator aborts in Init, the worker aborts, et cetera; we can list all of these out. The key thing is that this algorithm is one such that no matter when and how the nodes crash and come back up, if we do the right thing with the log, we'll always maintain our atomic behavior, where they all either decide to commit or they all decide to abort. So really the key issue here is blocking while waiting for the coordinator to recover. A worker waiting for the global decision can ask fellow workers about their state, for instance: if we're in the Ready state and we don't know what's going on, we can ask other workers, and if they've already gotten a global-commit, we can take that decision and move forward. Okay, the question that's come up is: does some variation on this system allow for non-unanimous decision-making, i.e., simple majority voting? The answer is: not two-phase commit. Two-phase commit is all or nothing; we'll talk about some majority-voting kinds of options in just a moment. If another worker is in Abort or Commit, then the coordinator must have sent a global message, and the worker can safely abort or commit, respectively. So there are cases in which we can exchange information between the workers to find out what happened, if we missed it or it was lost. If another worker is still in the Init state, then both workers can decide to abort. But if all workers are in Ready, then we really need to block, because we do not know what the coordinator is going to do. You might guess that because they're all in Ready, they all said vote-commit, and so therefore the coordinator is going to choose to commit; but in fact the coordinator might have crashed, come back up, and lost its state, in which case it's going to abort. So we really have to wait when we're all in Ready to move forward. Okay, so why is distributed decision-making desirable? The answer is fault tolerance.
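The "ask your fellow workers" rules above fit in a few lines. A minimal sketch, assuming (unrealistically) that we can see every peer's current state at once; the function name is mine, but each branch corresponds directly to a case from the lecture.

```python
# Cooperative termination for a worker stuck in READY, per the lecture:
# - a peer in COMMIT/ABORT has seen the global decision, so adopt it;
# - a peer still in INIT never voted commit, so the coordinator cannot
#   have decided commit, and aborting is safe;
# - if every peer is READY, we cannot tell what the coordinator will do
#   (it may have crashed, lost state, and will abort), so we must block.

def cooperative_decision(peer_states):
    if "COMMIT" in peer_states:
        return "commit"   # coordinator must have sent global-commit
    if "ABORT" in peer_states:
        return "abort"    # coordinator aborted, or a peer voted abort
    if "INIT" in peer_states:
        return "abort"    # unanimous commit vote is impossible
    return "block"        # everyone READY: wait for the coordinator
```

Note that the all-Ready case is precisely where two-phase commit blocks; no amount of gossip among the workers can resolve it.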
We want a bunch of nodes together making a decision, and we'll wait until they all make the same decision; then we know that we aren't making our decisions based on faulty information. A group of machines comes to a decision even if one or more of them fail during the process, because they come back up again eventually and move forward. And after the decision is made, the result is recorded in many places, so we'll know what the decision was even if nodes subsequently crash. So why is two-phase commit not subject to the Generals' Paradox? I actually saw somebody ask that question on Piazza, too. Really, two-phase commit is about all nodes eventually coming to the same decision, but not necessarily at the same time. We're allowing reboot-and-continue, over and over, to gather information, so that eventually they all have their decision recorded in their logs and we can make an atomic decision. The Generals' Paradox had the problem that we were never quite sure our messages made it through, and therefore there was no way to settle on a time for sure. Okay, so an undesirable feature of two-phase commit is blocking, as we've mentioned: one machine can be stalled until another site recovers. Say site B writes prepared-to-commit in its log and sends a yes vote to the coordinator, site A, and then site A crashes. B wakes up, checks its log, realizes that it has voted yes, and now B is basically blocked until A comes back. There really is nothing in this protocol that allows nodes to be absent indefinitely without someone blocking forever. Okay, so that's an issue. And a blocked site essentially holds resources, which might be locks, or pages pinned in memory, or whatever, until it learns the fate of the update, and so that's an issue. All right, so there are a number of interesting alternatives to two-phase commit. There's three-phase commit, which adds one more phase and allows nodes to fail without blocking forever; that's more of a majority-voting kind of scenario, and it's a little better. Paxos is a very popular example: an alternative that's used by Google and others, and it does not have the two-phase commit blocking problem. It's another protocol developed by Leslie Lamport; I mentioned him last time. There's no fixed leader, and it can choose a new leader on the fly and deal with failure, so it's extremely fault tolerant. There are some who would claim that it's also extremely complex, and there have even been a lot of papers about taming the complexity of Paxos and so on, but Google seems to have done so, and they're using it actively. Raft is a variant developed at Stanford as an alternative to Paxos; the claim of John Ousterhout at Stanford and his students is essentially that it is much easier to understand and therefore much easier to implement correctly. That's another fault-tolerant version. But what's interesting is what happens if one or more of the nodes is malicious, where malicious means actively trying to screw up the protocol. All right, hold on a second; I'm going to pause. I'll be right back. Sorry about that; I'm back. So the question on the chat here is: which protocol is commonly used in industry? Right now Paxos is pretty commonly used. Two- and three-phase commit have been common for a long time in databases, distributed databases, but Paxos is used pretty widely by Google, and there are some libraries that implement Paxos. There's another question here about workers waiting once the server crashes; I'm not sure I understand that question; did I miss the middle of it? Okay. Now, what's interesting about these other protocols up top here is that if a node is malicious, which means it's been broken into or it's running a version of the protocol that is designed to mess up the decision-making, then they are not resilient against that. Even though things like three-phase commit, Paxos, Raft, et cetera might be resilient against failures and manage to move forward as long as, say, a majority are still functioning properly, if one of those nodes is actually malicious, then they're not. So this becomes an interesting question: what do we do in that instance? Okay, so we have another Leslie Lamport paper that was quite interesting; I'll put both the Paxos paper and the Byzantine Generals paper up on the resources page, if I haven't done that already. The Byzantine Generals Problem is as follows. There's one general and n minus one lieutenants, so n total participants, and some number of these are malicious, or insane, or going to act weirdly, incorrectly, or maliciously. The question is: what do we do then? Before we can actually solve the problem, we really have to figure out what we want as our semantics. What we'd like is this: the commanding general sends an order to all of his lieutenants, and we'd like the following integrity constraints to apply. IC1 says all loyal lieutenants, which are those that are not malicious, obey the same order. If you notice here, these two lieutenants, the ones in the red hats, are both deciding to attack. And IC2 says that if the general is loyal, not malicious, then the loyal lieutenants do what the general says. Okay, so how could a general be malicious? Well, a general could tell each of his different lieutenants to do something different. In that instance, we say the general is malicious, and what happens then is that the remaining loyal lieutenants all still do the same thing. So we always have IC1, and in the good instance, in which the general is loyal, they also do what the general wants. Now notice that I've introduced some terminology, or some notation: there are going to be f malicious entities in the system and n total entities. What's interesting about the original paper from Leslie Lamport is that he shows we can't solve the Byzantine Generals Problem with n equals three, because even one malicious player can basically mess everything up. Here we have an instance where one lieutenant is malicious; the general is not, and this other lieutenant is not, so the blue ones are not malicious. The general tells each of them to attack, but the blue lieutenant has no idea whether the general is malicious or not, so it's got to find out what the tan lieutenant says, and the tan lieutenant says: well, the general told me to retreat. Now this blue lieutenant is stuck, because it has no way to make a decision that will satisfy both of the interactive consistency constraints, IC1 and IC2. In the case of the general being malicious, where he says attack to one and retreat to the other, the poor lieutenant on the left is once again lost, because he's hearing attack from the general and retreat from the lieutenant. And if you notice, these two scenarios, the one on the left and the one on the right, look the same as far as this good lieutenant is concerned. So the impossibility result says you can't solve this problem with n equals three, and in fact it quickly generalizes to show that if you have f faults, that is, f malicious nodes, then you have to have n > 3f total participants in order to solve the problem. Okay, so there are a bunch of algorithms that exist to solve the problem. The original algorithm was purely a thought exercise, because it was exponential in n, which is never great, right? Newer algorithms, although "new" is perhaps overstating it since they're from 1999, have a complexity that's about O(n²) messages in the number of nodes. A message complexity of n² is doable, but you're probably not going to want n to be too big.
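The n > 3f bound above is worth writing down, since it comes up constantly with BFT systems. A one-line helper (my own naming) that inverts the bound: given n participants, the largest tolerable number of malicious nodes.

```python
def max_byzantine_faults(n):
    """Largest f such that n > 3f holds, i.e. f = floor((n - 1) / 3)."""
    return (n - 1) // 3
```

So with n = 3 you can tolerate no malicious nodes at all (the impossibility case above), n = 4 tolerates one, and n = 10 tolerates three.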
Okay. And I will say that I've even designed systems using the MIT version of the Byzantine Generals solution, where we kept n to four, seven, or ten, and not too much bigger, because the message complexity gets pretty expensive at that point. This Byzantine Fault Tolerant algorithm, BFT, is what the Castro and Liskov algorithm is called, and it basically allows multiple machines to make a coordinated decision even if some subset of them, fewer than n/3, are malicious. So what you can think of, going back to our earlier discussion of distributed decision-making, is that requests come in from somebody, a client or one of the participants, and they go through this decision mill where they're running the n² algorithm that solves the Byzantine Generals Problem. As long as these red malicious nodes are fewer than n/3 of the total, what comes out are distributed decisions that are agreed upon by all the non-malicious parties. So that's a pretty key advantage of a good Byzantine Generals solution: we can have a coordinated set of nodes that together come to a decision even if a few of them are malicious. All right, questions on this? And notice, by the way, in reference to some questions earlier: here more than two-thirds of the nodes have to be non-malicious in order to solve this, whereas in the previous algorithms, which aren't tolerant of malicious nodes, you only need more than half of the nodes to be non-faulty; that is, you can have up to half of them be faulty. All right, so now enter blockchain. What's interesting these days is of course that there's lots of discussion of blockchain, since the 2009 introduction of Bitcoin, way back when; it's hard to believe that's been over a decade ago. What's interesting about blockchain algorithms, well, let's start with what a blockchain is. A blockchain is a set of transactions that are back-linked with hash pointers. Those of you who have taken 161 or know something about security will know that what this really means is you take the contents of the block on the left and you run a cryptographically secure hash over it, like SHA-256, and you put the resulting hash into the next block. As a result, as long as you know that hash, it's impossible to insert something on the left without being detected. The way a blockchain typically works is that mostly everything is in a single chain, except at the very head, where new transactions are being added; in those cases there are some possibilities for the new head, some branches, and what happens is that eventually the branch with the longest chain probabilistically becomes the final head. If you run this long enough, all the new stuff eventually looks like what I have on the left here. Okay, so a blockchain itself is a chain of blocks connected by hashes to a root block; the chain has no branches except at the heads, and blocks are an authentic part of the chain when they have the right authenticity info in them. Now, if you've taken 161 and we weren't talking about Bitcoin or something like that, you'd probably think that something is authentic because there's a signature on it. Well, that's one way to do it. In Bitcoin or Ethereum or some of these other blockchains, what actually happens is that the head is chosen by some consensus algorithm, and in many of them the head is basically chosen by solving a really hard problem, some extensive search of the hash space, to find cryptographic proof that this is the right head. And this is the job of the miners, who try to find a way to put some set of bits into the block such that when you take a hash, the resulting hash has some number of leading zeros; we can talk about that offline or at office hours if you're interested. This is called a proof of work, because you have to burn a lot of cycles on a processor and a lot of energy to get it.
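The back-linking with hash pointers described above is easy to demonstrate. A minimal sketch, assuming blocks are plain dicts serialized with JSON (real blockchains use carefully specified binary block formats, and this omits proof of work entirely): each block stores the SHA-256 of the previous block, so altering any earlier block changes every later hash and is detectable.

```python
import hashlib
import json

def block_hash(block):
    # Deterministic serialization, then SHA-256 over the bytes.
    data = json.dumps(block, sort_keys=True).encode()
    return hashlib.sha256(data).hexdigest()

def append_block(chain, transactions):
    prev = block_hash(chain[-1]) if chain else "0" * 64  # root sentinel
    chain.append({"prev_hash": prev, "txns": transactions})

def verify_chain(chain):
    """Check every back-link; any tampering upstream breaks a link."""
    for i in range(1, len(chain)):
        if chain[i]["prev_hash"] != block_hash(chain[i - 1]):
            return False
    return True
```

Build a two-block chain and it verifies; rewrite a transaction in the first block and verification fails, which is exactly the "can't insert something on the left without being detected" property.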
The selected blocks up here presumably already have the proof of work in them. This hashed one I've got in green is an example of one that has a proof of work but isn't known by everybody yet, so it's not considered part of the final chain; it's still tentative. This is a longest-chain-wins kind of scenario. Now, why this is good for Bitcoin is that these transactions represent the exchange of money. If you buy coffee with some bitcoin, that's an awful lot of coffee these days. It used to be that bitcoins were worth seventy dollars, or a few dollars, and it made sense to spend some micro-bitcoins on coffee; now they're worth thousands of dollars, and it's a little bit less obvious that you want to do that. But you might ask: is this blockchain algorithm a distributed decision-making algorithm? The answer is that you really can think of it that way, because once some item, some choice of commit or abort, is in one of these solid green blocks that is part of the long chain, you can't change it; it's now a distributed decision that everybody will agree on. So if you look at the way a typical blockchain algorithm might work: here's the cloud, and you've got these miners around the world trying to solve these proof-of-work problems. What happens is they're basically copying information to each other, and as soon as somebody solves a proof of work, it's very quickly replicated to everybody else. That person's success at solving the problem gets them a few fractions of a bitcoin, everybody hears about it, and that becomes the new head of the chain. The way you'd use this for distributed decision-making is that you'd make a proposal to one of the miners that, instead of being like a Bitcoin transaction, would be something like: I would like to commit the following record to my distributed database. It doesn't matter who you send it to; eventually they send it to everybody else, those transactions get put into the blockchain, and they become distributed decisions. A decision in this case means the proposal is locked into the blockchain: it could be a commit or abort decision, a choice of some value, a state transition, whatever. If you make a proposal and you get a NACK back, you might have to retry because something went wrong, but once it's in the blockchain, everybody can observe it. And those of you who know anything about Bitcoin know that there's a much smaller number of miners than there are people observing and using the blockchain, and pretty much anybody who gets a copy of the blockchain can verify the decision. So we have the nice property with blockchains that the decisions locked into the blockchain can be verified by everybody. So I would say that yes, the blockchain is a distributed decision-making algorithm, and interestingly enough, there are a number of systems out there now that use, not necessarily the Bitcoin blockchain, but other blockchains, to solve the Byzantine agreement problem despite the fact that there are malicious parties in the system. And what's interesting is that whereas back here, when we talked about, say, the MIT BFT algorithm, which takes n² messages, blockchain-style Byzantine Generals solutions tend to be closer to linear in the number of nodes, so you can have many more nodes involved. That's interesting, and it's kind of exciting to see where this goes. These Byzantine agreement algorithms are relatively recent, within the last five years or so. Now, there's a question on the chat asking me to say a little bit more about what the block contains. I'm not going to say too much more, because I don't want to spend too much time here, but take one of these green blocks: what it is, is a series of transactions from different people. When different people propose a transaction, it's epidemically sent to everybody, and what the miners do is collect all the new transactions into a block and then start adding numbers to that block and hashing it. That's the problem they're solving: the first one to figure out which four bytes to add to the set of transactions, such that the new hash over it has a number of zeros specified by the current state of the system, solves the problem and gets the coins. From the standpoint of our discussion of Byzantine agreement here, really, the proposals go into these green blocks, and those proposals are things like commit or abort, usually with something attached like: commit this key-value pair to my global store. Okay, I hope that helps a little bit with the question on the chat. The other thing I will point out is that you can sort of see one of the big problems here. There are a lot of people who like to think of blockchains as the solution to all of the world's problems, and what they do is talk about rather silly things like: I'm going to put my videos into the blockchain, because then they're guaranteed to be authentic and everybody can verify that. Well, if you look at what's going on here, any data you put in the blockchain gets replicated all over the world, and it's an extremely expensive process, so putting everything into the blockchain is actually almost a non-starter, although many people forget that and are doing it anyway. There's another question here on the chat, but let's talk about that offline if we could, Jeffrey; I think that's a more extensive question. Okay, so anyone in the world can verify the result of the decision-making.
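The miners' search just described can be shown with a toy version. This is only the flavor of the idea: real Bitcoin hashes a structured block header against a numeric difficulty target using double SHA-256, while this sketch (names and string format mine) just tries successive nonces until the hash of the block data plus the nonce starts with some number of zero hex digits.

```python
import hashlib

def mine(block_data, difficulty=4):
    """Toy proof of work: find a nonce giving `difficulty` leading zero
    hex digits. Each extra digit multiplies the expected work by 16,
    which is why real mining burns so many cycles."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{block_data}{nonce}".encode()).hexdigest()
        if digest.startswith("0" * difficulty):
            return nonce, digest
        nonce += 1
```

Note the asymmetry that makes this useful: finding the nonce takes many hash attempts, but anyone handed the (data, nonce) pair can verify it with a single hash, which is exactly the "anyone in the world can verify" property.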
verify the result of the decision-making. All right, now I want to switch gears a little bit. You can Google "blockchain," by the way; there's a lot of really interesting information out there. So let's move forward a bit. There are many levels of networking protocol, and what I want to do, before we move forward with some of the more interesting distributed file storage systems and peer-to-peer protocols, is very quickly establish some common networking terminology for everybody. Many of you have probably taken CS 168, so some of this will be things you've heard before, but I just want to make sure we all have it. Networking protocols are abstracted at a number of different levels. There's the physical level, which is the mechanical and electrical network itself, roughly how the zeros and ones are represented. There's the link level, which is how to actually transmit small packets over those physical links. And then what's more interesting for us in this class, at least, is the network and transport level, where we put together small packets into bigger ones that are reliable, and figure out how to deliver a packet from here to the other side of the world and hand it to the right application on a particular node. So, protocols on today's internet: the slide shows these layers, really four of them. The physical and link layer is something like Ethernet, Wi-Fi, or LTE; that's one hop's worth of communication. The network layer ties hops together with IP to transmit data over multiple hops, and then the transport layer starts worrying about things like how to deliver to an application directly, reliability, and so on. To start, I want to say a tiny bit about the physical and link layer, and to do that we're going to talk about broadcast networks. A broadcast network is
a shared communication medium, and although it doesn't have to be wireless, it can be a wired situation. In the upper right corner, just think of it as a broadcast network where we're sending to everybody who can hear us. The equivalent view of this might be a bus, where a processor, a bunch of I/O devices, and memory are all attached to the same wires, and as a result, when the processor sends out a request, everybody can listen in. So the shared medium could be a set of wires, or it could be the space around a Wi-Fi access point, etc. What's perhaps interesting here is that Ethernet, in its original incarnation, was actually used as a broadcast medium: you had a whole bunch of machines, here three workstations and, say, a router, all connected to the same cable, and all communication went to everybody all the time, within a local subnet. There are many examples of this: cellular phones, CDMA, LTE, Wi-Fi, etc.

What's interesting about a broadcast network is that when I'm sending from, say, node 3 to node 2 over that broadcast medium, every one of these nodes has to listen at least until it gets the header, to know whether it needs to read the packet or can just ignore it. That header address is typically called a media access control address, a MAC address, and most of the ones you're going to encounter now are 48-bit physical addresses. In theory, which is kind of amusing, they're supposed to be unique: every device everywhere in the world is supposed to have a unique 48-bit address. There's a specific way to interpret the fields in those 48 bits, having to do with the manufacturer, which item number it is, and so on. There are some reserved bits that are supposed to be settable, so those are not necessarily unique, and any of you who have played a little with the networking stack on your machine know that in many cases you can just set a software version of the address into the network card, and it will ignore its own ID in favor of the one you tell it. So this idea of MAC addresses being unique is more aspirational than real, but every card that comes out of the factory should have a unique address.

So how do you deliver a packet when you broadcast it? You put a header on the front containing the MAC address; everybody gets the packet and discards it if they're not the target, and typically this is all done in hardware, so the software stack doesn't have to deal with it much.

Now, I did want to give you one interesting tidbit. As you can imagine, if everybody is on a broadcast medium and multiple nodes start talking at once, you're going to get chaos. So how do we deal with that? The answer is something called CSMA/CD: carrier sense multiple access with collision detection. It's from the early 1980s; Ethernet was the first practical local area network, and most of the Ethernet protocol has survived for nearly 40 years. It uses a wire instead of radio, but it's still a broadcast medium, and the key advance that made it work was this arbitration mechanism, CSMA/CD. Here's how it works. Everybody attached to the network emits a carrier when they start talking, so the rule is: before you talk, you listen. That's carrier sense. If you hear somebody talking, you just don't say anything until they're done; that's a way to avoid talking over people. However, it's possible that two nodes start talking at exactly the same time, because neither heard the other. So while they're transmitting, both nodes are also listening to the medium, to notice when there is a collision.
If there's a collision, both nodes stop talking, back off, and retry later. The backoff scheme is basically choosing how long to wait before trying again. How do you determine that? Well, if everybody always waits the same amount of time, you're just going to collide over and over again. So instead there's a mechanism for randomly backing off: an adaptive, randomized waiting strategy. You don't want to wait too long, because that destroys your bandwidth, so what you'd like to do is figure out how long to wait, but do so randomly. What happens is that the first time, you pick a random wait time within a small interval, and for every collision you double your interval, so the average wait time keeps increasing by a factor of two with each collision until you eventually get to go. What's nice about this CSMA/CD protocol is that it automatically figures out, probabilistically, how far to back off, so two people trying to talk produce a different backoff process than four people trying to talk. This works remarkably well, and most of the Ethernet stacks you're going to run into still do this backoff. So that basically gives us a way to deal with a broadcast medium even when there are multiple people on it. There's a question about how the sender checks for collisions: you notice that your bits are being trampled on by somebody else's, so you can see that the checks on what you're hearing start failing.

Let me say a little more about the MAC address for a moment. It's a unique physical address at the interface; if you were to take CS 168, you'd see that this is typically at the physical and data-link layers.
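The randomized exponential backoff just described can be sketched in a few lines of Python. This is a toy model, not real driver code: slot times are abstracted to integers, and the cap of ten doublings mirrors classic Ethernet's limit.

```python
import random

def backoff_slots(collisions, max_exp=10):
    """Pick a random wait (in slot times) after the given number of
    consecutive collisions. The interval doubles with each collision,
    capped at 2**max_exp slots, as in classic Ethernet."""
    exp = min(collisions, max_exp)
    return random.randint(0, 2 ** exp - 1)

# After the 1st collision: wait 0 or 1 slots; after the 4th: 0..15 slots.
print(backoff_slots(1))   # somewhere in [0, 1]
print(backoff_slots(4))   # somewhere in [0, 15]
```

Because each contender draws independently from a growing interval, two contenders tend to separate after a collision or two, while heavier contention automatically spreads retries over a wider window.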
One thing I wanted to mention: for those of you who take a look at your phones, you can see that the Wi-Fi hardware has one of these 48-bit addresses; notice it's six pairs of hex digits, so 48 bits. Also, if you run ipconfig on a Windows box (or ifconfig on Unix), you can see that the wireless LAN adapter has a 48-bit MAC address, and your Ethernet adapter has one as well. Every physical network interface in the system has a MAC address.

So you might ask yourself: why have a shared bus at all? Why not simplify and only have point-to-point links? The answer is that originally it wasn't cost-effective; it was much easier to drop a cable that snaked around the whole floor. In fact, where I was a graduate student, we had one of these up in the ceiling; it dropped down and we attached it to every one of our machines, and there was one shared medium for every machine on the floor. However, you can imagine that has bandwidth issues, because you'd like a point-to-point network where only the communicating parties are actually using the wire: a network in which every physical wire connects only two computers. How do we do that? We get a switch. The switch is a piece of hardware giving us point-to-point connections; it's a bridge that transforms the shared broadcast medium into a point-to-point network, and you can buy switches pretty much anywhere. Even though you can in principle broadcast or multicast to everybody connected to the same switch, if you're just doing point-to-point communication, the switch adaptively learns where all the MAC addresses are, and when you send a message addressed to a particular MAC address, the switch just routes it internally to the right port.
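To make that "adaptively learns" behavior concrete, here's a minimal Python sketch of a learning switch. It's a toy model: MAC addresses are just strings and ports just integers, nothing like how real switch silicon is organized.

```python
class LearningSwitch:
    """Toy MAC-learning switch: remember which port each source MAC
    was last seen on; flood when the destination is still unknown."""

    def __init__(self, num_ports):
        self.ports = range(num_ports)
        self.mac_table = {}               # MAC address -> port number

    def forward(self, in_port, src_mac, dst_mac):
        self.mac_table[src_mac] = in_port          # learn the sender's port
        if dst_mac in self.mac_table:
            return [self.mac_table[dst_mac]]       # point-to-point delivery
        return [p for p in self.ports if p != in_port]  # flood everywhere else

sw = LearningSwitch(4)
print(sw.forward(0, "aa:aa", "bb:bb"))  # unknown dst: flood to [1, 2, 3]
print(sw.forward(1, "bb:bb", "aa:aa"))  # learned earlier: deliver to [0]
```

After the first exchange in each direction, traffic between two hosts flows only between their two ports, which is what frees up bandwidth for other pairs.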
Now I can have as many pairs communicating at once as my switch's bandwidth allows. A little different from that is a router. What a router does is transfer packets from one switch domain to another; when you go across a routing domain, packets are not routed by MAC address, so something else has to happen, and this is the point at which IP comes into play. All right, we'll take our brief break here and be back in just a second.

Okay, so IP, as you're all aware, is the protocol that has really taken off. It wasn't the only protocol originally for routing across physical domains, but now it has pretty much taken over. Basically, it's a way of getting packets from some source to some destination no matter how far away it is. This is the internet's network layer, the third-layer protocol if you're taking CS 168, and the service it provides is "best effort." What does that mean? It means that when I send packets from source to destination, they can get lost, corrupted, duplicated, or arrive out of order. So you might say it doesn't guarantee much, but surprisingly, it guarantees enough to build the very interesting applications that we all know and love. The IP packet itself is called a datagram, and IP is a datagram service that can route from source to destination across many hops, across the planet; it's remarkable that it works as well as it does.

There are IPv4 and IPv6 addresses. The IPv4 address space, which is still much more common, uses an address that is a 32-bit integer; notice that's different from our 48-bit MAC address. It's the destination of the IP packet, and it's often written as four dot-separated integers.
For example, at one point the file server for CS was 169.229.60.83; sometimes you see it written as hex digits, 0xA9E53C53. A host on the internet is a computer connected directly to the internet, and a host has one or more IP addresses used for routing. Some of them may be private and unavailable for routing, so not every 32-bit address can be reached from everywhere; I'll say more about that in a moment. Groups of machines may actually share a single IP address, and in that case we get what's called network address translation (NAT). I'm sure many of you who have networks in your house have a router to a service provider like Comcast or whoever, and that router connects to all of your devices, either wired or Wi-Fi. What happens in that instance is that the world sees your house as a single IP address, but inside you have local addresses. Network address translation rewrites a query from your laptop, going through the gateway to the world, from the local address your laptop has to the gateway's public address, and it does so in a way that keeps each connection unique and allows it to work. What's interesting about this is that the number of network-connected devices in the world is tremendously larger than what you'd count by enumerating the 32-bit addresses reachable on the public internet. Network address translation gives us that capability.

Now, a subnet is a network connecting hosts with related IP addresses; typically it's, for instance, the broadcast domain or switch domain I mentioned earlier. A subnet is identified by a 32-bit value with the bits that differ set to zero. An example might be 128.32.131.0/24, or written another way, 128.32.131.XX. What this says is that every host address that matches in the first 24 bits is considered together, on the same subnet. Typically that's a set of machines all connected either to a common switch or to the same physical network. Often there's a mask used to identify the subnet: in the address 128.32.131.0/24, the first 24 bits represent a unique subnet and the last eight bits represent the host, and the mask, 255.255.255.0, is 24 ones followed by 8 zeros. When you AND it onto a 128.32.131.x address, you get only the 24 bits that represent the subnet. Often, routing within the subnet is done by MAC addresses via the switches, not necessarily by IP addresses; in Soda Hall, for instance, there's a lot of MAC-address routing going on within the subnets to make things fast.

So, address ranges in IP. I'm not going to go into this in great detail, but back when IP first became very popular, there were what were called Class A networks, in which only the first octet identifies the network: 10.x.x.x, 6.x.x.x, or 127.x.x.x are Class A addresses. MIT, for instance, is 18.x.x.x, which means all 2^24 host addresses there are owned by that organization. There's a question about the difference between a MAC address and an IP address; we'll say more in a moment, but a MAC address is 48 bits and only routes within a switch domain, while IP is what routes across switch domains, through routers. Think of a MAC address as a physical address attached to an actual network card, and an IP address as a virtual host address used for routing at the larger scale. So Class A is basically a /8 network where the first eight bits identify the network.
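The mask arithmetic described a moment ago, ANDing 255.255.255.0 onto an address to keep the 24 subnet bits, is just a bitwise AND on 32-bit integers. A small sketch using Python's standard ipaddress module:

```python
import ipaddress

def same_subnet(addr_a, addr_b, mask):
    """Two hosts are on the same subnet iff ANDing each address
    with the mask yields the same network bits."""
    a = int(ipaddress.IPv4Address(addr_a))
    b = int(ipaddress.IPv4Address(addr_b))
    m = int(ipaddress.IPv4Address(mask))
    return (a & m) == (b & m)

# 255.255.255.0 keeps the first 24 bits (the /24 subnet), zeroing the host part.
print(same_subnet("128.32.131.10", "128.32.131.250", "255.255.255.0"))  # True
print(same_subnet("128.32.131.10", "128.32.132.10",  "255.255.255.0"))  # False
```

The two hypothetical hosts on 128.32.131.x land in the same /24; changing the third octet changes the network bits, so the AND results differ.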
Similarly, Class B fixes the first 16 bits and Class C the first 24. Some ranges are designated private networks: for instance, 10.x.x.x is a private Class A address, so if you use the VPN at Berkeley and log in, you'll find your computer has a 10.x address associated with that VPN. And commonly, if you buy a router at Best Buy and put it on your network, you'll see that 192.168.x.x is a very common private Class C range that's used a lot. Oh, by the way, "how are MAC addresses different from Ethernet addresses?" I guess I didn't quite answer that question from the chat: Ethernet addresses are MAC addresses; they're the same thing.

Address ranges are often owned by organizations and can be further subdivided into subnets. For instance, as I said, MIT is one of the few institutions that actually has a Class A address, and they certainly don't have 2^24 hosts tied to a single physical domain; instead, the range is divided into a bunch of subnets, which are then physical domains, but MIT has full control over all of those addresses.

Now, the IP format, as you've seen if you take CS 168, is a set of bytes that goes in front of the data. It's a well-defined format, and I'm not going to go into it in detail, but you can see in the packet header there's a 4 in there for IPv4, the total length of the packet in bytes, some flags, a checksum, and then the source and destination IP addresses: the source address is where it comes from, and the destination address is where it's going. That's the basic IP datagram, and it's sent unreliably from one host to another.
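To make those header fields concrete, here's a sketch that packs and unpacks the fixed 20-byte IPv4 header with Python's struct module. The field layout follows the standard IPv4 format; the source is the file server address from the slides, and the destination is a made-up host on the 128.32.131.0/24 subnet discussed earlier.

```python
import struct

def parse_ipv4_header(raw):
    """Unpack the fixed 20-byte IPv4 header: version, total length,
    TTL, protocol, source, and destination."""
    (ver_ihl, tos, total_len, ident, flags_frag,
     ttl, proto, checksum, src, dst) = struct.unpack("!BBHHHBBHII", raw[:20])
    dotted = lambda n: ".".join(str((n >> s) & 0xFF) for s in (24, 16, 8, 0))
    return {
        "version": ver_ihl >> 4,
        "total_length": total_len,   # in bytes, header included
        "ttl": ttl,
        "protocol": proto,           # 17 = UDP, 6 = TCP
        "src": dotted(src),
        "dst": dotted(dst),
    }

# A hand-built header: IPv4, total length 20, TTL 64, protocol 17 (UDP),
# from 169.229.60.83 (0xA9E53C53) to 128.32.131.10 (0x8020830A).
hdr = struct.pack("!BBHHHBBHII", 0x45, 0, 20, 0, 0, 64, 17, 0,
                  0xA9E53C53, 0x8020830A)
print(parse_ipv4_header(hdr))
```

Note how the dotted-quad notation is just the four bytes of the 32-bit integer printed in decimal, which is why 169.229.60.83 and 0xA9E53C53 are the same address.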
Notice there are no ports in this; we'll get to ports in a second, but IP can only go from machine to machine, not from application to application.

So what's a wide area network? It has many of these physical domains. The internet is a wide area network: it connects multiple physical data-link layers with routers. What goes on inside a subnet is kind of up to the owner of those devices, so even though I show host A entering this domain and going through router R to the destination, it's possible there are other hops inside, handled by the owner of the domain, either via MAC addresses or something else. The data-link-layer networks are connected by routers, as I mentioned.

Let's say more about routers. A router forwards each packet received on an incoming link to an outgoing link. Here the router is circled, and the point is that if a packet needs to go from point A to point B through a router, then when the packet arrives, the router has to know what the next hop is. The router is a highly optimized hardware/software device that takes packets coming in off the network on, say, a 10-gigabit or 100-gigabit link and, if it's owned by Comcast or some other service provider, can often pull the packet in, look at the header, figure out the next port, and send it on its way at line speed, keeping things flowing at 1, 10, or 100 gigabits, whatever the link is. Here's an example of packet forwarding: host A is talking to host B, and on receiving a packet, the router figures out how to forward it, meaning what gets it closest to the destination. If it doesn't know anything about how to get the packet closer to the destination, it might send it to a default route, which hopefully has more information. Here's an example of the packet going through; everybody catch that?

All right, so what about IP addresses versus MAC addresses? Why not route everything by those 48-bit MAC addresses? The answer is that it doesn't scale that well. The analogy here is that a MAC address is kind of like a unique Social Security number that everybody has, and an IP address is kind of like your current home address. The nice thing about your current home address is that it's hierarchical: it's in some state, in some city in that state, in some part of that city, and so on, so you can do hierarchical routing to home addresses. Hierarchical routing to Social Security numbers isn't doable, because each Social Security number is assigned uniquely to a person, based on nothing to do with locality. A MAC address is like that: it's uniquely associated with a device for the entire lifetime of the device. Your IP address, on the other hand, changes depending on where you are: when you're in Soda Hall your laptop gets one IP address, in the dorms a different one, and back home a different one again, because by and large (not exclusively, but by and large) IP addresses relate to physical locations. If we move, our address changes to something new, and that makes routing easier.

So why does packet forwarding use IP, and why does it scale better? We just said it: IP addresses are aggregated and hierarchical. All IP addresses at UC Berkeley might start with 0xA9E5; in reality, 128.32.x.x and 169.229.x.x are the two ranges that represent UC Berkeley addresses, and it's that aggregation that helps the routing.
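The "forward on the most specific thing you know, else the default route" behavior can be sketched as a longest-prefix match over a toy forwarding table. The port numbers here are invented for illustration; real routers do this lookup in specialized hardware at line speed.

```python
import ipaddress

# A toy forwarding table: prefix -> outgoing port, plus a default route.
TABLE = {
    ipaddress.ip_network("128.32.0.0/16"):   1,  # a UC Berkeley range
    ipaddress.ip_network("169.229.0.0/16"):  1,  # the other UC Berkeley range
    ipaddress.ip_network("128.32.131.0/24"): 2,  # a more specific subnet
    ipaddress.ip_network("0.0.0.0/0"):       0,  # default route
}

def next_hop(dst):
    """Forward on the longest (most specific) matching prefix,
    falling back to the default route, which matches everything."""
    addr = ipaddress.ip_address(dst)
    matches = [net for net in TABLE if addr in net]
    return TABLE[max(matches, key=lambda net: net.prefixlen)]

print(next_hop("128.32.131.7"))   # most specific /24 match: port 2
print(next_hop("169.229.60.83"))  # /16 match: port 1
print(next_hop("8.8.8.8"))        # no specific match: default route, port 0
```

Aggregation is what keeps the table small: one /16 entry covers 65,536 hosts, so the router never needs a per-host (or per-MAC) entry.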
All right, I think I've said enough on that. And yes, somebody has noticed that the first few digits of a Social Security number were originally based roughly on where you were born; but people move around a lot, and the numbers are, I think, in a position to be recycled now, so there is no actual locality left as a result.

So how do you set up the routing tables? The internet has no centralized state: no single machine knows the entire topology of the internet. In fact, it's fascinating to read books on the topology of the internet, because the internet is a whole series of loosely collaborating administrative domains that have agreements between each other, and there are crossing points where certain groups of IP addresses will route quickly while others route more slowly, all based on agreements. So no single machine knows the topology, the topology is always changing, and there are faults and reconfigurations and so on. You really need a dynamic algorithm that somehow acquires the routing tables, so that we can even figure out how to get a packet onto its next hop, and there are many possible algorithms you could imagine. The one that's common now between domains is called BGP. Roughly, the routing table has a cost for each entry that reflects how many hops it will take to reach a certain destination address, and there's some optimization over hops: neighbors periodically exchange routing tables to try to optimize for cost. The problem is that the algorithm that tries to find the globally fewest number of hops scales as n squared, so that's not generally what's done in the internet; instead, whole groups of addresses are handled at different scales, and so on. So your path from point A to point B is certainly not optimal in the internet, and in fact sometimes there are loops: packets on their way to their destination can get routed in a circle, and they would loop around forever if it weren't for the time-to-live field that keeps getting decremented until they time out and are discarded. There have also been some pretty interesting disasters: back in, I think, the early 2000s, there was a single tunnel carrying fiber that effectively partitioned the internet, with one side of the internet on one side of that fiber and the other side on the other, and a truck fire in the tunnel took out the ability to communicate across the internet until they fixed it. There's a lot more redundancy now, but this is a pretty chaotic process; it's fascinating, and a good reason to take CS 168, where I'm sure they talk about internet routing.
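The "neighbors periodically exchange routing tables" idea is essentially a Bellman-Ford-style relaxation. Here's a toy sketch of one exchange round, using hop counts as the cost; the node names and tables are invented for illustration, and real protocols like BGP carry far richer policy information.

```python
def merge_routes(my_table, neighbor, neighbor_table, link_cost=1):
    """One round of the neighbor exchange: adopt the neighbor's route
    to a destination whenever going through that neighbor is cheaper
    than the route we currently have (or we have none)."""
    updated = dict(my_table)
    for dest, cost in neighbor_table.items():
        via_neighbor = cost + link_cost
        if dest not in updated or via_neighbor < updated[dest][0]:
            updated[dest] = (via_neighbor, neighbor)  # (hops, next hop)
    return updated

mine = {"A": (0, "A")}                      # I am node A
from_b = {"A": 1, "B": 0, "C": 1, "D": 2}   # B's advertised hop counts
print(merge_routes(mine, "B", from_b))
# A keeps its own zero-cost route; B, C, D become reachable via B
# in 1, 2, and 3 hops respectively.
```

Repeating these exchanges propagates reachability information outward one hop per round, which is also why it converges slowly and scales poorly when done naively across the whole internet.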
The other thing I want to talk about is that these IP addresses, either 32-bit IPv4 addresses or 128-bit IPv6 addresses, are really not ones you can remember easily. Since we've got humans in the picture, we need to go from a name to an IP address somehow. How do we do this? We want to map something like www.berkeley.edu to 128.32.139.48, or google.com to, believe it or not, the closest Google facility that can service you. These human-readable names need to map to IP addresses so the underlying system can route them. How do you do that? You need a system that translates human names to IP addresses. This is necessary partly because humans have trouble remembering IP addresses (unless they're particularly attached to them), and partly because IP addresses change: if one server crashes and an alternative comes up, you'd like the name humans are using to automatically switch over to the new one, so they don't even have to know there was a failure. The mechanism for this, as many of you know, is called the Domain Name System (DNS), and it's hierarchical. There's a top level of the hierarchy managed by a centralized organization, and then for something like berkeley.edu, the next level down is edu, then berkeley.edu, then eecs.berkeley.edu, etc. It's hierarchical, and organizations own parts of the hierarchy; the top level is managed by a really big global organization. So it's a hierarchical mechanism for naming; names are divided into domains right to left, as I mentioned: start with edu, then berkeley.edu, then eecs.berkeley.edu.

Resolution is a series of queries. When I'm attached at Berkeley and want to get to mit.edu, I might first see whether my local cache has an address; if it doesn't, I work my way up the hierarchy to the edu domain, which will tell me where mit.edu is, and either send the full resolution back to me or tell me how to get to mit.edu, which then has the server I'm really interested in. And there's caching, because this is expensive, as you can imagine. What's interesting is that the caching is only loosely consistent: it takes some time for cache entries to time out, so if you make a query and get one answer and then something changes, you don't always see the new answer quickly. That's one of the reasons DNS is not great when you have items that move rapidly and change their IP addresses with some frequency; you need something different for that, and perhaps we'll talk a little about it in some of our remaining lectures.

How important is correct resolution? Notice that when I'm trying to get to a particular server like www.berkeley.edu, I need to know that it's 169.229.131.81, and I want to learn that in a way that a malicious party can't subvert with a wrong answer, because that could at minimum deny me service, and it could also be a security hole if I don't notice that the server is not the one I thought I was talking to. So how important is correct resolution? Very. Get somebody to route to your server thinking they're routing to a different server, get them to "log into their bank," and they give up their username and password. Now, of course, one of the ways banks prevent this is by having certificates, but certificates can also be faked under some circumstances, so an incorrect DNS resolution, complete with a breached top-level certificate, can lead you to route to the wrong place and give up your username and password if the wrong circumstances line up. So you might ask: is DNS secure? It's definitely a weak link in this whole process, because you think you're talking to one thing and you're actually talking to something else.
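You can watch resolution happen from Python: socket.getaddrinfo consults the local cache and, on a miss, walks the DNS hierarchy via your configured resolver. A minimal sketch; the addresses you get back depend on where and when you ask, which is exactly the loose consistency just described.

```python
import socket

def resolve(name):
    """Return the IPv4 addresses the local resolver currently reports
    for a human-readable name."""
    infos = socket.getaddrinfo(name, None, family=socket.AF_INET)
    return sorted({info[4][0] for info in infos})

# localhost resolves locally without touching the network; a name like
# www.berkeley.edu goes through the cache and the DNS hierarchy the same way.
print(resolve("localhost"))   # typically ['127.0.0.1']
```

Running the same query for a busy name from different networks, or before and after a record change, is a quick way to see the caching and time-to-live behavior for yourself.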
And the answer is that DNS has not always been secure. What was interesting is that in July 2008 a hole in DNS was located; the security researcher who discovered it quickly informed a bunch of authorities before it was published at a conference, and it was a very high-profile problem. Basically, because DNS responses weren't properly authenticated, it was possible for one node to send a query to, say, a top-level DNS server and have somebody else quickly come in and give a different answer; it wasn't noticed that the party answering wasn't the one we asked the question of, and you could actually pollute the DNS caches of a whole ISP in one fell swoop. As a result, this was a pretty serious bug. So DNS is definitely a weak link, and it has had many upgrades over the years.

So now, moving on: we need layering in our network, which means building complex services from simpler ones. The physical link layer is pretty limited. If you look at what can go in an Ethernet or Wi-Fi frame, there's a maximum transfer unit (MTU) size, often around 1500 bytes, and across slow links the MTU can get small. So packets actually have to be fragmented into small pieces and reassembled in order to let us do anything large. Our goal in the next few slides, which we'll also pick up next time, is to go from the physical reality of the networks to the abstraction we really want: from packets, which are limited, to messages, which are potentially unlimited. This is kind of like the virtual machine abstraction we talked about at the very beginning of the class. Packets are of limited size, but we'd like arbitrary-size communication.
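The fragment-and-reassemble step can be sketched directly: tag each MTU-sized piece with a sequence number so the receiver can rebuild the message even if pieces arrive out of order. A toy version; real IP fragmentation uses byte offsets and header flags instead of simple sequence numbers.

```python
MTU = 1500  # a typical Ethernet maximum transfer unit, in bytes

def fragment(message, mtu=MTU):
    """Split an arbitrarily large message into numbered, MTU-sized
    fragments so each one fits in a single physical packet."""
    return [(seq, message[i:i + mtu])
            for seq, i in enumerate(range(0, len(message), mtu))]

def reassemble(fragments):
    """Rebuild the message; sorting by sequence number means the
    fragments may arrive in any order."""
    return b"".join(data for _, data in sorted(fragments))

msg = b"x" * 4000                 # larger than one MTU
frags = fragment(msg)
print(len(frags))                 # 3 fragments: 1500 + 1500 + 1000 bytes
assert reassemble(reversed(frags)) == msg   # arrival order doesn't matter
```

This is one small piece of the packets-to-messages abstraction; the reliability and flow-control pieces come with TCP next time.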
the time they can be reordered we'd like ordered messages packets may be unreliable and lost we'd like reliable ones packets basically communication is machine to machine we would like it to be processed the process instead packets might be only on a local area network so using the MAC addresses whereas what we'd like to is route them anywhere they might be in a synchronous because they're just sort of being sent when the hardware is ready where perhaps we want something synchronous where we can do some synchronizing on it packets might be insecure we want secure ones and so this is basically an abstraction process of giving us a better communication mechanism than what the hardware gives us okay so that's a theme that we've had throughout the term so process to process communication is a good one to start with so you know machines have an IP address and so that's a machine to machine communication what we really want is routing from process to process which you know process on machine a to process on machine b and the way we do that as we've talked about earlier is by adding something in addition to the IP address we're going to add ports and so basically a communication channel which we have mentioned is actually a five tuple of source address and source port that tells us what application we're talking to at the source side destination address destination port that tells us the application at the destination side and then a protocol which tells us sort of what level of transport protocol are we using and the protocol what those protocols are things like tcp or udp etc and just to see the simplest example of a protocol this is ip protocol 17 and remember the protocol field if you were to look back at the header earlier is a is an 8-bit field so when we fill that 8 bits with 17 the number 17 that's going to be up in these 20 bytes then we've got a datagram and in addition to that we add a new header we wrap a new header on this which has a source and destination 
That port pair lets UDP go from an application on one side to an application on the other. There are some additional fields, like a length for the UDP data and a checksum and so on, but it's a very, very simple protocol: an unreliable datagram from application to application. It's often used for very high-bandwidth video streams and the like, but you can be very antisocial in your use of UDP; if you send too much, you fill up your network. It has none of the well-behaved aspects of TCP, which we'll talk about next time.
So, to finish this out, if you can bear with me, I have a couple more slides I want to make sure to get through for the day. Process-to-process delivery is technically a layer-four, or transport-layer, thing. Look at what happens: we start with our data and begin wrapping headers. The data gets a transport header, like the UDP header, which adds a port; then we wrap a network header, which gives us the IP address of the destination; and then we wrap a MAC address on top of that, which might be an Ethernet address, for instance. This goes down through several different layers in the operating system to the physical layer, where the data is actually transmitted, and then it comes back up at the other side and we start unwrapping. The data link layer only delivers the frame to a node whose MAC address matches the desired destination; the frame comes in, we strip the frame header (the data-link-layer header) off and bring it up to the networking layer. The networking layer checks the address; this could be a router, for instance, in which case we see that this isn't the local IP address and forward the packet back down through a different data link layer out another port. But if this is the IP address of the local node, we forward it up to the transport layer, which grabs the port and further demultiplexes by forwarding it up to an application. This idea of wrapping headers on the way down and unwrapping them on the way up is a common theme in all of the layering you're going to run into.
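The wrap-and-unwrap pattern can be sketched in a few lines. This is a toy illustration with made-up header formats (real IP and UDP headers have more fields than this), just to show each layer prepending its own header on the way down and stripping it on the way back up:

```python
import struct

# Toy "transport" header: just source and destination ports (16 bits each).
def wrap_transport(payload, src_port, dst_port):
    return struct.pack("!HH", src_port, dst_port) + payload

# Toy "network" header: just source and destination addresses (4 bytes each).
def wrap_network(segment, src_ip, dst_ip):
    return struct.pack("!4s4s", src_ip, dst_ip) + segment

def unwrap_network(packet):
    src_ip, dst_ip = struct.unpack("!4s4s", packet[:8])
    return dst_ip, packet[8:]          # network layer checks dst_ip, passes the rest up

def unwrap_transport(segment):
    src_port, dst_port = struct.unpack("!HH", segment[:4])
    return dst_port, segment[4:]       # transport layer demultiplexes on dst_port

# Sender: wrap on the way down the stack.
pkt = wrap_network(wrap_transport(b"hi", 5000, 6000),
                   b"\x0a\x00\x00\x01", b"\x0a\x00\x00\x02")

# Receiver: unwrap on the way back up.
dst_ip, segment = unwrap_network(pkt)
dst_port, data = unwrap_transport(segment)
print(dst_port, data)   # 6000 b'hi'
```

A data-link header would wrap the outside of `pkt` the same way, and a router would run only the unwrap_network step before forwarding.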
There are many transport protocols. We just talked about UDP, which is considered best-effort and is IP protocol 17. Protocol 6 is a pretty common one that you're well familiar with, called TCP, which offers a bunch more semantics than UDP: it lets us set up and tear down connections, discards corrupted packets, retransmits lost packets, and gives us flow control and congestion control. That really means that if we use TCP across the planet, for instance, flow control and congestion control will make us good citizens, and we won't use more than our fair share of the network links. So that's a nice property of TCP. There are a bunch of other examples you may not have heard of: DCCP, the Datagram Congestion Control Protocol; RDP, the Reliable Datagram Protocol; SCTP, the Stream Control Transmission Protocol. What's interesting about SCTP, for instance, is that it's like TCP but can carry a bunch of different streams simultaneously over one connection. Now, the transport protocols don't provide everything; a bunch of services are left up to applications. So when we get into things built on top of, for instance, UDP and TCP, like distributed storage or peer-to-peer storage, we'll be able to provide things like bandwidth guarantees or surviving a change of IP address and so on. The problem we're going to solve next time is the reliable message delivery problem: how do we get reliable delivery out of unreliable packets? We'll pick that up next time.
So, just to finish up for today: in conclusion, we talked about two-phase commit as a distributed decision-making protocol. First, you make sure that everybody guarantees they will commit if asked (or that they won't), and then everybody is asked to commit.
Through these two phases, we get "either everybody commits or everybody aborts" semantics, as long as we allow nodes to reboot and recover along the way. We also talked about the Byzantine generals problem, which is distributed decision-making with malicious failures: one general, n minus 1 lieutenants, and some number f of them may be malicious. Here "malicious" pretty much means they can do anything they want, and that can include looking correct whenever they're probed while still behaving incorrectly. What we saw is that the problem is only solvable as long as the number of nodes n is greater than or equal to 3f + 1. We also talked about how blockchain protocols can be used for distributed decision-making as well. Then we started talking about IP, which is datagram packet delivery used to route messages through routers across the globe: 32-bit addresses for IPv4 and 16-bit ports. We talked about DNS, the system for mapping from names to IP addresses; the flaws that have been discovered are problematic, and they've been continuously fixed as they show up. And we started talking about how to get good semantics; next time we're going to talk about ordering and reliability. Okay, we'll finish at this point. I hope you all have a good Wednesday, and we'll see you on Thursday.
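A quick footnote on the sizes mentioned in that recap, checkable from Python's standard library (the address used here is an arbitrary example):

```python
import socket

# An IPv4 address packs into exactly 32 bits (4 bytes).
assert len(socket.inet_aton("128.32.0.1")) == 4

# Ports are 16-bit values (0..65535), and IP's 8-bit protocol field uses the
# standard assignments: TCP is protocol 6, UDP is protocol 17.
print(socket.IPPROTO_TCP, socket.IPPROTO_UDP)   # 6 17
```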