Okay, so let's get started. Alright, can everyone hear me? Okay, so this is lecture 20 of Computer Science 162. Today we're going to learn about why systems fail and some of the techniques we can use, at the hardware level and at the software level, to deal with those failures. So, goals: the first goal is to define what fault tolerance is. The second goal is to talk about some of the causes of system failures. Then we'll look at a variety of approaches that work at the hardware level and the software level, we'll look at what happens in data centers and the cloud, and finally we'll look at geographic diversity as a technique. Now, Leslie Lamport is a famous computer scientist, and he has a quote that's one of my favorites: you know you have a distributed system when the crash of a computer you've never heard of stops you from getting any work done. This happens to all of us every day. You go to look at something on Facebook and you get an error. Why is the error occurring? It could be a DNS issue, a networking issue, a problem at Facebook — many different things could have gone wrong. Okay, so to talk about fault tolerance we have to talk about three properties: reliability (or integrity), availability, and security. Security is the next two lectures; that's why that piece of the pie is pulled out. Reliability, or integrity, means your system does the right thing. So if I go to withdraw $50 from my bank account, the system correctly debits my account for $50 and gives me $50. Incorrect would be if it debits $50 and nothing comes out of the machine. In order to have reliability and integrity we need a very large mean time between failures — the average time from when something starts working to when it's no longer working. Availability means: does it work right now? The system might be perfectly capable of correctly debiting $50 and handing me $50, but if the ATM isn't working, it's not very available. Availability is about right now, while I'm standing in front of the machine. For that we need a small mean time to repair — when it breaks, how quickly can you fix it — and availability is the mean time between failures divided by the mean time between failures plus the mean time to repair. So we want a very large mean time between failures and a very small mean time to repair. Now, it gets more complicated in reality. If I'm just sitting in front of my own computer I can ask that reliability question and that availability question directly. But when I walk up to an ATM, that ATM may be just one of thousands or tens of thousands or hundreds of thousands of terminals that the bank has. This gets into the question of system availability. What does it mean if 90% of my ATMs are up and 99% of my database is up? Does that mean 89% of the transactions are serviced on time? Not quite, right? You've probably all encountered this. You call the bank on a Sunday because you need to ask about your credit card bill or something, and they tell you, oh, I'm sorry, the database for your account is down for servicing. Everybody else can use the bank, but you can't. From your point of view that's unavailability, but the bank would probably say no, we're up and available, because 90% of our customers — or 99%, if it's the database — can do something right now. So this is where it gets tricky, and people will play games with the numbers.
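To make that relationship concrete, here's a small back-of-the-envelope sketch in Python — the ATM numbers are made up for illustration, not something from the lecture slides:

# Availability = MTBF / (MTBF + MTTR): we want a very large mean time
# between failures and a very small mean time to repair.
HOURS_PER_YEAR = 24 * 365

def availability(mtbf_hours, mttr_hours):
    return mtbf_hours / (mtbf_hours + mttr_hours)

def downtime_hours_per_year(avail):
    return (1 - avail) * HOURS_PER_YEAR

# A hypothetical ATM that fails about once a month (720 hours) and takes
# 4 hours to repair is only ~99.45% available: roughly 48 hours/year down.
a = availability(720, 4)
print(a, downtime_hours_per_year(a))
# Halving the repair time buys as much as doubling the time between failures:
print(availability(720, 2), availability(1440, 4))   # both ~0.9972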
Google will have an outage, Apple will have an outage, Microsoft will have an outage, but they'll say it only affected 0.04% of our customers. If you're one of those customers who doesn't have email for two days, it's an outage; for the other 99.96% of customers it's not. That's how people typically play games with fault tolerance numbers. Okay, so mean time to repair is something we need to talk about. This is really critical, because everybody always thinks about the mean time between failures: you buy a disk drive, what's the mean time between failures of the disk drive? Buy a computer, what's the mean time between failures? But the mean time to repair is equally important, as we saw with availability. So, the mean time to repair: when you have an outage, first you have to realize that it's broken. That realization can come because angry customers start emailing you or tweeting about your service — probably not a good way to discover it. You want good logging and good monitoring within your system to detect that something's gone wrong. Then you have to actually diagnose it, and here's where documentation comes into play, and where really good operations people come into play. One of the reasons we push you to write documentation is that when you graduate from here and you're writing code in the real world, you're likely not going to be the person dealing with that code on an operational basis. Some other team is going to be dealing with it, and they're not the uber hackers that you are, so they need to understand very quickly, in the middle of the night, what's gone wrong with your code — because otherwise they're going to call you, wake you up, and make you come in. If you look at outages, especially in the cloud, a lot of them occur during changes: the operations team is making a change based on code developed by the engineers, and things go wrong. Without good documentation it can take a long time to diagnose what's wrong. Then you need to decide what you're going to do. We talked about databases: Salesforce is a company that maintains a single physical database with a single schema, and on top of that they put all the logical customers — hundreds of thousands of them. Periodically they decide to change that physical schema, and now they have to take terabytes of customer data and convert it from one database schema to another. It's a really scary process, because if something fails in the middle of it, they're faced with either pushing forward and correcting whatever the error is, or going back to their backups, in which case it takes several days to come back up. So having a very good process for making that decision — are we going to push forward and continue the migration, or are we going to take a multi-day outage while we restore — is important. And again, there are lots of examples of cloud outages where the time to decide what to do ended up costing quite a bit of the total outage time. And then you actually have to act: once you've come up with a plan for the repair, you have to carry it out.
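To make the "detect" step a bit more concrete, here's a minimal sketch of the kind of monitoring loop this argues for — the health-check URL, thresholds, and alerting are all hypothetical:

# A minimal "detect" step: poll a health endpoint and alert after several
# consecutive failures, instead of waiting for angry customers to tweet.
import time
import urllib.request

SERVICE_URL = "http://example.com/healthz"   # hypothetical endpoint

def healthy(url, timeout=2.0):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

def monitor(poll_seconds=30, failures_before_alert=3):
    consecutive = 0
    while True:
        if healthy(SERVICE_URL):
            consecutive = 0
        else:
            consecutive += 1
            if consecutive >= failures_before_alert:
                # A real system would page the on-call engineer here.
                print("ALERT: service unhealthy -- start diagnosing")
        time.sleep(poll_seconds)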
Okay, so — now that we've entered hurricane season this is probably quite apropos — there's a big difference between fault tolerance, where you might be trying to mask local faults within a system, and disaster tolerance. You can mask local faults with redundant hardware, with a redundant array of inexpensive disks, with uninterruptible power supplies (big batteries that can run you for 15 minutes or more), or by failing over between multiple clusters within a data center. There's a big difference between that kind of fault tolerance and what happens when Hurricane Sandy hits and your data center floods or gets washed away. Many of the companies based in New York City had a big problem: their data center might have been up on the 20th floor, but their generators were in the basement, and when the water came in and flooded, the power went out. In some cases their generators were actually up on the roof, but the fuel tanks were in the basement, so they had little bucket brigades going up and down the stairs trying to feed the generators. So disaster tolerance is what masks an entire site failure: a fire, a flood, sabotage — or, most recently, when we had a power outage on campus, it took down all of our data centers, but there was still a website you could go to to find out what was going on, emergency.berkeley.edu, because that website isn't hosted on a machine on campus. It's hosted in a remote data center, specifically because the last time we had a campus-wide power outage there was no way to find out what was going on. The other thing you can do is use design diversity. This is a technique where, for example, you might use different versions of Linux, so that if there's a particular bug or fault that affects one version, it won't take out your whole deployment — it'll just take down some of your machines. Alright, those are some of the base definitions. Another very important definition is what we mean by system availability. When I talk about system availability I mean that a customer can use the system — it doesn't necessarily mean that all customers can use the system. This is divided into availability classes, called "nines," based on the number of nines in the percent-availability column. The first one is one nine: 90% of the time your system is available. That seems like a really good number — 90% is a lot. But let's look at it the other way around: how much unavailability does that translate into? 90% on a yearly basis translates into roughly 36 days of downtime a year — more than a month. This is your bank saying "our systems are down for maintenance" roughly two-thirds of the Sundays in a year, so you can't do anything with your account on those Sundays. It translates into 72 hours a month, and about two business days a week. That's a lot. Question? That's a very important question — the question is about the difference between scheduled and unscheduled downtime: do you always use your scheduled maintenance window?
Absolutely — everybody plays the game of not counting scheduled maintenance as part of their unavailability, because it's scheduled: it's not a bug if it's documented. The second part of this is that it's good to have a regular maintenance window even if you don't actually use it, because that way you avoid unplanned, unscheduled outages. Okay, so two nines sounds like it would be a lot better, and it is: now we're down to three and two-thirds days per year, about seven hours a month, and an hour and two-thirds per week. We're getting better. Add another nine and we're down to a little over a business day per year, 43 minutes a month, and 10 minutes per week, and it just keeps getting better. By the time we get to six nines we're at 32 seconds of downtime for an entire year, about three seconds in a month, and two-thirds of a second in a week. Everyone would love to be there — the goal is class six — but in practice they don't quite hit it. Google's Gmail, which is our campus email provider, and Microsoft's hosted Exchange promise three nines for unscheduled outages; they'll occasionally have scheduled outages where they give you advance warning that there will be downtime during some period. Do they actually hit that? Yeah, pretty much: Gmail in 2010 hit 99.98%, so they almost got to four nines; Microsoft doesn't publish an exact number but says it was over 99.9%. That's not a lot of outage time, but even an eight-hour outage is going to make the national news for a company like Microsoft or Google. And again, the way to think about this is that unavailability is proportional to the mean time to repair divided by the mean time between failures plus the mean time to repair. So if we can reduce the time it takes to recover from a failure, or increase the time between failures, we'll drive down the unavailability. So what causes systems to be unavailable? Lots of factors. This is from a report that analyzed enterprise IT systems, and they found a bunch of different things. Lack of best practices for things like change control — again, this is the reason we want you to use version control like CVS: you want to be able to go back and see why your code was working at 2am and now isn't working at 4am when you can't remember what happened in the last two hours; with version control you can look and see exactly what changed. Lack of monitoring — the largest telephone outage in the United States, about 15 years ago, at a time when there were roughly 180 million long-distance telephone calls a day, was caused by lack of monitoring of a phone switch at AT&T that had gone onto battery backup, and a ripple effect then took down all of AT&T's long-distance network; so monitoring is critical. Requirements and procurement. Operations — we'll look at a few examples of operations gone wrong. External failures in the network, application failures, and external services — any time you depend on a third party it's good to have two providers, because if one goes down you can switch to the other. The physical environment; network redundancy; and backup — both the technical solution you use for backup and the actual process you use for backup. How many people here back up their computer? I'm going to show you a graph in a little while that will scare you if you don't. And how many people who back up actually test their backups periodically by restoring a file? It's not good just to have a backup, because you might be recording garbage — and you laugh, but I've seen many cases where someone went to restore a file from the backup and found the backups were corrupted, and companies have had this problem too.
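As a tiny illustration of "restore a file and check it," here's a self-contained sketch; the file names are made up, and a real check would restore from the real backup system rather than a local copy:

# Don't just take backups -- periodically restore a file and check it.
import hashlib
import os
import shutil
import tempfile

def sha256_of(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()

workdir = tempfile.mkdtemp()
original = os.path.join(workdir, "important.db")      # stand-in for real data
restored = os.path.join(workdir, "important.db.bak")  # stand-in for a restore

with open(original, "wb") as f:
    f.write(b"customer records\n")

shutil.copyfile(original, restored)                   # "take the backup"
assert sha256_of(original) == sha256_of(restored)     # "restore and verify"
print("backup verified")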
So not only is it important to back up — and if you're saying, well, I just keep everything in Box or I keep everything in Dropbox, you want multiple copies: if you're going to store it in your Berkeley Box account, also store it in a Dropbox account at the same time. Then there's physical location and infrastructure redundancy, and storage architecture redundancy is another big problem. So, Jim Gray, when he was at Tandem — Tandem NonStop computers, which have since been purchased by HP — did a survey of failures in Tandem systems. These are the computers that power NASDAQ and the New York Stock Exchange and some large financial institutions, and also airlines and places like that. These are businesses where every minute of downtime is millions to hundreds of millions of dollars in lost revenue, and also lost confidence: when NASDAQ has a computer problem, like when Facebook IPO'd, or the problem they had a year later, everybody hears about it, and that makes people rethink whether they should be on NASDAQ or on the New York Stock Exchange or one of the other markets. So any kind of downtime in these systems is absolutely unacceptable. In 1985 he looked at their uptime, and here's a graph, for three different years, of the mean time between failures in years. Most of these numbers look really good except the total: the system as a whole had a mean time between failures of 8 years. Now, 8 years seems like a huge number, but not when you're a market that moves hundreds of billions of dollars a day — that's not acceptable. You want more like a 10-, 20-, 30-year mean time between failures. The biggest cause was software — the operating system software — and second was hardware. So what did they do? They doubled down: they went through all of the operating system software and made it as bulletproof as possible — doesn't that sound familiar from Project 2, bulletproof your OS? There's a reason we use this as an example. And they did the same on the hardware, spending ridiculous amounts: why build one when you can build two, or three, or four? Don't have one CPU running; have four CPUs doing the exact same thing. It's worth it to these customers to have everything be totally bulletproof, because downtime is not acceptable. Two years later, look at this: they had more than doubled, almost tripled, the overall system mean time between failures by looking at every single thing. They beefed up the environment — better power conditioning, better UPS, better generators — and improved operations; I'll go over a classic example from Jim Gray of an operations failure in a moment. Preventive maintenance is really good, but you need really good logging, and analysis of those logs, to do preventive maintenance and prevent outages. What did the stock market do right before Twitter went public? They did a dry run, simulating high load against their systems, because they wanted to make sure they could handle the load of Twitter going public — unlike what happened when Facebook went public. So hardware became three times more reliable, and software gained more than an order of magnitude. This was great; Tandem could parade these numbers around. And then, three years later, suddenly operations had declined and software had declined.
So what do you think happened on the software side? They had gotten this code so bulletproof, and then all of a sudden it lost almost half of its reliability. What's that? Any ideas? Software is never static: you always have new features, and every time you introduce a new feature you introduce a new bug — or, more likely, a bunch of new bugs. They had spent all that time bulletproofing, but they had to freeze the operating system to do it. Once they got it bulletproofed, it was: okay, let's add our features in. They added the features, and reliability went down. So look at the entire system, and be aware of the tension you're going to face as an engineer: management will say this has to be bulletproof, it has to be super reliable — oh, and here's a list of 200 changes we want you to make by the end of next week. Yes? The question is how we get a mean time between failure for software, and an overall system mean time between failure of eight years. This comes from looking at the distributions — Gaussian distributions — of each of the component mean times between failures and computing the probability of a failure per year; the component failure rates effectively add up, which is how the system as a whole ends up at eight years even though the individual components look much better. The follow-up question is whether it would be worse, since software is going to fail every two years. Again, these are distributions: not every system will hit a software bug. Some systems might run for 10 years without hitting that particular bug, while on another machine a disk drive dies and brings the whole system down. It's not the minimum that causes the failure; it's one of these failures occurring in a particular system. Okay, question in the back? The question is how they calculate the mean time between failures, since some of these numbers are much longer than computers have even existed. There are a couple of issues. They look at the error reports coming in, and the distribution of those error reports over time, and use that to derive an estimated mean time between failures — these are all estimates derived from the observed error data, not from watching machines run for that long — and I'll give you an example of how we can do something similar today with disk drives in just a moment. The other issue is that not all errors get reported: some of this is automated, some is manual reporting on the operations side, so there's some underreporting, which means these numbers are a little higher than they may be in reality. So, one case that Jim Gray liked to use was a very classic operations failure. I mentioned backups — here is a small-business network-attached storage device. It has four drive bays, you can put 2- or 3-terabyte drives in them, and it uses RAID. You have one of these in your business, and it pops a console alert, or you get an email message, saying "RAID drive 1 has failed, replace immediately." We want to replace it immediately because of mean time to repair: if we suffer another drive failure before we're able to replace this drive, we'll lose all our data, because these four-drive RAID arrays — RAID 5, for example — can only tolerate a single drive failure. So the operations guy comes along with his little tray of drives, he pulls a drive, and the array dies. So what went wrong here? It's not that there wasn't a correct procedure — the procedure for fixing it in this case is: you pull the failed drive, you take it out of the tray, you put the
new drive in the tray, you pop it back in, and the array rebuilds. I actually have exactly one of these arrays, and I do this every couple of months or years when a drive fails. So he did that. One guess: maybe it wasn't drive 1 that failed? No — it really was drive 1 that failed. Another comment: maybe disk drives tend to fail when you turn them off? But in this case you don't turn off the array — these are hot-pluggable, so you can pull the SATA drive and replace it live. Again, I've done this many times; in fact that's how I upgrade the array to larger drives: you pull one drive at a time, put in a new drive, let it rebuild onto that, then pull the next drive, and so on. You're all computer scientists in this room. What you might not be able to see, because I covered it up, is that there are four drives and four little lights: that's drive 1, that's drive 2, that's drive 3 — they're numbered one, two, three, four. He thought the drives were numbered zero, one, two, three. So when he went to pull "drive 1," he pulled the second physical drive — a healthy one — and caused the array to have a two-drive failure. This sounds really silly, but it's exactly the error Tandem was seeing in their arrays: operators were pulling the wrong drive. And this is only four drives — it gets even worse if you have a bay with 32 drives and it says drive 8 failed, so you count one, two, three, four... it's very easy to make an error even if you know the numbering starts at one and not zero. Tandem's solution: when a drive fails, a little red light comes on — here the red lights are at the top, but at Tandem they put the light right on the button for the drive, so you really could not pull the wrong one. People are human, and you have to realize that: if there's a way for them to make a silly mistake, they're going to make that silly mistake some fraction of the time, and you're going to lose all your data. We'll come back to drives in a moment. So, the fault model we're going to use, and that we typically think about with fault tolerance, is that faults are independent — so if we have single-fault tolerance, that's a huge win. Note the big giant red asterisk, because in the real world faults are often not independent: if you lose power, you're likely going to lose your air conditioning too, as we discovered at our campus data center. You have to be careful, because everything can be interdependent, and if you haven't planned for all of the contingencies you're going to run into problems. Hardware, in general, we're going to assume fails fast: if something goes wrong it blue-screens, it panics, it stops immediately — it does not sit there silently corrupting your data. Same thing with software: we want our software to fail fast. It's much better for software to abort, or hang, or stop responding, than to keep going, because if it keeps going it could be corrupting data structures along the way. Now, for software, often when something goes wrong we can simply reboot. We call those Heisenbugs — the bugs that seem to go away when you look at them. We reboot and the problem goes away. A transient memory leak is kind of a Heisenbug: things work again for a while, eventually the leak becomes a problem again, and you reboot again. Bohrbugs, named for Niels Bohr, are deterministic logic errors. When those occur, rebooting is not going to help you — it's just going to lock up again when you reboot it, so this is where you have to go get a patch, or go look at the code and fix it.
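A toy way to see that difference in code — the "flaky operation" here is made up, and the policy is simply "restart a few times, then assume it's deterministic and go fix the code":

# Fail fast, then restart: a restart clears a transient Heisenbug, but a
# deterministic Bohrbug will fail the same way every time, so cap retries.
import random

def flaky_operation():
    # Made-up stand-in for "the thing that sometimes fails".
    if random.random() < 0.3:                 # transient glitch ~30% of the time
        raise RuntimeError("transient glitch")
    return "ok"

def run_with_restarts(max_restarts=3):
    for attempt in range(max_restarts + 1):
        try:
            return flaky_operation()          # fail fast: let errors propagate
        except RuntimeError:
            print(f"attempt {attempt} failed, restarting")
    # If every restart failed, treat it as deterministic (a Bohrbug):
    # restarting again won't help, go get a patch instead.
    raise SystemExit("persistent failure -- time to fix the code")

print(run_with_restarts())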
Again, operations tasks are a major source of outages: utility operations, UPS and generator maintenance. When we had the power failure on campus, our data center has generators and a week's worth of fuel, yet they had to shut down less than an hour into the outage. Why? Because the air conditioners refused to turn on. They talked to the physical plant, and physical plant said, we're looking at our systems and your air conditioners are running fine — while they're standing next to air conditioners that are off and a room that's getting warm. When it hit a hundred-something degrees they had to pull all the plugs. A classic operations failure: there was no reason the data center should have gone down. And it's kind of amazing how often this happens: you do maintenance on your generators — you're supposed to run them once a month or so — and you turn on the generator and your entire data center shuts down; or you switch to the UPSes to test them, the UPSes don't take the load, and again everything shuts down. Upgrades and configuration changes are another huge cause of failures. If we look at hardware, this is called the bathtub curve: hardware failures are composed of three different failure rates. First there's the early, infant-mortality rate: you buy a computer, or you got a PS4 shipped to you last week, you plug it in, and you get the flashing blue light. Burn-in can help with that — that's where, at the factory, they plug it in, let it run for a while, and weed out the units that are dead on arrival. There's an interesting tradeoff, because burn-in takes a lot of time and manpower, so what most companies do now is ship it to you and let you deal with it; they try to make quality control high enough that the fraction of systems that fail is low relative to the overhead of doing burn-in. Plus, for some things you can't do burn-in at all: you can't burn in a printer, because as soon as you put a cartridge into the printer you have toner everywhere. So HP actually engineers their printers, and the process of building them, so that there's a near-100% likelihood that you'll plug the printer in at home and it will work. And even with the PS4, Sony was saying the failure rate was something like 0.04% — though of course if you're in that 0.04%, you had a bad day. Then, in the middle of the curve, there are random failures that happen more or less continuously — it's hardware, and especially if you have moving parts, things are going to break. Design errors — or build errors, rather — and poor quality control increase the number of random failures you'll have. And then as things get older you have wear-out failures, the yellow curve: as a hard drive gets older the lubricants dry out, vibration wears the bearings, and boom, things die; dirt gets in through the filter and you get head crashes. So, an online backup company, Backblaze, recently released a study: they have 25,000 drives, and they published the failure rates they see on those drives. Backblaze is kind of interesting. Most companies that offer online backup contract with Hitachi or Seagate or Western Digital and purchase enterprise-class drives, which are a lot more expensive
than the drives you get at Fry's or someplace like that. Backblaze has a different model: they "smurf" their drives — they just go to the store and buy them. So they bought 25,000 of these consumer-class drives, and everybody said that's a disaster, those drives will fail like crazy, there's no quality control on them. So it's interesting what they actually see. In the first year of a drive's lifespan there's a 5.1% annual failure rate — that's the infant-mortality part of the curve, and it runs through roughly the first 18 months. Then there's a quiescent period where they see only a 1.4% annual failure rate. And then, at the start of year 3, they see an 11-12% annual failure rate. The first thing that's most amazing is that at the end of 4 years, almost 80% of the drives had survived. So even though a consumer drive might come with a 1-year, or if you're lucky a 3-year, warranty — gone are the 5-year warranties — those drives actually last quite a long time. In fact, if that 11.8% rate continued, the 50% point would be at about 6 years: half your drives would survive 6 years. That's really good for consumer drives. Of course, the flip side is all that data you're storing on your laptop or desktop hard drive: remember that 20% of those drives failed in the first 4 years. Another reason to have backups. So, some of the traditional techniques people use for fault tolerance. Fail-fast modules: they either work or they stop. Spares, which give you very quick recovery time: with those NAS arrays, a lot of the enterprise arrays have extra drive bays that you just fill with drives that aren't being used, and if one of the active drives fails it starts rebuilding onto a spare immediately, so the window for a double drive failure is as small as possible. Process and server pairs, which we'll talk about as a way of masking hardware and software faults. And transactions, which we've already seen give us really nice ACID semantics: if something goes wrong, just abort and try again, and if it was a Heisenbug that will fix it; if the transaction aborts repeatedly, you'll need some external intervention. Fail-fast is part of what you need. Here's the life cycle of a hardware module: it's nice and happy, running normally, and then there's a fault. You have to detect when that fault occurs, and once you've detected it you have a repair process where you repair it to a working state and return it to active service. What you want is to get from the fault state back to an operational state as quickly as possible, because low unavailability gives you high availability, and the mean time to repair is critical to improving that. Again, either shortening the time to repair or increasing the time between faults gives you better availability. Now, you might think: if I've got one drive and that drive might fail, why don't I simply have two drives and store the same data on both? It turns out simple redundancy doesn't really help you that much, and it can actually hurt. So let's look at how we can make our hardware reliable. One way is to build two — why build one when for twice the money you can build two? Here we have a duplex module with an output comparator: I can have two CPUs running in lockstep and compare their outputs; if the outputs ever disagree, I know one of them has failed.
So in that case, if either one fails, I take this module out of service. We start with both working; after a time of MTBF over two — because we've got two of them, either one of which can fail — we go to one working; and as soon as we go to one working, the outputs disagree and our system goes down. Then it's another mean time between failures to get to zero working. So in this case we end up with a worse mean time between failures — it didn't get better: we've got two CPUs, and either one failing takes us down. On the other hand, if the component fails soft — which is what we can do with disk drives or network cards, because a disk drive can tell us it's failing, it can throw an ECC error or report an unrecoverable data error — then we can take that one component out of service without taking the entire system down. Now it's MTBF over two to go from two working to one, and then another MTBF to go from one working to zero, so our mean time between failures is one and a half times the MTBF of either drive. This kind of logic is called the airplane rule: a two-engine airplane is going to have twice as many engine problems as a single-engine airplane, but would you rather fly on the one-engine plane or the two-engine plane? The two-engine plane, because on the one-engine plane, if that engine fails, you're going down, whereas with two engines you can continue on. This is why the really big planes, like the 747 and the Airbus A380, have four engines: they carry a lot of people, and you want to guarantee that even with multiple engine failures you can still make it to a safe landing. One of the things they've also done on planes is increase the mean time between failures of the engines themselves. Look at the GE90 engine on a 777: it has an insanely high mean time between failures, so even as a two-engine plane it can fly very far from land, because the probability that both engines fail far from land is extremely low. Okay, so if we add repair — which of course we can't do with a plane, at least not mid-air — if we can put a failed module back into service quickly, we gain a substantial increase in the mean time between failures. Here we still have MTBF over two for transitioning from two working to one, and MTBF for transitioning from one to zero, but we also have a mean time to repair: if we can take the module that's gone out of service and bring it back before the other one fails, we have complete redundancy with no outage. As an example, with a one-year mean time between failures and a 12-hour mean time to repair, using this equation — the mean time between failures squared, over two times the mean time to repair — we end up with a mean time between failures measured in hundreds of years (with a four-hour repair time it comes out over a thousand years), because what matters is the probability of a second failure landing inside our repair interval. That's really good, and this is why having hot spares is so valuable.
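Here's the arithmetic behind that claim, as a quick sketch; the repair times are just examples, and the formula is the approximation mentioned above:

# MTBF of a fail-fast pair with repair is roughly MTBF**2 / (2 * MTTR):
# the pair only dies if the second unit fails inside the repair window.
HOURS_PER_YEAR = 24 * 365

def pair_mtbf_years(mtbf_years, mttr_hours):
    mtbf_hours = mtbf_years * HOURS_PER_YEAR
    return mtbf_hours ** 2 / (2 * mttr_hours) / HOURS_PER_YEAR

print(pair_mtbf_years(1, 12))   # ~365 years with a 12-hour repair time
print(pair_mtbf_years(1, 4))    # ~1095 years with a 4-hour repair time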
Okay, any questions? A few administrative notes: Project 3 code is due on Thursday, and we've moved the Project 4 design doc date. The due date was originally going to be Tuesday, but you have a midterm on Wednesday, and we figured that, given the number of points the midterm is worth versus the number of points the design doc is worth, it's better for you to spend more time on the midterm, so we moved it back a day, to Monday the 2nd. That really means that as soon as you finish your Project 3 code, you should be starting on Project 4 right away. Any questions? Okay, we'll take a break now. Okay, let's get started again. We can learn from some of the techniques hardware designers use and apply them in software. For example, avoidance starts with a good, correct design — again, this is why we push you to come up with a design first before you rush out and implement: it's a lot easier to understand the algorithms when you're looking at a design than when you're looking at raw code. In addition, there are lots of software fault-tolerance techniques you can use, some of which you're already using. Modularity, for example: modularity gives you isolation and fault containment, because each module can sanity-check the inputs coming in, and it can also sanity-check the outputs it's generating. Having a programming paradigm that assumes failures will be very common — MapReduce, for example — is really good, because it automatically hides those failures from you. Defensive programming: we're going to talk about security, and this becomes very important from the security standpoint — make sure you sanity-check any data that comes from a user, whether it's a local user or, especially, a remote user. All those exploits and vulnerabilities occur because someone didn't sanity-check an input that came in remotely over a network port, or on a web server, and so on. How many people have flown on an Airbus plane? If you've flown on an Airbus plane, you've used N-version programming: the software for the aileron controls on the wings of an Airbus was developed by two independent software teams from the same specification. They produced different code that does the exact same thing; the idea was that if one of the teams made a logic error, you'd be able to detect it. Auditors: this is code that checks data structures in the background. Databases typically have these — they run through and make sure the integrity constraints, or other checks, still hold. It's a way to detect when corruption has occurred, and sometimes these auditors are even capable of repairing the corruption. And finally transactions, which we've already talked about: they're an easy way to handle a failure — just abort, and the system automatically cleans up for you. Alright, now, as a programmer you might think, oh, I'm really good about doing my try and catch with exceptions. Here's a case where, given a file name, you try to create the file; if you can't create the file, you print out a message saying "unable to create the file," along with the error message. That's fine if it's your own computer and this error pops up. But imagine you're the pilot of a 787 and "could not create file" pops up on your heads-up display — probably not a good thing. You caught the error, but you didn't necessarily do anything useful with it. It certainly failed fast, but in this case that's definitely not the desired behavior. The desired behavior might be something like: recreate the missing directory, and then the operation succeeds. There are different ways to bulletproof this; catching and reporting is certainly one — it will catch the error when the directory does not exist — but at 40,000 feet you probably want to do more than simply report the problem.
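Here's a tiny sketch of the difference between catching an error and actually handling it; the log path is made up, and a real avionics system would obviously do far more than this:

# Catching an error is not the same as handling it.
import os

LOG_PATH = "/tmp/flightdata/log.txt"   # hypothetical path

def create_log_naive(path):
    try:
        with open(path, "x") as f:
            f.write("log started\n")
    except OSError as e:
        # Fails fast, but just reporting isn't useful at 40,000 feet.
        print(f"unable to create {path}: {e}")

def create_log_defensive(path):
    try:
        with open(path, "x") as f:
            f.write("log started\n")
    except FileExistsError:
        pass                           # already created: nothing to do
    except FileNotFoundError:
        # Recoverable cause: the directory is missing. Recreate it and retry.
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "x") as f:
            f.write("log started\n")

create_log_defensive(LOG_PATH)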
I gave you an example of using hardware replication at the lowest level — at the CPU level or the disk drive level. We can do the same thing on a server, in software. Instead of running one web server process on our machine, we run two: one is our primary server and the other is our backup process. This is one physical machine, with a session — there's a web browser at the other end of this — and two processes. From our primary web server we push the state information down to our backup, so we have two separate processes with a backup ready to take over; if a fault occurs, we just switch over. Now, if it's a logic error — a Bohrbug — this isn't going to fix it: there's nothing we can do, we can run the backup process but it will eventually hit the same Bohrbug and things will stop; your only solution is to get a new piece of code or switch vendors. But if it's a Heisenbug, we can simply restart the primary process, push the state information in the other direction, and we've recovered. This approach can give us very fast response time — millisecond-level recovery from a failure — and it can tolerate some hardware faults: if memory corruption or something like that caused the primary process to fail, we can tolerate that kind of fault. (There's a tiny sketch of this primary/backup handoff below.) Now, what we do within a single system we can also do across multiple systems: we can take our entire system and replicate it. The replica can be in the same data center, in a different rack, or on a different coast. A lot of banks and other institutions, and a lot of the internet companies, have multiple sites connected over the wide area — this one might be in Sunnyvale, this one might be in Virginia — and everything is replicated: the programs, the data, the processes. Logically, from the outside, it looks like one system to the customer. You don't log into Facebook East or Facebook West, even though they have multiple data centers; you log into Facebook, and behind the scenes they route you to a data center that's available. We push all the state over this link to the backup. This becomes interesting, because when these sites are far apart there's latency, and there's also cost — these wide-area links can be very expensive, and we might have to push many gigabits per second of traffic across them. But assuming we have that, if the primary fails, or we want to do maintenance on the primary, we can just switch over and the backup offers service instead. Now, if we have workloads that require more than one server, we can scale this out.
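Here's the tiny sketch of the primary/backup idea mentioned a moment ago — everything lives in one Python process with made-up session state, so it only shows the shape of the state-push and the failover, not a real transport:

# Toy process-pair: the primary handles requests and pushes its state to
# the backup after every operation; on a failure the backup takes over
# with the replicated session state.
import copy

class Server:
    def __init__(self, name):
        self.name = name
        self.state = {}            # e.g. per-session data

    def handle(self, key, value):
        self.state[key] = value

primary, backup = Server("primary"), Server("backup")

def handle_request(key, value):
    primary.handle(key, value)
    backup.state = copy.deepcopy(primary.state)   # push state to the backup

handle_request("cart", "3 items")
handle_request("user", "alice")

primary_alive = False                  # pretend a Heisenbug killed the primary
active = primary if primary_alive else backup
print(active.name, active.state)       # the backup has the same session state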
An easy way to do this with an open-source package is Apache ZooKeeper. How many people have used ZooKeeper? Oh, just one — what did you use it for? Oh, okay, nice. So what is ZooKeeper? It's an open-source Apache project that lets you take a collection of servers and pick one of them to be in charge. That sounds really simple, but it's actually really complicated: when you've got a bunch of servers, everybody wants to be the leader — so how do you pick one? It does leader election. It also does group membership, so when I turn on a new server it can join the group, and those servers can all manage shared state. You get a highly available, scalable, distributed coordination kernel, with an agreed-upon ordering for all of the operations the clients are doing and persistence guarantees for those operations. I recommend you take a look at it — it's actually a pretty easy piece of software to use; we use it in my research group. Now, when you're going to go really large, we're talking about a very big data center and very large code: programs on the scale of email, distributed calendaring, mapping applications, and your "computer" is a warehouse containing tens of thousands or hundreds of thousands of machines. These machines, however, are a lot less reliable. In the past, when we built a data center, we built it with mainframes — incredibly expensive, multi-million-dollar computers. Today we build a data center out of computers that cost less than a thousand dollars, but they're not quite as reliable as that multi-million-dollar mainframe. I don't know how well you can see this picture — it's from the Computer History Museum down in the South Bay — but this is one of Google's original racks. It looks different from a traditional rack of computers because there are no cases: these are the motherboards sitting right on the rails. It's hard to see, but the motherboards are kind of curved; the reason is these gray objects, the disk drives, which are sitting on top of the motherboards and bowing them down, and then there's all this loose wiring. These machines had incredibly low uptimes, and Google needed a way of making that work — this is what drove the creation of MapReduce, because one of these machines is going to fail during a long-running computation. So let's look really quickly at the architecture of what's inside a data center. This is one of the biggest challenges a company faces when it wants to build a 10,000- or 100,000-node data center — huge, huge engineering issues. It's also Google's big secret sauce: they never reveal how they do it, or when they do reveal it, it's after they've decommissioned that type of data center. Facebook is taking a completely different approach: they're open-sourcing it so everybody can do it. Why give up that competitive advantage? Because they believe that if they can design super-energy-efficient data centers and everybody adopts the designs, it's good for the planet — which is actually a good reason, but it is interesting that they're giving away the kind of secret sauce that gives Google a lot of its power. So let's look at what a tier-one data center looks like inside. There's N+1 redundancy — meaning if I need N of something, I buy N plus one — which is very, very expensive to build. And you have multiple communication uplinks, and they come into different parts of the building. Why?
So that if a truck hits the side of the building where the fiber comes in, it doesn't take out the entire data center, because you've got the other connection. Power is one of the hardest and most expensive parts of this. Here's what it looks like inside a tier-two data center — one level below a tier one — and power to run the data center is 40% of the data center's cost: over the lifespan of the data center, the cost of power will exceed the cost of building the data center plus the cost of buying all the computers that go into it. Power is Google's number-one cost. This is why Google is making huge investments in solar, why Apple is investing in fuel-cell technology, and why Facebook and all of these companies locate themselves near hydro plants where they can get cheap power. And when you're building the data center, 20% of the construction cost is power-redundancy infrastructure. Going back to our tier-one data center, the Miami AT&T data center: they bring in two different commercial feeds at 13.8 kilovolts, coming from two different substations belonging to two different commercial power providers, so either provider can have an outage and they still have enough power to run the data center. The feeds come into a large switchgear room — switchgear is how you switch between commercial providers, or between commercial power and your generators. If you lose commercial power for more than 7 seconds, you immediately start your generators. Generators are basically giant diesel engines — really big diesel engines; you may have seen them around campus when we had our outage — and they have a lot of fuel. It's actually kind of funny: the week after our outage, another professor and I were walking back past one of those generators, and what was there but a fuel truck pumping diesel into it. These are big: each generator produces 2.5 megawatts, and there are four of them, so that's 10 megawatts of total capacity. But they take time to come up to speed, and during that time you rely on lead-acid batteries that can provide at least 15 minutes of power. And again, it's really amazing how often, when the power goes out, the generators fail to start — all four of them — so testing is very important. All that power, from the commercial feeds and the generators, gets fed into these massive motor-generators. A motor-generator is basically just that: a motor, a massive flywheel, and a generator. The massive flywheel gives you high inertia, so you can use it to get extremely clean power out: if there are sags or dips or spikes or under-voltage, the motor-generator rides through all of them because of that inertial mass and gives you very, very clean power. All of this is incredibly expensive. There are only a few companies that make generators over a megawatt, and there's a long waiting list to buy them, because everybody's building big data centers. So companies like Google and Facebook have been looking at alternate approaches. One approach is not having generators at all: if a data center is going to go down, just shift the workload somewhere else. And instead of a big central UPS, they put a battery with each rack, and they distribute power within the data center at 277 volts and step it down at the rack, and by doing this
they're able to reduce the losses you'd have with a central UPS, and they also reduce the distribution losses, because they can convert to 48 volts and send that around locally. But that's just a taste of the challenges of building a data center from the power-engineering side; there's a lot of innovation going on in that space. Now, there are many different commercial providers offering cloud computing — and this is the great part: there's Amazon, there's Google Compute Engine, there's Azure, there's Rackspace — and they give you inexpensive virtual machines. Amazon will sell you a virtual machine for as little as one penny per hour. It's like buying a server, except you don't have to buy the server, and whenever you're done with it you turn it off; if you need more servers, you buy more servers by the hour. It's a very nice model, and competition is great — it's driving down the prices: every 6 to 8 months Amazon announces a price drop, Google Compute Engine rolls out and announces huge savings, and Amazon follows suit. Now is the best time to be graduating — it doesn't get any better if you're going to do a startup. It used to be that you had to go to the venture capitalists and say, give me 2 million dollars so I can build out a data center with several hundred computers, because I have this really cool app and I know we're going to get millions of users. The VCs would look at you and say, yeah, maybe — and then you'd build it and 100 of your friends would show up and use the app; you built a data center for nothing. VCs don't like that anymore. They love this model, because now, if a million users show up, I spin up a thousand VMs, and if those users go away — because the Digg effect or the Slashdot effect wears off — I spin those instances back down, so I can track my expenditures with my demand. Eventually you may reach a crossover point where it becomes feasible to build your own data center, but you can push that out as far as possible. Now, to make all of this work, we need a model that lets us program in an environment where we might have failures, and this is what MapReduce really made possible. MapReduce made it possible for novice programmers to work on petabytes of data per day — many tens of petabytes per day. The time to come up to speed on it at Google is only a week or two, and then you're dealing with thousands of computers and petabytes of data. When MapReduce was announced, which is now almost a decade ago, people were just shocked: 20 petabytes in one day? We don't have 20 petabytes of data in our company, let alone the ability to manipulate it. But when you're a company like Google, that's exactly what your users are generating, and for targeting ads you need to process massive amounts of data. Same thing for Facebook: they're processing many tens of petabytes a day using MapReduce, and Yahoo too. It's a very simple programming model — everybody here has written MapReduce programs — but what you may not have seen is the fault-tolerance side of it. If a task crashes, we simply run it again on a different node. If a node crashes, we relaunch the tasks that were running on that node. The fundamental assumption is that your tasks are deterministic and side-effect free — being idempotent is important. And if a task is going slowly, we simply launch a second copy of that task. Very simple.
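Here's a stripped-down sketch of that retry logic; the "tasks" are just pure functions (which is exactly why re-running them is safe), and the crash is simulated:

# MapReduce-style recovery: tasks are deterministic and side-effect free
# (idempotent), so a crashed task can simply be run again somewhere else.
import random

def map_task(chunk):
    if random.random() < 0.2:              # pretend the node crashed
        raise RuntimeError("node failure")
    return sum(chunk)

def run_with_retry(chunk, max_attempts=5):
    for attempt in range(1, max_attempts + 1):
        try:
            return map_task(chunk)         # safe to re-run: no side effects
        except RuntimeError:
            print(f"task for {chunk} failed, relaunching (attempt {attempt})")
    raise RuntimeError("too many failures")

chunks = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
print(sum(run_with_retry(c) for c in chunks))   # 45 no matter how many retries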
This is really important for getting good performance out of these large clusters, because in a large cluster I may have a machine that's starting to fail, and a machine that's starting to fail runs slowly — so I want to be able to launch the task elsewhere and have it finish faster. But there are many other distributed applications that run in these environments — web applications and services — and they have many complex moving parts and many interdependencies that are often hidden. Here's a website that's kind of fun to read, analysiscasestudyblogspot.com, with all sorts of write-ups about outages and their causes. These always occur at the worst possible time: Microsoft had an Xbox Live and Azure outage that ran from December 28th to December 31st — right after Christmas, after everybody got those Xbox 360s. And they're not the only ones, not to pick on them: Amazon had an outage on December 24th. So you've got all these little kids screaming about their presents, you're trying to keep them happy with some Netflix SpongeBob SquarePants, and Netflix is down — thousands of very angry parents. That was caused by a problem with EC2's elastic load balancers: an erroneous configuration got loaded, and then during the repair process they made it worse, not better, which made the outage longer. That's why it's very important to have good repair processes. Again, Amazon's not the only one: Google and Microsoft have had email outages caused by updates that went wrong. Dropbox had an outage whose cause was undetermined — those are the worst; you really want the exact cause determined, because you want to make sure it's not going to happen again, especially if you rely on Dropbox for some mission-critical part of your company. So one of the lessons Netflix learned from having lots of angry parents when Amazon went down is that you need to break your systems purposefully to see whether they can recover. They built this thing called the Chaos Monkey, which they've since made public. What does the Chaos Monkey do? It runs around the Amazon data center killing Netflix virtual machines. Most of the time nothing happens, because that's what's supposed to happen: a failure occurs and traffic is transparently routed somewhere else. But sometimes things go horribly wrong, and that becomes a lesson they have to go and fix — there's a tiny sketch of the idea below. Now, after the Amazon outages — for example, there was an April 2011 EC2 outage that affected a lot of startups, though Netflix had no impact — and after that big December outage, they now have the super Chaos Monkey: they've replicated across multiple Amazon data centers, and the super Chaos Monkey kills an entire data center and sees what happens. Again, most of the time nothing happens and it's completely invisible to the customer; occasionally it's visible, and that points out a single point of failure they hadn't identified.
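And here's the toy sketch of that idea — obviously the real Chaos Monkey kills actual VMs, not entries in a Python dictionary; this just shows the principle of injecting failures on purpose and checking that the service still answers:

# Toy chaos monkey: randomly kill one redundant worker and check that the
# service as a whole still answers; then let "ops" repair it.
import random

workers = {"worker-1": True, "worker-2": True, "worker-3": True}

def service_available():
    return any(workers.values())       # up as long as one worker is alive

def chaos_monkey():
    victim = random.choice(list(workers))
    workers[victim] = False
    print(f"chaos monkey killed {victim}")

def repair():
    for name in workers:
        workers[name] = True

for _ in range(5):
    chaos_monkey()
    assert service_available(), "found a single point of failure!"
    repair()
print("survived all injected failures")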
Very interestingly, they use Cassandra as their data store, and they periodically do a dump of Cassandra into Amazon's long-term storage. One of the important things they just started doing is taking one of those Cassandra dumps and also putting it in Google Compute Engine. Why? Just in case Amazon has some massive failure that somehow corrupts their long-term storage — that way they have all of their customer data somewhere else. So a very important thing to think about when you're doing a startup is your third-party providers and how many of them you have. The last thing I want to say is that you can add geographic diversity. Here's a kind of old picture, one of the few you can find, of where all of Google's data centers supposedly are, and you can see they put them on multiple continents. They do this for a variety of reasons: one is that they want to be close to the customer, but the other is that they want to tolerate things like a transatlantic fiber cut, because someone's dredging where they shouldn't be dredging, or something like that. So geographic diversity is a way to help reduce single points of failure. I put a big asterisk there and say "reduce" instead of "eliminate," because there are still services you'll find are global. For example, Google and Amazon have both had failures caused by authentication-service failures, and Google had a failure due to a load-balancer problem: they were doing an upgrade on the load balancers, it took down one load balancer, the failure rippled across and took out all the load balancers, and Google went dark for a period. Yeah, question? Do I think there will be servers on the moon? I would have said no because of the bandwidth issues, but NASA just did a study where they're using a laser to provide a very high-bandwidth connection to one of the orbiters at the moon. So, aside from the latency issues, I could easily imagine someone launching a few petabytes of storage up to the moon and then doing rsync over a laser link. It's moon storage, not quite cloud storage — but for disaster recovery, if something major happens, if a meteor strikes, it's nice to have a backup somewhere else. So, in summary: today we focused on reliability and integrity, but there's also security — even if your system is up, if hackers are running around deleting your data inside the network, or ransomware encrypts all your data and demands a Bitcoin payment, your system is not very available. We looked at how hardware and software fault tolerance can increase the mean time between failures and simultaneously reduce the mean time to repair: we want it to be longer between failures and shorter to recover from a failure. We build reliable systems today from unreliable components — this is perhaps one of the biggest changes we've seen in the last decade, the shift from expensive, reliable components as our building block to very inexpensive, unreliable components; you see it with Google and others, and you see it with Backblaze going out and buying consumer drives. You always have to assume that whatever can go wrong will go wrong — eventually, and usually at the most inopportune time. If you plan for that and design your systems around it, those events won't become outages. Still, no matter how well you design your system, there are always blind spots, and that's why I think something like the Chaos Monkey is really cool — they've open-sourced it — run it against a bunch of your VMs and see whether your service really can survive losing multiple components. Make your operations bulletproof: operations, after software, is one of the biggest sources of unavailability, and remember that it's not always uber hackers who
are your operators. So make sure that when you're going to do a configuration change or an upgrade, you do it during the middle of the day — most companies like to do it in the middle of the night, and then something goes wrong and everybody's asleep or away. And applying replication at all levels — including globally, or maybe even at the planetary scale — will help. Any questions? Okay — Eta Kappa Nu was supposed to be here today and I don't know what happened, so I guess we'll have to do our course survey on a different day. With that, I am done. Thank you.