All right, Nate, are you ready for the explainer of the Facebook outage?

Take it away.

All of Facebook's services, from WhatsApp to its building security system, went down Monday, October 4th, for six hours. You heard us talk about it briefly yesterday. Facebook VP of infrastructure Santosh Janardhan posted twice with some good details about it, explaining that a configuration change was at the root of the problem, and emphasized that, despite your best imaginations, there was no malicious activity. No user data was compromised. This was a very bad and unfortunate mistake.

Cloudflare, you may have read yesterday, posted a great explanation of what it looked like from the outside, and that's where you may have heard people talking about DNS and BGP as explanations for the outage. They contributed, but they're not the cause.

DNS is often referred to as the phone book of the internet. So when you type in a domain name like facebook.com, DNS is the system that tells your browser what server that domain points to so you can go get the webpages. If you have only one web server, that's easy. DNS looks up natesonlyserver.com, sees what IP address it points at, tells the browser, and the browser gets the machine.

However, if your network is larger, like an ISP or Facebook, you've got a more complex system. You've got multiple servers, and in these systems the Border Gateway Protocol, or BGP, works. It's often compared to a postal system. So when your browser sends its request in for facebook.com, BGP figures out just which server is best for the job. It advertises to the rest of the internet, "Here are the servers that are available." That way you get connected to a server that is near you for faster service.
That's an oversimplification, but it's kind of right in larger networks like ISPs. What Cloudflare noticed is that all of Facebook's BGP routes were withdrawn. BGP tells the internet where to find Facebook's DNS servers. That meant that once the BGP routes were withdrawn, any request for a Facebook domain returned what's called SERVFAIL. You often update BGP; maybe you had to take a data center down for maintenance. You usually don't withdraw the entire thing. SERVFAIL tells other DNS tables around the world to start updating their files to show that any server associated doesn't exist. If you take away all of Facebook's BGP, that means all of facebook.com no longer exists to the rest of the internet. That caused some automated systems to assume it was available for sale, which it was not. You could see that on the official ICANN record, but if you saw that, that's what caused it.

Facebook says the problem was caused by the system that manages communication between its hundreds of data centers around the world. It's got fiber optic cables.
It's got undersea cables. It's got high-rises full of data centers, and it's got a system that manages communication between them. On Monday, engineers were doing routine maintenance. Often that means taking down a part of the system, so they ran a routine command to assess global availability, just to make sure they had only taken down the right parts. The command for assessment, however, is the problem. Apparently it was malformed, because Facebook says that command unintentionally took down all the connections within Facebook's internal network. Facebook has an audit tool that's meant to catch these kinds of human errors, but there was a bug in the audit tool, so it didn't catch it. Double failure.

So the upshot is Facebook's data centers now had no way of talking to the internet, and that leads us to the next failure. To keep the network clear of junk, Facebook's domain name system servers, the DNS servers, disable BGP if a data center is unavailable. That way you're not telling the internet to send requests to something that isn't there. That's normally a very smart thing to do. However, in this case, all the data centers appeared unavailable because of that configuration error. So the DNS servers dutifully disabled BGP for everything, including the DNS servers themselves, meaning DNS appeared unavailable to the internet.

Data centers couldn't be accessed because of the configuration bug, and the tools you'd use to investigate that were now down because DNS was unavailable. You can't go to facebook.com/audittool when DNS is like, "There is no facebook.com," or, actually, the internet is saying, "We can't find a DNS for facebook.com." So Facebook had to send people to the actual data center.
That's why they had to do that. They had to debug the issue directly on the machines themselves. And of course Facebook, wisely I think, makes it hard to get into the data center to access its servers and physically modify them, even if you have the right to access, because how often are you going to need to do that, right? You want to make it hard so you can catch bad actors. So it took a long time to do it. It did not mean that they had to cut things open with an angle grinder, as the New York Times briefly reported and then retracted. That did not happen.

And then, even after they fixed the machines, turning something as big as Facebook back on all at once would cause huge traffic surges and power surges that could cause electrical failures. This is something Facebook had drilled on as part of its "storm" recovery plans, and it was able to bring things back a little more slowly than you might have thought, but without incident. It just took a little more time to do it right.

Janardhan finished his post by saying, "We've done extensive work hardening our systems to prevent unauthorized access, and it was interesting to see how that hardening slowed us down as we tried to recover from an outage caused not by malicious activity, but an error of our own making. I believe a trade-off like this is worth it: greatly increased day-to-day security versus a slower recovery from a hopefully rare event like this. From here on out, our job is to strengthen our testing, drills, and overall resilience to make sure events like this happen as rarely as possible."

So, Nate, there you go. That's what happened. It all seems understandable. I feel very much for that admin or that engineer who typed in the wrong command, who malformed that command.

And can I just check on your little soundboard there? Have you still got your round of applause sound?

I do.

Can you just press it quickly?
That's on behalf of everyone listening who is impressed with being able to sum up all of that in five minutes, because that was incredible, and I have nothing to add.
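
The chain of failures described in the explainer (backbone goes down, the DNS servers decide the data centers are unavailable, they stop advertising themselves over BGP, and lookups then fail) can be sketched as a toy simulation. This is only an illustration under stated assumptions, not Facebook's actual tooling: the `DnsSite` class, the `health_check` rule, the site names, and the documentation-range IP addresses are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DnsSite:
    """Hypothetical DNS server location that announces its own route via BGP."""
    name: str
    prefix: str
    advertised: bool = True

    def health_check(self, backbone_up: bool) -> None:
        # Rule from the explainer: if the data centers look unreachable,
        # stop advertising so the internet is not sent to a dead end.
        if not backbone_up:
            self.advertised = False


def resolve(domain: str, sites: list[DnsSite], records: dict[str, str]) -> str:
    """Toy resolver: it only gets an answer if some DNS site is still routable."""
    reachable = [site for site in sites if site.advertised]
    if not reachable:
        return "SERVFAIL"  # no route to any authoritative DNS server
    return records.get(domain, "NXDOMAIN")


def simulate(backbone_up: bool) -> str:
    # Made-up sites and addresses (RFC 5737 documentation ranges).
    sites = [
        DnsSite("dns-a", "192.0.2.0/24"),
        DnsSite("dns-b", "198.51.100.0/24"),
    ]
    records = {"facebook.com": "203.0.113.10"}
    for site in sites:
        site.health_check(backbone_up)
    return resolve("facebook.com", sites, records)


print(simulate(backbone_up=True))   # 203.0.113.10
print(simulate(backbone_up=False))  # SERVFAIL
```

In the sketch, one bad signal (the backbone appearing down) makes every DNS site withdraw itself at once, which is why the sensible per-site rule turns into a global outage.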