My name is Paul. I'm from HashiCorp. And here is my intro slide. This is a picture of me from my first week on the job at HashiCorp. I like this picture because you can clearly see all the layers of emotion: the disbelief, the excitement, and the fear, ultimately fear. It's all in the eyes. You may recognize some of HashiCorp's products. We make Vagrant, Packer, Serf, Consul, and Terraform.

I'd like to start today by talking about abstraction. For my money, abstraction is the single most important concept in computer science. It's a principle that underlies so much of modern technology. So what is abstraction? In the talk summary, I called it strategic lying, which is a little bit unfair. For today, let's define it like this: abstraction is the use of simplifying metaphors to manage complexity. The word comes from the Latin abstractus, the past participle of abstrahere, which means "to draw away," and that fits perfectly with our usage of it in computer science. I think of it as sort of, "Oh, don't look over there. Look over here." Or, "Don't worry about that. Just worry about this." The use of abstraction has enabled us to build some pretty awe-inspiring wonders, case in point: the internet.

So to get a feel for this concept, I thought we could quickly go through one of my favorite nerdy exercises, which is to break down some of the layers of abstraction underlying a single web request. Let's say you pull up GitHub in your browser. Most of the time, that's exactly how you think of it: I'm pulling up GitHub in my browser. And as I'm sure you know, there's a lot happening when you do that. It's all happening in such a way that you don't have to think about it. So for the next few minutes, let's think about it anyway.

There's a pretty complicated layout engine inside of your browser that determines the position, layering, shape, font, color, and behavior of each element on the page. So how does the layout engine know what to do? Well, it consumes large blobs of text that tell it what to do. The text comes in several formats that I bet most of us know all too well: HTML, CSS, and JavaScript. Along with assets like images and fonts, these all form the inputs to the layout engine, such that it can output a web page.

So we know that a browser needs blobs of text to function. Where does that text come from? Well, the browser just asks GitHub. But what does that mean? First we have to get GitHub's address. The domain name system is really vast and intricate, so that's one of the layers of abstraction that will go unexamined until a little bit later. But for now, suffice it to say that the browser asks DNS to look up the IP address of github.com and gets an answer back. Now that we know GitHub's address, how do we ask it to send us the blobs of text that we need? That's the simple text-based protocol called HTTP that, again, most of you probably know pretty well. So we know how we're going to structure the conversation with GitHub to get our blobs of text. But how do we actually get messages back and forth between our browser and GitHub? For that, we need a connection, and that's where TCP comes in. TCP establishes a connection between two nodes on the internet and allows them to exchange text. It's basically just a two-way pipe for slinging messages back and forth.
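To make those layers concrete, here's a minimal sketch I've added (it's not from the slides) of the same stack in Go: ask DNS for github.com's address, open a TCP pipe to it, and speak a little hand-written HTTP over that pipe. In real life your browser does all of this, plus TLS, for you.

```go
// Minimal illustration of DNS -> TCP -> HTTP, the layers a browser hides.
package main

import (
	"bufio"
	"fmt"
	"net"
)

func main() {
	// DNS: ask the resolver for github.com's IP addresses.
	addrs, err := net.LookupHost("github.com")
	if err != nil {
		panic(err)
	}
	fmt.Println("DNS answers:", addrs)

	// TCP: establish a two-way pipe to the first address, port 80.
	conn, err := net.Dial("tcp", net.JoinHostPort(addrs[0], "80"))
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// HTTP: a simple text-based conversation over that pipe.
	fmt.Fprint(conn, "GET / HTTP/1.1\r\nHost: github.com\r\nConnection: close\r\n\r\n")

	// Read the first line of the blobs of text coming back.
	status, err := bufio.NewReader(conn).ReadString('\n')
	if err != nil {
		panic(err)
	}
	fmt.Print("GitHub says: ", status)
}
```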
That's all well and good, but the browser's not directly connected to GitHub's servers. If I'm doing this from Chicago, where I'm from, that's actually kind of a lot of distance that needs to be covered in order to establish that connection. So that's where the Internet Protocol comes in. How do we get from Chicago to San Francisco on the internet? You can actually ask. There's this cool little tool called MTR, one of a bunch of tools like it, that displays all of the hops between one computer and another. In my case, that's actually 13 hops, with eight different possible servers in that fourth hop. We can do a GeoIP lookup on each of these and actually visualize the path that the packets are taking. The first two IPs are going to be my router and my ISP's router in Chicago. Then we head to the DC area for two hops, Tennessee for four, and then a bunch of difficult-to-pinpoint nodes somewhere in Colorado and Utah before finally arriving in San Francisco.

So how did that path get chosen? We'll take a look at the mechanism by which the path gets chosen, but if you note, when I was looking at these, they were all Comcast IPs. You can actually search for the Comcast backbone map and you will get basically that line. And there is not a good line from Chicago directly to either of the nodes in our area, so it makes a lot of sense once you realize that Comcast prefers to use Comcast pipes.

From the mechanism standpoint, it's a simple set of rules applied at each node. At each hop, a router receives a packet with a portion of our message and the target IP of GitHub's server. That router then uses a routing table to look up which of its upstream connections is the best link for that destination. It makes a decision and forwards the packet along to the next router, which does the exact same thing, and that happens over and over again, 13 times in this case, until the packet eventually gets there.

So how do those routers keep their routing tables populated? How do they know what the best next hop is? They use something called Border Gateway Protocol, or BGP. I think BGP is one of the coolest things, and it's really hard for me not to turn the rest of this talk into just a BGP party. But for today, let's just mention that BGP is a protocol that allows routers to exchange reachability information with each other. The major concepts are announce, which is "I can route to a network that I couldn't before"; withdraw, which is "I can no longer see a network I once could"; and update, which is "the path to a given network has changed." The core routers of the internet are processing literally thousands of these messages a second, and it's the thing that allows the internet as a whole to route messages around reliably even though each part of it is pretty much constantly in flux. I think it's so cool. These graphs, you're not really meant to read them, but basically the top one shows that pretty much every hour there's a peak of over a thousand messages per second, and the bottom one shows the shape of the overall global BGP table changing over the last seven days.

So BGP is what allows each router to know what decisions it needs to make about where to route packets, which allows the packets to find a path from one place to another, which allows a logical connection to be established between two locations, so that blobs of text can be exchanged and fed into a browser layout engine to get your dang webpage. And to you, it's just pulling up GitHub.
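To give a feel for that per-hop decision, here's a toy routing-table lookup I've sketched in Go using the longest-prefix-match rule that real routers apply; the prefixes and link names are invented for illustration.

```go
// Toy model of the per-hop forwarding decision: find the most specific
// routing-table entry that contains the destination IP.
package main

import (
	"fmt"
	"net"
)

type route struct {
	prefix  *net.IPNet // destination network
	nextHop string     // upstream link to forward on
}

// lookup returns the next hop with the longest (most specific) matching
// prefix, which is how routers pick among candidate routes.
func lookup(table []route, dst net.IP) string {
	best, bestLen := "", -1
	for _, r := range table {
		if r.prefix.Contains(dst) {
			if ones, _ := r.prefix.Mask.Size(); ones > bestLen {
				best, bestLen = r.nextHop, ones
			}
		}
	}
	return best
}

func main() {
	mustCIDR := func(s string) *net.IPNet {
		_, n, err := net.ParseCIDR(s)
		if err != nil {
			panic(err)
		}
		return n
	}
	table := []route{
		{mustCIDR("0.0.0.0/0"), "isp-uplink"},        // default route
		{mustCIDR("140.82.0.0/16"), "backbone-west"}, // hypothetical route toward GitHub
	}
	fmt.Println(lookup(table, net.ParseIP("140.82.112.3"))) // backbone-west
	fmt.Println(lookup(table, net.ParseIP("8.8.8.8")))      // isp-uplink
}
```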
So that's abstraction in action on the internet. All right, so the internet's cool. I think the internet's awesome, just the way it works, and I could talk to you about that all day long. I didn't just shoehorn that into this talk, well, I kind of did, but it will be relevant, I promise.

So what is high availability? I'll tell you what it is not. I don't know if your bank does this, but it's incredible to me that a top-tier company like Chase would take these massive maintenance windows. It feels like they go down for something like six hours every weekend. I really hope the explanation there is something financial and not technical, because otherwise I feel like, man, they probably just gave up.

Talking about high availability is really the same thing as talking about fault tolerance, that is, the ability of your system to avoid, detect, and minimize the impact of failures. It's kind of an impossible task, because you're essentially facing chaos head-on and saying, "Bring it. I can react to the things you're going to bring." That's one of the reasons I think it's such a fun problem to try and tackle.

So how do we start? How do we attack chaos? Well, the first step, as with many problems that need to be solved, is admitting that you have a problem. Once we accept that things are going to fail, we can start thinking about why they are likely to fail. Once we've decided what is likely to fail, we can start taking steps to anticipate and recover from those failures. That's pretty much the process of constructing highly available systems in a nutshell: you anticipate failure, you prepare for failure, and then you react to failure. Rinse and repeat. Or, expressed as questions: ask yourself what could fail in your system, and then, for each thing, decide what you're going to do with the knowledge that it will eventually fail. And then, when things inevitably do fail, you figure out what you anticipated properly, what you didn't anticipate, how the preparations you made performed, and what adjustments you want to make for next time. That will be the shape of each of the primitives we look at here: the anticipated failure, the preparation we make for that failure, and the reaction we take when things fail. And you'll see that each of these primitives lives inside an abstraction, such that the rest of the system is protected from having to know the details of the steps being taken to prepare for failure.

So let's start with redundancy. Redundancy is probably the simplest form of fault tolerance. If you know that something might fail, just keep several of them on hand. Although in a fault-tolerant system, that means more than just keeping a closet full of spare parts. When we implement redundancy within a system, we use abstraction to allow the rest of the system to treat a collection of redundant copies as a single thing.

So let's take a look at an example. First up, when you think about what fails, it's really easy to say hardware, because we're all software developers, so we all write perfect software. So yeah, hardware fails, hardware components fail. Well, it turns out the hardware folks actually have this pretty well covered, and they have it covered via redundancy. The modern server is actually a really great case study in redundancy. It's got a bunch of features designed to mitigate pretty much all of the common component failures. With RAID, we treat a group of disks as a single disk, allowing any of them to fail.
With link aggregation, we do the same thing with network connections. And even the power can be made redundant with dual power supplies, or sometimes up to four power supplies per server. So in a modern server, you can literally snip any cable or pull out any hard drive and everything keeps working without so much as a hiccup. It's really amazing. It will probably just beep and email somebody. Hopefully it will beep and email somebody, because that's why you're snipping cables.

So the hardware folks are doing the best they can to prevent our servers from dying of component failure. Now we actually have to face the fact that the servers are still going to fail anyway. This is the layer at which we, the application developers, need to start participating a bit more. A large portion of designing for fault tolerance is asking yourself the question: what happens if the server craters? And that applies to both physical and cloud infrastructure. "What happens if this instance fails?" is the same basic question.

So then, how do we implement the redundancy primitive at the server layer? We need to allow the rest of the network to treat a group of servers performing the same task as a single server. It turns out there are several technologies in the internet stack we walked through where we can inject a strategic lie to accomplish this. The first is the venerable transparent proxy. It's a server that listens on a single IP address and delegates incoming messages to any number of listening servers behind it. The messages can then be processed by a fleet of servers that live on an internal network. Each of these servers processes messages as though they were coming from the proxy, and when the proxy receives a response, it transforms it so that it appears to be coming from the proxy itself. The world doesn't need to know that there are actually multiple servers living at this IP address, and each server on the private network doesn't need to know about the others. It just needs to do its job. But the overall system gains redundancy, and therefore reliability.

In addition to simple redundancy, the proxy also provides another benefit: load balancing. In fact, proxies performing this role are more commonly referred to as load balancers. This allows the overall system to handle more traffic than a single server would be able to alone. Of course, when you're thinking about high availability, it's important to understand the impact of losing a server in a load-balancing cluster. In this case, for example, if we wanted to be able to lose any one of the three app servers, we'd need to know that each server could handle 50% of the incoming traffic, since the remaining two would split the load.

So how does a load balancer know whether it can route traffic to a given backend server? Often the proxy server will be configured to make something called heartbeat requests to each backend server. These are small requests made periodically, with a specific response expected. If the response contains an error or is not received at all, the proxy removes that server from the rotation.

So now we've got a pretty sweet setup where we can lose any one of our servers, well, any one of our app servers. So what happens when we lose the load balancer? In our current example, the load balancer is what's called a single point of failure, or SPOF. An SPOF is a part of your system where the answer to the question "what happens when it fails?" is, "oh no." The practice of architecting a highly available system is often an exercise in chasing out the single points of failure.
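Here's a small sketch, under assumptions of my own, of the proxy-plus-heartbeat pattern just described: a Go reverse proxy that round-robins across backends and periodically probes a health endpoint, pulling failed servers out of rotation. The backend addresses and the /health path are invented for illustration.

```go
// A toy load balancer: round-robin over backends, with heartbeat checks.
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync/atomic"
	"time"
)

type backend struct {
	url     *url.URL
	healthy atomic.Bool
}

var backends []*backend
var next uint64

// heartbeat probes each backend; an error or non-200 removes it from
// rotation until a later probe succeeds.
func heartbeat() {
	client := &http.Client{Timeout: 2 * time.Second}
	for {
		for _, b := range backends {
			resp, err := client.Get(b.url.String() + "/health")
			ok := err == nil && resp.StatusCode == http.StatusOK
			if resp != nil {
				resp.Body.Close()
			}
			b.healthy.Store(ok)
		}
		time.Sleep(5 * time.Second)
	}
}

// pick returns the next healthy backend, round-robin, or nil if none.
func pick() *backend {
	for i := 0; i < len(backends); i++ {
		b := backends[atomic.AddUint64(&next, 1)%uint64(len(backends))]
		if b.healthy.Load() {
			return b
		}
	}
	return nil
}

func main() {
	for _, raw := range []string{"http://10.0.0.1:8080", "http://10.0.0.2:8080", "http://10.0.0.3:8080"} {
		u, _ := url.Parse(raw) // addresses are placeholders
		b := &backend{url: u}
		b.healthy.Store(true)
		backends = append(backends, b)
	}
	go heartbeat()

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		b := pick()
		if b == nil {
			http.Error(w, "no healthy backends", http.StatusBadGateway)
			return
		}
		// Clients only ever see the proxy's address; the backend fleet
		// stays an implementation detail.
		httputil.NewSingleHostReverseProxy(b.url).ServeHTTP(w, r)
	})
	log.Fatal(http.ListenAndServe(":80", nil))
}
```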
So when you solve an SPOF at one layer, one often crops up at another layer. I think of them like the little air bubbles under a sticker: when you're trying to apply it properly, you can chase them out toward the edge with a credit card.

All right, so let's see if we can tackle this load balancer problem. Remember the DNS step, where we translated the github.com domain name to an IP address? Well, it doesn't actually respond with just a single IP. It responds with at least three different IP addresses. This theoretically allows GitHub to use multiple endpoints to service browsers asking for github.com, each of which can be its own load balancer. And then, using the heartbeat pattern that we saw, the load balancers can each be checked for their health and removed from DNS responses if something goes wrong.

So it seems like a great win: no more single point of failure. But there's actually a rub here, which is that DNS is heavily cached at multiple layers. That means clients can potentially hang on to IP addresses that have been removed from the rotation upstream for quite some time. In this example, the .128 address has failed and been removed from DNS, but our browser still has the old value cached for five minutes, and our local DNS server has it cached for fifteen. So DNS is still useful for some things, but for the thing we're looking for here, not so much. Yes, I see you back there, chaos.

So we have to go back down to the IP layer if we're going to deal with this, which means we basically need one IP address to be held by multiple machines, which isn't really possible. But we can still get somewhere with a new primitive, which is clustering. Clustering is when a group of machines works together to ensure that a given service is provided by some subset of those machines. In this case, it's just one of these three machines that needs to hold the IP address. The relevant technologies here are algorithms like Paxos and Raft and software like Pacemaker.

This is going to reveal how dorky the internals of my brain are, but I sort of picture a cluster like one of those gangster movie scenes where you've got three thugs sitting around a table with some sort of dirty job to do, all staring at each other like, "someone here's gotta take care of that IP address," and they're all figuring out who's going to do it. That scene is not quite accurate, because that's usually glaring in silence, whereas a cluster in real life is the opposite. It's just constant chatter of, basically: You okay? You okay? I think he's okay. Is he okay? I'm okay. I can see him. Can you see him? I'm okay. Everything okay? I'm not okay. It sounds like an advanced protocol, doesn't it? They actually do speak an advanced protocol. It's complicated, and it determines who's alive in the cluster, and the members use that knowledge to decide whether one of the remaining members should take over the service, and if so, which one. So if, say, server B is the currently active load balancer for the .130 address, and servers A and C both agree that B is dead, they'll hold a quick election. They actually hold these blazingly fast things called elections, and they work kind of like real elections, to decide who should take over the IP.
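To give a feel for that failover dance, here's a drastically simplified sketch, nothing like real Raft, Paxos, or Pacemaker, of survivors noticing a dead peer and "electing" a replacement to hold the shared address. The node IDs and the lowest-live-ID-wins rule are stand-ins I've invented for illustration.

```go
// Toy failover: when the current holder of the shared IP goes quiet,
// the remaining live nodes pick a replacement.
package main

import "fmt"

type node struct {
	id    int
	alive bool
}

// electHolder picks which live node should take over the shared IP.
// Lowest live ID stands in for a real election protocol.
func electHolder(nodes []*node) *node {
	for _, n := range nodes {
		if n.alive {
			return n
		}
	}
	return nil
}

func main() {
	nodes := []*node{{id: 1, alive: true}, {id: 2, alive: true}, {id: 3, alive: true}}
	holder := nodes[1] // say node 2 currently holds the .130 address
	fmt.Println("node", holder.id, "holds the VIP")

	// Simulate node 2 cratering: its "you okay?" heartbeats stop arriving.
	holder.alive = false

	// The survivors agree it's dead and hold a quick election.
	if !holder.alive {
		holder = electHolder(nodes)
		fmt.Println("failover: node", holder.id, "takes over the VIP")
	}
}
```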
Clusters need to work closely together like this because it's very difficult, in fact kind of impossible, to determine as a single server whether you can't see somebody else because they're broken, or because you're broken. And that's where clusters come in really handy.

The act of transitioning a service from one server to another is called failover. Usually in a case like this it's qualified as automatic failover, because it's not a human flipping a switch. We'll see more of that as we move forward. A fun little side note in this particular example: if you have a cluster of load balancers like this, you can have them synchronize their TCP state tables, and you can actually perform a load balancer failover without dropping a single request. It's one of the coolest things I've ever seen: watching incoming requests, seeing the load balancers actually change roles, and watching the requests complete without any clue as to what's happening. I think it's just amazing. But it is worth noting that the magic of zero-drop load balancer failover is only possible because packet retransmission is built into TCP. What happens is, when you do the failover, you do drop a couple of packets, but the TCP protocol itself includes retransmission. So essentially the remote side misses a packet and asks for a retransmission, and up at the HTTP layer, nobody cares. Everything works fine.

So timeouts and retries are incredibly important. From the perspective of abstraction, they're a little bit of a crack in the veneer. Ideally, a client could always make a request and just trust that any failure would be handled downstream. But in the real world, that's just not the case, especially with HTTP services. I would bet you that a lack of properly set timeouts is one of the biggest causes of outages in modern web applications, which means that thinking about them in your system can bring you some potentially big wins.
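Here's a small sketch of that timeout-and-retry pattern: bound every request with a hard timeout and retry transient failures a few times with backoff, so a downstream hiccup (like a load balancer failover) doesn't become an outage for you. The URL, timeout, and retry counts are illustrative choices, not recommendations from the talk.

```go
// Timeouts plus retries: the client-side crack-filler in the abstraction.
package main

import (
	"fmt"
	"net/http"
	"time"
)

// getWithRetry performs a GET with a per-attempt timeout, retrying on
// network errors or 5xx responses with a simple linear backoff.
func getWithRetry(url string, attempts int) (*http.Response, error) {
	client := &http.Client{Timeout: 3 * time.Second}
	var lastErr error
	for i := 0; i < attempts; i++ {
		resp, err := client.Get(url)
		if err == nil && resp.StatusCode < 500 {
			return resp, nil
		}
		if err == nil {
			resp.Body.Close()
			lastErr = fmt.Errorf("server error: %s", resp.Status)
		} else {
			lastErr = err
		}
		time.Sleep(time.Duration(i+1) * 500 * time.Millisecond) // backoff
	}
	return nil, lastErr
}

func main() {
	resp, err := getWithRetry("https://github.com", 3)
	if err != nil {
		fmt.Println("giving up:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("got:", resp.Status)
}
```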
And now it's time to address the elephant in the room. It's a Postgres joke. So, managing state. The toughest class of services to make fault tolerant are the ones that manage state, of which databases are a prime example. The problem with redundancy and stateful services is that you need the overall system to agree on a single source of truth, and anytime you hear the word "single," you should think "single point of failure." This is an area where chasing around a single point of failure happens quite a bit. The trade-offs in attempting to make a data store fault tolerant are expressed in something called the CAP theorem, which we're only going to glance at. It's basically summed up as: consistency, availability, partition tolerance, pick two.

There are definitely a lot of different models for data storage that fall on different points of that CAP spectrum, but all of them involve some form of replication. Replication is the process of continuously shipping data from one place to another. It's a way of gaining redundancy for a stateful service. It can either be synchronous, which means that a given piece of data is not considered to be written until it's definitely in more than one place, or it can be asynchronous, which means that the write is acknowledged before the data has been copied elsewhere. Synchronous replication sacrifices performance, since each write takes longer to make sure it gets everywhere, to gain consistency, meaning at any given point you can guarantee that a given piece of data is in more than one place. Asynchronous is the opposite: you gain performance, because you don't have to wait for writes in multiple places, but you sacrifice consistency, which means that if you cut the system at a given time slice, you can't be precisely sure which piece of data has made it where. Once replication is in place in some flavor, other primitives like clustering, automatic failover, and even load balancing of reads can be layered on to allow your databases to automatically handle many different kinds of failure. So there are definitely tools out there for managing state in a highly available way, but it's definitely the most difficult part of any infrastructure.

The final primitive of high availability I'd like to mention today is monitoring. Now, monitoring is a huge topic. There are entire conferences dedicated to monitoring. But in the context of this talk, I'd just like to point out two key places where monitoring is very important.

One: failure is not always things outright breaking. Often it's a simple matter of a server reaching a perfectly reasonable limit and reasonably ceasing to work in response, something like a disk filling up. Of course, sometimes the limit is arbitrary and was set way too low for production. If any of you have ever had to deal with Passenger's default max pool size, I've lost a lot of hair on that one. But proper monitoring can tell you when a resource is getting scarce before it brings down the world. Again, there's a ton of nuance to monitoring, but at the very least, you can catch the big things. Disk space and memory are the definite low-hanging fruit here. It's pretty easy to catch a server well before its disk fills up, and it's kind of embarrassing to have your site go down because a disk filled up on a server you weren't watching. Entropy is a fun one. I don't know if you knew this, but computers have a limited pool of entropy that can get used up. If you're generating certain kinds of random data, like password salts and some forms of UUIDs, the computer can actually run out of entropy and block waiting on /dev/random. It's a really fun problem to diagnose, and it's way more common in virtualized environments, too.

Anyway, the second big point I wanted to make about monitoring is that it's going to be the biggest tool you can use to learn about failures you didn't anticipate. Keeping an eye on things like response times and system load can tell you a lot about potential failures that you wouldn't have thought about otherwise. And when failure does happen, looking back at the monitoring you did have, to try to figure out how you could anticipate it in the future, is a super important part of the process.
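As a bare-bones illustration of that low-hanging fruit, here's a sketch of a disk-usage check that alerts well before the disk fills up. It's Linux-only (it uses Statfs), and the path, threshold, and "alert" are placeholders; real setups would feed a proper monitoring system instead of printing.

```go
// Watch disk usage and complain before the disk fills up (Linux-only).
package main

import (
	"fmt"
	"syscall"
	"time"
)

// diskUsedPercent reports how full the filesystem at path is.
func diskUsedPercent(path string) (float64, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return 0, err
	}
	total := st.Blocks * uint64(st.Bsize)
	free := st.Bavail * uint64(st.Bsize)
	return 100 * float64(total-free) / float64(total), nil
}

func main() {
	for {
		used, err := diskUsedPercent("/")
		if err != nil {
			fmt.Println("check failed:", err)
		} else if used > 80 { // placeholder threshold
			// In real life: page someone, post to chat, etc.
			fmt.Printf("ALERT: disk %.1f%% full\n", used)
		}
		time.Sleep(time.Minute)
	}
}
```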
So if I could boil everything down about designing highly available, fault-tolerant systems into one question, it would be this: what happens when it fails? Ask it about every piece of your system: your app servers, your cache, the external APIs you depend on. I assure you, if you dedicate yourself to asking this question regularly about the systems you work on, you will inevitably build things that are more fault-tolerant. High availability should also be at the top of your mind whenever you're making a technology decision. There are always going to be trade-offs, but there are inevitably going to be better and worse options from the perspective of fault tolerance. Bringing up these questions early can help prevent you from getting painted into a corner with an unstable system because of earlier technology decisions where you weren't considering HA.

I'd also like to add that I think someday we'll make enough progress that most of us application developers won't have to worry about the details of building highly available, fault-tolerant systems. Eventually, we'll be able to raise the level of abstraction of our tooling such that, for the vast majority of application developers, HA is just as straightforward as pulling up GitHub. And you see the HashiCorp logo looming there in the background, because that's pretty much the someday we're working towards. It's probably a good time to mention I have HashiCorp stickers, so you can find me if you want one.

But in the meantime, high availability is something I think we should all think about. Chaos may be powerful and pretty dang unpredictable, being chaos, you know. But we do have quite a few tools on our side that we can use to fight back. And to that list I would add perseverance, which is a word that, it turns out, I cannot spell. Every instance of this word that you see from me has been right-clicked. But perseverance, a Chumbawamba-like dedication to learning from failure, is what's gotten us humans where we are today. We suck at things, and then we learn, and then we get better. It's a process that's not without its downsides, but it's worked out pretty well for us so far. Thank you very much. I look forward to hanging out with you all.