And welcome. Thanks for coming, everyone. I'm going to talk to you about distributed decision-making. The first question is: why should you care about distributed decision-making at all? Simply put, it's the factor behind the reliability of Google, the reliability of the internet, of Facebook, of Akamai, of all the big providers that will ship you your web page, your Instagram feed, or anything you want at any time. Think about how often you actually went to your browser and typed Facebook.com just to check whether the internet was still up. You've always trusted that it would be up. I used to be part of the Google team that makes sure this happens. It's called Site Reliability Engineering. And today I'm going to show you one of its core tricks, one of its core tools of the trade.

So, where do we start? Let's see what consensus is. The internet is reliable because we have a lot of redundancy. Every core function is served by hundreds of computers at the same time. If any of these computers fails, another one will just take over and you won't even notice. Keep in mind that for many of your Google queries, one of the computers that should have served your query is actually down. A hard drive broke, some cable broke, maybe the entire data center lost connection because someone cut an undersea cable. That last one is actually something that happened to me.

So redundancy is the key to reliability. But redundancy is useless without consensus, because you want consistency. If you have a hundred computers that can give you the answer to the same question, you want to make sure they all give the same answer. If they give you a hundred different answers, it's no good. That is consistency. And to reach consistency, these computers have to have consensus about the state of the world. So in more formal terms, the mission is: agree on one single value among a bunch of computers. We have three goals for this: consistency, convergence and availability.
Consistency means they all agree on the same value. Convergence means they do agree on a value at some point. They might take their time, but they will get there; they won't get stuck on the way. And availability means that no matter which of these computers fails, the rest will still reach consensus. There's no single point of failure.

So let's look at a really nice and simple system. This is an account database; think of an account with any internet provider where you want to register your email address. There are three servers in this system, the black boxes in the middle, that store the data for the database, and there are clients, the blue laptops here, who will try to create an account. Some of the clients try to create the same account at the same time. They will be trying to set different passwords for that account, of course, but we may only allow one of these passwords to win. If we allow multiple passwords to get into the system, to be accepted for the same account, bad things happen, because you would be able to log in at some machines and not at others.

So let's look at the simplest flow. In human terms, this is essentially monarchy. We just declare one of the computers to be the king. Because we don't like to say king, computer scientists say master. The other servers down there may serve answers, but they will never change the value that's agreed on, so they are minions. Up there you see messages. In a distributed system, computers can send messages to each other, but these messages might get lost in transit. The left client sends the message "dog": it would like to set the password to dog. The master gets that message first, so it says: okay, I don't have an account with this username yet, so you're fine. I've set the password to dog, all is good. Then the second client comes along and says: I would like to set the password for user foo to cat. And the master server answers: no can do.
I've already set the password to dog before, please go away. This system has consistency, pretty trivially: the master just makes sure that it's consistent. It has convergence: the master will just immediately set the password if possible. But it's not available: if the master goes down, the system is stuck. This actually used to be the state of the art for 30, 40 years, up until about the year 2000, and system administrators were mostly busy switching masters around.

Second approach: anarchy, if you will. You can just ask any server to set a password, and that server will happily set it immediately. So what happens now? The first client goes to one of the servers and says: I would like to set the password for the account to dog. And the server says: sure, go for it, it's set to dog. But before the bottom left server can tell anyone else about it, the bottom right server gets a message: I would like to set the account password to cat. And the bottom right server says: sure. Now we're in an inconsistent state. What should we do? What's the right password anyway? So this system does converge, just to multiple answers, but it does converge. It's available, because if any of these servers goes down you can go to any other. But it's not consistent.

Next slide. Let's try democracy. Any of the clients here, the blue computers, may become the leader of an election. For simplicity, we allow clients to become leaders; in a real system you wouldn't do this, for security reasons. They all send proposals to the servers. They can send proposals to one, two or three servers, that's their choice. And each server will only accept a single proposal. So our client on the left sends its proposal to two servers and gets an OK from both. Our client on the right also sends its proposal to two servers, but gets denied by the one at the top, because that one has already accepted a different proposal. So the rule now is very simple.
If you have an absolute majority of OKs from the servers, you may go ahead and commit your value to the system. So the client on the left wins. Why? Because it got two OKs, and two is a majority of three, the quorum. So it can go ahead and send a commit message to all the servers in this voting group, saying: I won the election, so write down dog as the password.

This is consistent. Good news, because you can only get a majority once. And it's available, because if one of these servers goes down, you can still get a majority. But it doesn't converge. If three clients try to get a majority at the same time, and each of them grabs one vote, the system can't converge anymore. And if one of them, like the client on the left, crashes or loses its internet connection before it can send out the commit message, we haven't actually agreed on anything yet. But it had already collected all the votes, so no one else can drive agreement on anything either. So this also doesn't work.

The solution to all of this is the Paxos algorithm. Paxos was invented by Leslie Lamport, who has worked in this field for decades. He's the guy for decision-making algorithms. And he's written an article about it without mentioning the word computer once. It's a totally non-technical article; go ahead, read it. It's all about priests and ledgers, and if you don't know anything about computers, you can still understand it. And it's an incredibly subtle thing. You'll think about the paper for a day, and then you'll understand it. Hopefully my talk helps.

So Paxos goes in three phases. First, you prepare a vote. Then you propose a value. And when you get enough acceptances for the proposed value, you commit it. And you make sure that progress happens by doing multiple ballots. Every ballot is one round of voting through these three phases.
And if one ballot fails for any reason (messages don't arrive, computers crash in the middle, whatever), then we can just start another ballot and we're good.

So, the prepare phase. The leader of the election, on the left side, sends prepare messages to the voters: please prepare ballot number one. These voters, the servers, haven't promised anything to anyone yet, so they go ahead, promise ballot one, and tell the leader that they haven't voted yet. All is good. The leader got two promises for its ballot, so we go to phase two: the leader proposes. It sends propose messages to all the servers that accepted its preparation, and it proposes the value dog. Both servers haven't received any contradicting messages in the meantime, so they say: yeah, sure, it's okay. And the leader can say: yep, I got two okay messages, two is a majority of three, so I'm good. I have the majority. I can commit the value dog to all the servers. And dog is accepted.

This algorithm has all three properties we originally wanted. Consistency, because you need a majority to win. Convergence, because we can restart this algorithm with a new ballot at any point to come to a conclusion. And availability, because if any of the three servers goes down, we can still reach a majority and still reach consensus.

Now let me show you the recovery. What happens if a client actually goes down? In this case, the client that just got the majority goes down, and it couldn't send out its commit messages yet. So some of the servers have voted for something, some have not. They definitely don't know what the accepted value is yet; they still think a value needs to be agreed on. So a second leader steps in, a different client. It just sees that things don't seem to be moving forward and tries to step in and make them move.
So at some point it will actually start a new ballot and send everyone a prepare message: we're preparing ballot number two. And remember that the server at the top voted in the last ballot. So it will tell the new leader that it voted in the last ballot, and that it voted for dog. And that's the key part of Paxos: if there was a majority before that voted for anything, then any new majority you assemble has to have at least one member that knows about the previous vote. And the leader of this election, even though it would like to set cat as the password, has to honor the previous vote. That's the core rule of Paxos. That's why it works. So this leader now knows it can go forward; it has enough promises from the different servers to go forward with its vote. But at the same time, it has to propose the value dog. And so the entire system will converge to dog in the usual way, even though the original leader has failed. And that's the magic of Paxos: it can recover from these errors. That's the thing you have to keep in mind.

So where is Paxos used? On this foundation of choosing a single value, you can build houses. The entire internet is built on this small fact. If you can agree on one value, you can agree on a series of values: a log. If you can agree on a log, you can agree on database content. That's what the big, very high availability databases do, like Google's Chubby, Apache ZooKeeper or etcd. These are virtually unknown; I guess if I asked who knows them, there would be maybe ten hands in the room. But they are the core of the internet. Whenever any of the important components, the load balancers, the master elections, the big databases like MySQL, services like Netflix or Akamai, and pretty much anything at Google, wants to agree on which traffic goes where, which content goes where, which server handles what, it will use one of these three big databases to back it.
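To make the walkthrough above concrete, here is a minimal single-value Paxos sketch: the three phases (prepare, propose, commit) plus the core recovery rule, honoring the highest-ballot value anyone has already voted for. All names are my own illustration; a real implementation would also need networking, retries and persistent storage.

```python
class Acceptor:
    def __init__(self):
        self.promised = 0     # highest ballot number promised so far
        self.voted = None     # (ballot, value) of our last vote, or None

    def prepare(self, ballot):
        # Phase 1: promise to ignore older ballots, and report any prior vote.
        if ballot > self.promised:
            self.promised = ballot
            return True, self.voted
        return False, None

    def propose(self, ballot, value):
        # Phase 2: vote, unless a newer prepare arrived in the meantime.
        if ballot >= self.promised:
            self.promised = ballot
            self.voted = (ballot, value)
            return True
        return False

def run_ballot(acceptors, ballot, my_value):
    majority = len(acceptors) // 2 + 1
    replies = [a.prepare(ballot) for a in acceptors]
    promises = [prior for ok, prior in replies if ok]
    if len(promises) < majority:
        return None                      # try again later with a higher ballot
    # Core rule: honor the highest-ballot value anyone already voted for.
    prior_votes = [p for p in promises if p is not None]
    value = max(prior_votes)[1] if prior_votes else my_value
    if sum(a.propose(ballot, value) for a in acceptors) >= majority:
        return value                     # phase 3: commit, consensus reached
    return None

# Recovery scene from the talk: a first leader gets a vote for "dog" in
# ballot 1, then crashes before it can commit.
servers = [Acceptor(), Acceptor(), Acceptor()]
servers[0].prepare(1)
servers[0].propose(1, "dog")
# A second leader starts ballot 2 wanting "cat", but must honor "dog".
print(run_ballot(servers, 2, "cat"))   # -> dog
```

Note how the second leader's own preference never matters once any promise reports a prior vote; that is exactly why the system converges to dog even though the original leader failed.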
So whenever you go on the internet today, this Paxos technology is something you use, probably multiple times, even if you just open up Google.com. So spare a thought for that, maybe read the Paxos paper. And I think there's a very nice thought to keep in mind here: the most reliable thing that we could find in fifty, a hundred years of computer science is democracy. It's open voting. I think that tells us something about our society. Thank you very much.

Thank you, Steve. We have five minutes for questions. Would someone like to ask a question about anything, for example anything from the talk, or about Google, or about reliability? Someone? Okay, please.

Thank you for the talk. If I understood correctly, the algorithm is basically decentralized. How does it scale as the number of servers grows and they are in different locations? If servers are in Australia and in Canada, you have to wait until they all come together. What sort of approaches do you take to make sure they reply fast enough?

This algorithm needs five messages to converge: from the leader to the most remote location and back, there, back, there again. So five messages, or two and a half round trips. You can scale this up in multiple ways. Normally you don't make every server a voter, because you don't need a thousand voters. If you have a thousand servers, you make five of them voters, and the other ones just follow these five; they're so-called learners. That's one way to scale such a system. Also, typically not all data would actually go through a Paxos system. There's the very core, the heart of the system, and then you tack on multiple things to make it work fast. But you're right that latency, the time for messages to travel around the world, is one of the core drawbacks of Paxos and one of the core challenges to tackle when you build such a system. Thank you for answering this. Some more questions?
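The voters-versus-learners trick from that answer can be sketched roughly like this (the numbers and class names are my own illustration, not any real system's API): a small fixed quorum runs the consensus rounds, and every other server simply adopts whatever value the voters committed.

```python
# Sketch of the scaling trick: few voters, many learners.

class Learner:
    def __init__(self):
        self.value = None   # a learner never votes

    def learn(self, value):
        self.value = value  # it simply adopts what the voters committed

VOTERS = 5                                    # small quorum running Paxos
learners = [Learner() for _ in range(995)]    # rest of a 1000-server fleet

committed = "dog"                 # outcome of the voters' consensus round
for learner in learners:
    learner.learn(committed)

print(all(l.value == "dog" for l in learners))   # True
```

The design point is that consensus latency depends only on the five voters; fanning the result out to the 995 learners is plain replication with no voting round trips.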
Maybe. Oh, please. Wait a second.

I'd actually like to follow up on the previous question. You said you choose five servers as voters. Would this be per vote, or would you have to set those five to be voters for all votes?

In classic Paxos you have to know all the voters in advance; that's a core feature. These five voters can agree on picking different voters, but you have to know who the voters are. If you can't know that in advance, then you basically have to use a blockchain; that was developed precisely for not having to choose the voters in advance. Thank you so much.

So we have time for one more question.

Thank you very much for your talk. If nowadays everything up to the traffic routing is based on this Paxos thing, and Paxos appeared a bit later than networking and the internet, what happened before it appeared? Was it magic, or some other stuff?

Half manual labor. To put this in perspective: Google nowadays serves billions of users with hundreds to thousands of SREs. Sorry, what does that mean? Site Reliability Engineers, thanks. Back in the day this would have needed tens of thousands of IT staff, because every time something went down, you had to manually switch masters. Back then they did master-minion replication, so you had to manually switch over to a new master when a data center was down. And this also broke all the time. Remember the times when there was scheduled downtime? That was mostly because someone needed to switch a master. So lots of hard work, blood, sweat and tears. The internet of the old days, right?

I also wanted to mention that you can approach Steve in the break, or any of our speakers. This is one of our things: if you have questions, just go directly to the speaker and talk to them. And please, Steve, you wanted to add something?

I would like to say one more thing. It's literally 20 years of Google today.
I would like to congratulate Google on behalf of 15x4, but since there is no Google here in physical representation, I would like to congratulate Steve. Thank you. Thank you for making our lives easier, and for collecting all of our data too.