Okay, I'll begin a little in advance and take the two extra minutes to introduce myself. I'm Jean-François Gagné. I work at Booking.com as a system engineer. When I joined Booking, I landed in the DBA team. I had never touched MySQL before, so it was a strange assignment, a strange place to land, but I have really enjoyed my time there. It's been over three years now, and I'm going to tell you a war story from Booking. I think it's interesting to share war stories. The idea is not to say "hey, something bad happened, so we're bad." This really happened: it's production, it's the real world, and the mistakes that were made here are our learnings. I hope you can take some things away from what happened. And one key thing: this incident at Booking was caused by automation. Automation is supposed to give us less work, not to break things and need humans to fix them. So I think there are a few interesting things here.

I will start with that saying we all know: humans make errors, but to really mess things up, you need a computer. Or, in our case, a script, an automation script. We'll come back to what happened.

A little about Booking. We sell hotel rooms on the internet. We're quite big: there are more than a million properties on the website, so more than a million hotels you can book on Booking.com. When I joined three years ago, people were celebrating 400,000 hotels, so in three years we almost tripled that number. It's quite an amazing place to be, and we're still growing. We use MySQL: we have thousands of MySQL servers, and we use replication. This is a story about one of our replication deployments.

So this is what I'm going to talk about; you can all read it, so let's continue. This is a typical replication setup at Booking. We have a master where we write, we have slaves where we read, and asynchronous replication happens in the middle. We have intermediate masters, either for replication fan-out or to act as the point of contact for slaves in a remote data center. And we sometimes have slaves directly connected to the master. Quite a simple setup. Sometimes we have very small deployments, one local slave and two remote servers, and sometimes we have very big deployments.

This is an example of a medium deployment, shown in Orchestrator. Shlomi is not with us today, but Shlomi developed Orchestrator. We have our master and three intermediate masters, two of them in remote data centers; a dashed line here means a remote data center. Each of those intermediate masters has 40 slaves, so there are about 120 servers in that deployment. I would call that a medium deployment for Booking; well, medium-large. It's not a huge one, we have some that are bigger, but it's one of the respectable deployments we have.

So we use Orchestrator. Orchestrator allows us to visualize our replication deployments and to move slaves from one intermediate master to another, and it also does some automation: it can automatically replace an intermediate master when it fails. So if that node fails, at which point all the slaves under it stop receiving writes, Orchestrator will see that and fix things: it will point the slaves back to another intermediate master, or it will promote a new intermediate master.
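To make that repointing idea concrete, here is a minimal sketch of what "replace a failed intermediate master" amounts to. The topology model and function names are illustrative, not Orchestrator's actual code:

```python
# Toy topology: the master has intermediate masters, each with slaves.
topology = {
    "master": ["im1", "im2", "im3"],
    "im1": ["s1", "s2"],
    "im2": ["s3", "s4"],
    "im3": ["s5", "s6"],
}

def replace_intermediate_master(topology, failed, master="master"):
    """Repoint the slaves of a failed intermediate master under a healthy sibling."""
    siblings = [im for im in topology[master] if im != failed]
    if not siblings:
        raise RuntimeError("no surviving intermediate master; promote a slave instead")
    survivor = siblings[0]
    orphans = topology.pop(failed)
    # In MySQL terms, each orphan would get a CHANGE MASTER TO pointing at survivor.
    topology[survivor].extend(orphans)
    topology[master].remove(failed)
    return survivor

replace_intermediate_master(topology, "im2")
print(topology)  # s3 and s4 now replicate from im1
```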
Orchestrator also automatically replaces the master when the master fails. So in our example, if the master fails, one of its slaves will be promoted as the new master. Which is quite interesting, because when you deploy that as a DBA, you're not paged at two in the morning to fix things; things get fixed by themselves. And that's good, right? It's automation.

But Orchestrator cannot do that all by itself. Orchestrator is good at repointing slaves and moving things around; it understands MySQL. But Orchestrator doesn't understand the way we use MySQL at Booking, and the way we use MySQL at Booking might be different from the way MySQL is used at GitHub. At Booking, we use DNS to know which server we should write to as the master. Orchestrator doesn't know that we use DNS. So when Orchestrator does a failover, it calls a script, one of our scripts at Booking, to repoint DNS. When it does that at GitHub, or in your deployment, it calls your script, which tells your application where to find the new master. And this script is the one that was the problem.

Before looking into the problem, let's look at the deployment where we had it. It's a very simple deployment. There are four database servers: A is the master, B is a local slave, X is an intermediate master in a remote data center, and Y is a remote slave. This is a small deployment at Booking. The DNS entry of the master points to A, and reads happen on B and Y. At that point, everything is going well.

Now we have the first event. The first failure is that both A and B failed at the same time; I will explain later how that happened. At this point, we cannot write to the master anymore, and reads on B are failing. But Orchestrator is there, right? So Orchestrator fixes things. At this point it doesn't have a lot to do: it doesn't need to repoint slaves, it just needs to call the script that repoints DNS. So we can write again. Fixing the reads that were happening on B was actually done manually, but that's not as big a problem as not being able to write to the master. So now we're back in a working situation: writes happen on X, reads happen on Y, and everything is fine.

The day ends, we go home, and something happens during the night. I wake up to this. That was Saturday morning, a very, very bad morning. You wake up, you check what's happening, and you realize writes are happening on B. Okay, what happened? Let's keep looking: reads are happening on Y. So we are not reading what we are writing. A very, very bad position to be in. That's obviously not good: B has outdated data. All the writes that happened on X didn't reach B, so when we read from the master, we don't see that data. And all the new writes happening on B do not reach Y, so when we read on Y, we are not reading the data we have written. It's really, really bad. It's not a split-brain as such, there were never two active masters at the same time, but there was half of the data on the left and half of the data on the right. I will explain how we fixed that later; first let's see what actually happened.

After the failure of both A and B, a second failure was detected by Orchestrator. I know that because I have the Orchestrator logs. So I now understand that after everything was fixed, A and B came back up during the night. They resuscitated, in a way.
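Before going on with the story, here is roughly what such a DNS-repointing hook can look like, as a minimal sketch. The ORC_FAILED_HOST and ORC_SUCCESSOR_HOST environment variables follow Orchestrator's hook convention, but treat them, the entry name, and the DNS helpers as assumptions; the fake DNS table stands in for whatever your DNS tooling provides. The precondition check near the end is exactly the one that mattered in this story:

```python
import os
import sys

# Fake DNS table so the sketch is self-contained; real code would call
# your DNS provisioning API instead. It reproduces the story: the writer
# entry currently points at X.
DNS = {"master-writer.example.com": "X"}

def resolve_master_dns(entry):
    return DNS[entry]

def point_master_dns(entry, host):
    DNS[entry] = host

def main():
    # Orchestrator passes the failed master and the promoted slave to hooks
    # (assumed env var names; defaults here mimic this incident).
    failed = os.environ.get("ORC_FAILED_HOST", "A")
    successor = os.environ.get("ORC_SUCCESSOR_HOST", "B")
    entry = "master-writer.example.com"   # hypothetical writer DNS entry

    # Precondition: only move the entry if it currently points at the server
    # that just failed. If it points anywhere else (as it did here, at X),
    # stop and let a human think it through.
    current = resolve_master_dns(entry)
    if current != failed:
        sys.exit("refusing to repoint %s: it points at %s, not at failed host %s"
                 % (entry, current, failed))
    point_master_dns(entry, successor)

if __name__ == "__main__":
    main()
```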
At that point, something made A fail, and when A failed, something pointed DNS to B. It was the failover script: when A failed, the failover script stole the DNS entry from a perfectly valid server, X, and pointed it to B. So now we know how it happened. But why did A fail?

When A and B came back up, they had outdated data. My initial idea, because I was there when A and B failed, was to bring B back up first and then slave B under X, and that would have been fine. But because A came back up, that became impossible: A injected Pseudo-GTID, B applied those events, and so on, so B now had data that was not on X. At that point, I decided to recover from that and then put A and B back under X by re-cloning them with XtraBackup. If I had cloned B, there wouldn't have been any problem. But I chose to clone A, because I was still on my previous plan of pointing B back under X, and at that time I was planning to delete A. I didn't take into account that things had changed. That's the first human error, and it's my mistake; I will not do that again. But it's not the only thing that caused this problem.

I did that in the evening, before waking up the next morning to that disaster, and at that time everything was working; I was still watching things and there was no problem. So B was actually promoted as the new master in place of A: Orchestrator detected the failure of A hours later. This is because I had only downtimed A for four hours. When I re-cloned A, I thought, okay, I don't want anybody to be paged on this, and I downtimed the box for four hours. That was a mistake; I should have downtimed it for a week. This is human error number two.

And still, the script should have realized that that failover, well, that DNS repointing, shouldn't happen. Orchestrator has mechanisms to avoid flapping: it will not fail over the same chain a second time within a short amount of time. And there is also what I call human error number zero, which is not mine, and it's debatable whether it's a mistake at all: Orchestrator needs you to acknowledge a failover before another failover will happen. So if A and B are not yet re-cloned, is it good or bad to acknowledge? Well, it was acknowledged in that case. Maybe it shouldn't have been; there are arguments both for and against acknowledging.

So there are actually seven things that caused this problem: two servers failing, human error number zero, an edge-case recovery, A and B coming back up, me re-cloning the wrong server, me not downtiming that server long enough, Orchestrator failing over something it probably shouldn't have (because a human would never have triggered that failover), and the DNS script that stole the DNS entry from X and pointed it to B. For each of those, I think there's a takeaway. I'll list them later, but you can already think about it: what could be done differently to avoid each of those problems?

So let's talk about our fancy failure. Two servers failing at the same time is very unlikely. So was it a deployment error, A and B in the same rack? Or were we very unlucky? Actually, it was both. On that day, in less than two hours, 10 to 20 servers failed in the data center. This was because of human operations in the data center: people were racking new servers, and because of some very sensitive hardware, there was a low probability that racking a new server would shut down another server in the same rack.
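Coming back to the flapping and acknowledgement point for a moment: the guard can be expressed in a few lines. This is a minimal sketch of the idea, with all names and the quiet-period length made up for illustration, not Orchestrator's actual logic:

```python
import time

FLAP_WINDOW_SECONDS = 3600   # illustrative quiet period

last_failover = {}    # cluster -> timestamp of the last automated failover
acknowledged = set()  # clusters whose last failover a human has reviewed

def may_failover(cluster, now=None):
    """Block a second automated failover on the same cluster until a human
    has acknowledged the previous one, or the quiet period has elapsed."""
    now = time.time() if now is None else now
    previous = last_failover.get(cluster)
    if previous is None or cluster in acknowledged:
        return True
    return now - previous > FLAP_WINDOW_SECONDS

def record_failover(cluster):
    last_failover[cluster] = time.time()
    acknowledged.discard(cluster)   # a new failover needs a fresh ack

def acknowledge(cluster):
    acknowledged.add(cluster)       # "yes, the topology is healthy again"
```

The subtlety in this story is what acknowledging means: it tells the automation the topology is healthy again, so acknowledging while A and B were not yet re-cloned re-armed the automation too early.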
So my point here is: double failures happen. Don't necessarily automate for them, but be ready for them to happen, because 10 servers failing in half an hour is possible.

So how did we fix it? Once you're in that situation, with data on X that is not on B and data on B that is not on X, either you declare part of that data lost, which is what we did in this case because it wasn't really valuable data, or you have to replay the data. Replaying the data requires understanding what that data is. The DBAs at Booking are a small team, 10 people serving 600 developers; we cannot know all the data they're putting in the databases, so we are not able to recover that data by ourselves. We would need the developers, who understand the data. So it was easier here to simply drop that data.

We could have used replication: we could have slaved one side under the other, made this a master of that, and sent the data back across. But at that point auto-increments got in our way, because when we failed over, some auto-increment values were consumed on X, and when the DNS was pointed back to B, the same auto-increment values were allocated to different rows. So we had conflicting auto-increments. And that made me think: okay, we use auto-increment all the time, but is it a good thing? Shouldn't we use another type of ID?

So those are my takeaways. Really twisted situations happen. Automation is not simple: there should be no shortcuts in an automation script. If you write an automation script, test every precondition. Okay, a test may look stupid, but maybe it's a good thing to check that you are actually moving a DNS entry away from the right server. Is that DNS entry really pointing to A at this point? No, it was pointing to X: stop, I need a human to think this through. Premature acknowledgement: be mindful about that. Two takeaways for myself: downtime more rather than less, and shut down the slave first. And something that will take more time: maybe rethink the way we use auto-increment, and maybe use UUIDs. UUIDs are usually generated randomly, but there are tricks and algorithms to generate UUIDs so that they come out in order; that way you still insert in primary-key order, and you should use those tricks if you want to keep the insert performance of your tables (there is a sketch of this trick below, after the first question).

Maybe there's also something to do in Orchestrator, because I'm pretty sure a human would never have failed over A in that case. So Orchestrator should probably do something about those kinds of failures. It's still the one percent, so I'm not sure it's worth a lot of time, but we have to think a little more about how to detect those situations. If you have thoughts, open an issue or come discuss it with me after the talk. So, some links, and I have three minutes for questions.

Yes? Yes. So the question was: why should we shut down the slave first? When we were in that situation, I decided to stop A and re-clone it with XtraBackup: I took a backup from Y and re-cloned A from Y. But at that point I shut down a master, because that box was still known as a master to Orchestrator. I should have shut down B first, not A. So if you shut down a master, be mindful: does that box have slaves?
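Here is the ordered-UUID trick mentioned in the takeaways, as a minimal sketch. It rearranges a UUIDv1 so the high bits of its timestamp come first, in the same spirit as the swap flag of MySQL 8's UUID_TO_BIN; the function name is mine:

```python
import uuid

def ordered_uuid():
    """Return a roughly time-ordered UUID string.

    A UUIDv1's hex layout is time_low(8) time_mid(4) time_hi+version(4) rest(16).
    Moving the high bits of the timestamp to the front makes later UUIDs sort
    after earlier ones, so an InnoDB table keyed on them is still inserted in
    primary-key order instead of at random positions.
    """
    h = uuid.uuid1().hex
    return h[12:16] + h[8:12] + h[0:8] + h[16:]

a = ordered_uuid()
b = ordered_uuid()
assert a < b  # generated later, sorts later (within one process)
```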
Yes? But which one is it? So the question was: after failing over, should you kill the box? Well, in that case, Orchestrator detected the failure of A, and that's fine; killing A wouldn't have solved the problem. The problem was that the DNS-repointing script took a DNS entry that didn't belong to A and made it point to B. A solution could have been to kill X at that point. But again, if your script detects that it is failing over something that doesn't belong to the box that actually failed, there is something wrong there. But that's not what you mean? Yes?

Well, actually, at that point those boxes were dead; their uptime afterwards was only a few minutes, so those two boxes had rebooted. You can basically think of it as the power cable being pulled from them and plugged back in. At that point it's really hard to definitively kill A and B, unless you make sure you cut the network cable, or you have an out-of-band way of shutting down the switch port, which could be a good solution. That could be something else we do in automation. I'm not sure it's the best thing to do, but for somebody who is paranoid about those things... I usually think of myself as paranoid, and that's not an idea I would have had, so it looks like you're more paranoid than me.

So the question was whether I got a call. On Saturday morning I just logged in, and because I knew cloning was under way, I checked things; I wasn't called at that point. There was a log of the Orchestrator failover at midnight, and at that time I was sleeping. So on Saturday morning it was detected because of my curiosity, not because anybody was paged. Actually, it's a good thing I was curious, because the longer we had waited, the more data we would have lost.

So thank you very much. If you have more questions, I'll be around; I'll be at the community dinner, so grab me. And thanks, Frederick, for organizing things.