Hi, everyone. Thank you for coming. My name is Shlomi Noach. This is the un-split brain MySQL session. I'm with the GitHub Database Infrastructure team. I guess you know GitHub. The motivation for this session was an unfortunate outage that we took a few months ago. We had a split brain scenario, and I would like to quickly illustrate, MySQL-wise, what happened, and then discuss the rest of this session.

So what we had, basically, was a data center network partition problem. An entire DC got network partitioned. So imagine we had this cluster — we had multiple clusters, of course, but let's discuss just this one. We had this database cluster with this master, and the master and the replicas got network isolated. Our orchestrator failover mechanism kicked in and failed everything over, all the databases, too. So all the masters, all the writers failed over to the other DC, to the failover DC. Orchestrator promoted this half of the cluster and made this the master. But during that time, while it was network isolated, the old master still took some writes from local applications. So when this one was promoted, and all the traffic failed over to using this master, the old one had already diverged. Now, this was expected behavior. It was actually documented, we knew about it, and that was supposed to have been fine. But unfortunately, and for reasons outside the scope of this session, we had to fail back to this cluster, to this DC. We just had to. By the time we realized we had to fail back, there was like half an hour's worth of writes on the newly promoted master. We couldn't just throw them away, because half an hour is a lot of production traffic. Now, the old master only took a few seconds' worth of bad traffic, right? Maybe just a thousand transactions or so. It's not too much, maybe half a minute, depending on the master. But they diverged. We couldn't fail back without losing data, and we couldn't set up replication between the two because they had diverged. They would no longer cooperate. So the big outage time was due to the time it took us to restore from backup all these dozens and dozens of servers in this DC, catch up with replication, et cetera. So you know the deal.

The question was: could we have somehow just rolled back those bad thousand transactions and magically amended the database to be back in sync? And so I would like to introduce GitHub's gh-mysql-rewind. That's a new tool that does just that. What it allows us to do, given a partitioned or split brain scenario, is take one server at a time and connect it back into the healthy cluster, and it will be fine. It will be fixed. It will be replicating. The cache will be warm and we will be happy. So I'd like to illustrate how this works. This session illustrates the steps to making this work.

There are a few components to this solution, and the first is GTID. We run Oracle MySQL GTIDs, and GTIDs have many properties; I'd like to illustrate one of them. One property of GTID is that each server remembers forever the entire set of GTIDs that have ever been applied on that server. You can expect these numbers to be in the millions, in the billions, whatever. A server has its own UUID, and then it says, yeah, I have GTIDs one through something. But then, as binary logs get purged, the server still maintains the identities of the GTIDs whose binary logs have been thrown away, just for the record.
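To make that property concrete, here is a minimal sketch of that bookkeeping — the UUID and the ranges below are made up for illustration:

```bash
# The full set of GTIDs ever applied on this server -- remembered forever:
mysql -e "SELECT @@GLOBAL.gtid_executed\G"
# gtid_executed: 00020192-1111-1111-1111-111111111111:1-5000000

# The subset whose binary log events have already been purged; the events are
# gone from disk, but the server still keeps their identities on record:
mysql -e "SELECT @@GLOBAL.gtid_purged\G"
# gtid_purged: 00020192-1111-1111-1111-111111111111:1-4200000
```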
The server knows that at some point in the past it applied those entries; they're not here today, but it knows about them. This property is very helpful to us. So let's look at the cluster before the outage and then at the two split brains. Before the outage, the master applied some binary logs. After the split brain, the old master applied some extra bad writes; those are the ones we want to get rid of. And the newly promoted master took normal production traffic — it's the green one over here. Both keep track of exactly what has been applied. And it turns out I can mathematically subtract, do this minus this, to get the identities of the GTID entries that have been applied on this server and not on that one. The new master used to replicate from the old one, so all of that shared history is valid; we just need to identify which parts are the bad GTIDs. So basically it's a subtraction: I take the gtid_executed of this master minus that of the other one, and that is the set of GTIDs that are the offensive entries. So now we know which transactions were bad.

The next thing is that we're using row-based replication. With row-based replication and with binlog_row_image=FULL, we have this thing in the binary logs where, if you run mysqlbinlog with the verbose flags, et cetera, each entry in the binary log — each update, delete, and insert — tells you the entire image of the row before the change and the entire image of the row after the change. For an update, that means all the columns before the change are listed with their values, and all the columns after the change are listed with their values. The same for insert and delete. And that makes it possible to know what you need to revert: if you just swap these two images, you basically revert the operation.

Now, we don't need to do that by hand. MariaDB has the flashback option. This is a contribution by Alibaba, contributed many years ago and incorporated by MariaDB. So in MariaDB's mysqlbinlog binary, you have this --flashback option, and what it does is as follows. Imagine this is the normal binary log: you have inserts, you have updates, another insert, a delete. If you run mysqlbinlog, that's what you'll see. If you run mysqlbinlog --flashback, what you get is the image on the right: the anti-statements, in reverse order. It's the undo log. The first insert here is the anti-statement of this delete; this delete is the anti-statement of this insert; this is the anti-statement of this update, et cetera. So if I apply this binary log, and then this one, I've done nothing in effect, right? I've nullified all the operations. And that's the basis for our rewind tool.

So what we have so far: GTID tells us which transactions are bad, and flashback gives us the mechanism to undo transactions. But we're still missing a lot here. We're missing the connection between the GTIDs and the binary logs and the entries in the binary logs. So let's look at this again. We have the two masters, and one has diverged from the other. We want to take that bad master, this contaminated master, so to speak, and move it back in time. Ideally, we would move it to exactly the point of the split. But it is also valid to move it a little farther back in time, because any point on this line is consistent. It's OK. We waste a bit of effort, but it's OK if we take it a little farther back in time.
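As an aside, the subtraction from a moment ago maps directly onto MySQL's built-in GTID_SUBTRACT() function. A minimal sketch, with hypothetical host names and made-up GTID sets:

```bash
# gtid_executed on each side of the split (values are made up for illustration).
old_master_gtids="$(mysql -h old-master -Nse 'SELECT @@GLOBAL.gtid_executed')"
new_master_gtids="$(mysql -h new-master -Nse 'SELECT @@GLOBAL.gtid_executed')"

# GTIDs applied on the old, contaminated master but not on the newly promoted
# one -- exactly the offensive transactions we want to rewind:
mysql -Nse "SELECT GTID_SUBTRACT('${old_master_gtids}', '${new_master_gtids}')"
# e.g. 00020192-1111-1111-1111-111111111111:5000001-5001000
```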
Now, if we look at the binary logs on the contaminated master, those bad GTIDs are somewhere within the binary logs. They could be within a single binary log, or span two or three, or whatever. It turns out it's pretty easy to identify where these entries are. Every binary log has a header that says Previous-GTIDs. So at the beginning of every binary log, it says: these have been all the GTIDs thus far, this is my history, and now I'm a new binary log and this is what's going to happen next. And if you take all the existing binary logs and parse the headers from all of them, you can slice it, and you know exactly which GTIDs are contained within each and every binary log. So we're able to figure out that the bad GTIDs are within, say, these two binary logs, 622 and 623.

So the next step is: we've identified the bad GTIDs, we know where they are in the binary logs, and we use mysqlbinlog --flashback to generate the anti-binary-log of these two binary logs. We generate the anti-binary-log of 623 and the anti-binary-log of 622, and then we apply them to MySQL. Does that make sense? Not so simple. Because MariaDB's mysqlbinlog has no notion of MySQL GTIDs. It's a different product; it doesn't speak MySQL GTID. But we need MySQL GTID to tell us what went wrong. So we need to do a little bit of hacking, like this simple awk and sed strip, which I clearly don't need to explain, it's self-explanatory. We do a little bit of hacking, we inject some GTIDs, we throw that back into MySQL, and we literally move MySQL back in time.

So far so good? Yes? Not so good. We moved back in time. The big question right now is: where are we? Is this the Renaissance? Is this the French Revolution? You know, in all these movies, you move back in time, you need to figure out where you are. We are not in a good position right now, because we can't just join back the replication stream yet. If you look at the binary logs right now on that demoted master: we had those normal entries, we then had those bad entries, and we just applied more weird entries. The data set is consistent with some point in time, but the binary logs do not agree. The gtid_executed on that master thinks it's someplace completely different; something is really, really messed up right now. So now what we need to do is match this with the data set. We need to tell our time machine: yes, it's now 1995, right? You've jumped back in time to this and that date. And it turns out we can actually do that, because we rolled back, say, two binary logs, and we know which GTID entries are in those binary logs. We can compute the difference — the current state minus the entries in those binary logs — and predict and project the point in time where we land. Does that make sense?

At this point we can just do a RESET MASTER and set gtid_purged to that value. We erase MySQL's memory, and it thinks it's now 1995, and it's up and running, and it's able to CHANGE MASTER TO with MASTER_AUTO_POSITION=1, join the replication stream — a little bit of replication lag, whatever — but then it's healthy. It can replicate, it's fine, it's in sync.
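Stitched together, the core of the rewind looks roughly like the sketch below. This is my reading of the steps just described, not the actual gh-mysql-rewind script: the binlog names, the target GTID set, the MariaDB binary path, and the fixup.sed standing in for the awk/sed hacking are all placeholders.

```bash
#!/bin/bash
# Rough sketch of the rewind core described above (not the actual tool).

datadir=/var/lib/mysql
bad_binlogs="mysql-bin.000623 mysql-bin.000622"   # binlogs containing the bad GTIDs, newest first
landing_gtid_set="00020192-1111-1111-1111-111111111111:1-5000000"  # predicted point in time

# 1. Locate the bad GTIDs: every binary log begins with a Previous-GTIDs event,
#    so parsing each file's header tells us which GTIDs live in which binlog.
for f in ${datadir}/mysql-bin.[0-9]*; do
  echo "== ${f}"
  mysqlbinlog "${f}" | grep -m1 -A1 'Previous-GTIDs'
done

# 2. Generate and apply the anti-binary-logs using MariaDB's mysqlbinlog --flashback,
#    newest first. MariaDB's mysqlbinlog doesn't speak Oracle GTIDs, hence the
#    placeholder fixup.sed standing in for the GTID-injection hacking described above.
for b in ${bad_binlogs}; do
  /path/to/mariadb/mysqlbinlog --flashback "${datadir}/${b}" \
    | sed -f fixup.sed \
    | mysql
done

# 3. Tell the server where in time it has landed, then rejoin the replication stream.
mysql <<SQL
RESET MASTER;
SET GLOBAL gtid_purged = '${landing_gtid_set}';
CHANGE MASTER TO MASTER_HOST = 'healthy-master', MASTER_AUTO_POSITION = 1;
START SLAVE;
SQL
```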
Okay, a few limitations, but we have more to talk about. The limitations are: you cannot rewind DDL. It can be done manually, it would be painful, but if someone ran an ALTER TABLE during that time, that doesn't roll back; it's a mess. The flashback mechanism doesn't support JSON or POINT or the newer data types, so that won't work. We also roll back a little bit more than we need to, because we roll back complete binary logs. It's much more convenient, and I think a bit safer, to do that, but it gets us farther back in time than we strictly need, right? We go way back, which means we de-apply a lot of transactions that we don't really need to de-apply, but then we need to reapply them, so it takes more time. Currently the tool runs on each and every server, but yesterday I got some nice comments on how that can be improved.

So the big question now is this. Look, we took two different tools that don't talk to each other and don't speak the same language. We have Oracle and GTIDs, we have flashback that knows nothing about those GTIDs. We threw in some awk and sed scripting to make it look as if they talk to each other, threw that at MySQL, and then did some mathematical computation to predict the time at which we will land and say, yes, the data matches that time. Does that make you very confident? Would you run this in production? Sure. Whatever. So yeah, we actually have this being tested in production continuously, daily, as we speak right now. This is being tested in production and continuously reporting success or failure — hopefully success.

So I'd like to illustrate how that works. The basic idea is that ideally I would like to checksum the data on the server, do something bad, rewind, checksum the data again, and find the exact same checksum. Makes sense? There are two problems with that. First, my dataset is too big; checksumming the entire dataset would take days, and that's not very friendly for testing. So if something does go wrong, it would take me a couple of months just to figure out what happened. The other thing is that we're taking a checksum right now, but how can we predict that the time travel will return us to the exact same point? That's a bit of a problem.

We did figure this out, and very quickly it goes like this. We take a replica. It's a testing replica; it's not serving production traffic, but it is replicating in the cluster. We stop replication; we issue a CHANGE MASTER TO, which resets the relay logs — it clears the relay logs and rotates into a new relay log; we flush the binary logs and rotate into a new binary log. Everything is clean. We call this point zero. We grab 30 seconds' worth of replication from the master, from production, but we don't apply it: we only run the IO thread, not the SQL thread. Now our relay logs are full with real, actual production data, 30 seconds' worth. The next thing we do is parse these relay logs and figure out which tables are actually being affected in those 30 seconds. These are not all of GitHub's tables; we can reduce the problem because we don't need to checksum tables that aren't being touched right now. So we analyze the tables. That's also too big, so we only take reasonably small tables and one random big table, which still leaves us with like 30, 40, 50 tables each run. We decide, okay, these are the tables being affected in the relay logs, and these are the tables we're going to checksum; that's going to take between minutes and a few hours. We take a checksum. The next thing we do, we kick the SQL thread, we apply those changes on the server, which generates binary logs. Okay, so far so good. Now we've applied 30 seconds' worth of production data. Next, for each of those tables we listed before, we randomly delete 10 rows. We just do damage, right? (The whole cycle is sketched below.)
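A rough sketch of one test iteration on the testing replica, covering the setup just described and the damage-and-verify steps that follow. These are hypothetical paths and commands to illustrate the flow, not the actual production script; in particular, the relay-log parsing, the `id` primary key, and the gh-mysql-rewind invocation are assumptions.

```bash
#!/bin/bash
# Sketch of one continuous-test iteration (not the actual script).

datadir=/var/lib/mysql

# Point zero: stop replication, clear the relay logs, rotate the binary logs.
mysql <<'SQL'
STOP SLAVE;
CHANGE MASTER TO MASTER_HOST = 'production-master';  -- issuing CHANGE MASTER TO also purges relay logs
FLUSH BINARY LOGS;
SQL

# Pull ~30 seconds of real production traffic into the relay logs without applying it.
mysql -e "START SLAVE IO_THREAD"
sleep 30
mysql -e "STOP SLAVE IO_THREAD"

# Find the tables touched in those relay logs (placeholder parsing of Table_map events).
tables=$(mysqlbinlog ${datadir}/*relay-bin.[0-9]* 2>/dev/null \
  | awk '{ for (i = 1; i <= NF; i++) if ($i == "Table_map:") print $(i+1) }' \
  | sort -u)

# Checksum the affected tables at point zero.
for t in $tables; do mysql -Nse "CHECKSUM TABLE ${t}"; done > checksums.point_zero

# Apply the 30 seconds of production data, then wait for the SQL thread to catch up.
mysql -e "START SLAVE SQL_THREAD"

# Deliberate damage: delete 10 rows from the tail of each affected table
# (assumes an `id` primary key for illustration), then restart replication,
# which breaks within seconds because production keeps touching those rows.
for t in $tables; do mysql -e "DELETE FROM ${t} ORDER BY id DESC LIMIT 10"; done
mysql -e "START SLAVE"

# Rewind (hypothetical invocation of the tool), then verify against point zero.
gh-mysql-rewind
for t in $tables; do mysql -Nse "CHECKSUM TABLE ${t}"; done > checksums.after_rewind
diff checksums.point_zero checksums.after_rewind && echo "rewind verified: checksums match"
```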
So we do damage, whatever. We try to delete rows from the end of the table, because those are the high-contention rows that are likely to be touched again shortly, so the damage is bigger. And, not strictly necessary but for fun and glory, we also say: okay, now that we've corrupted the data, let's start replication and try to apply even more production data. Within a second or two this breaks, because there will be a replication problem, right? We just deleted a row, someone else is trying to update it, replication breaks. We've made a mess, everything is broken, and this is where we throw gh-mysql-rewind into the game. It uses the algorithm we described earlier, it fixes the situation, and it moves us back in time. Now, we intentionally rotated the logs, and we intentionally grabbed a reasonable amount of production data, such that we can predict it will return us to exactly point zero — the exact point where we took the initial checksum. Whether our prediction was correct, we'll see. Once it has rewound us, we checksum the tables again. We expect to find a 100% match with the original checksum that we took before the test began. And we do. This actually works, and it runs continuously, all the time. The test is complete, we put the replica back into the cluster, it catches up on replication lag, and we do it again and again and again. This gives us the confidence that the tool is actually doing what it's supposed to do.

Cool. We don't really want to use this tool, ever. We don't want to be in that situation ever again; it was not a good place to be. So we're really hoping not to, and we are moving in multiple directions to never be in that situation again, to have plans B, C, D, E, and F to mitigate the problem. But if we do, we expect this tool to help us recover a server within minutes, as opposed to many, many hours. And the server is up and the cache is warm, so it's going to be very good on replication lag, et cetera. Those are our expectations, for a situation we hope never to see again.

The status right now: this is a shell script, just about 200 lines of shell, using Oracle's mysqlbinlog, MariaDB's mysqlbinlog, and some shell and awk. At some point in the future, I'd like to incorporate this into orchestrator, which is a little bit of a problem because I'm using external tools to make it happen. And fortunately there's ongoing work — I don't know, safe harbor, et cetera — but I happen to know, and this was presented yesterday, that Oracle have begun some work on this, so you would be able to ask: which GTID entries are in my binary logs? And to say: undo the transactions of a given GTID set. Any client could do that, and that would be really awesome. So I really want to ask Oracle to keep working on that. Please, please, that would be really nice. And with that, I'm done. Do I have time for a couple of questions? Yes. Questions, yes, sir.

Thank you. Well, what I did was roll back everything. Sorry — okay, the question was: if we have a conflict, what about conflict resolution? If we had two transactions updating the same table. What I did is I magically brought up an image from the past and reset MySQL's memory and said, yeah, you're in 1995, right? All memory of those two transactions is lost; they do not exist anymore. The binary log is linear, right? It's serialized.
It was written in serialized form originally, and we de-applied it in serialized form. So there is no concurrent transaction resolution needed here, so to speak, okay? Yes, sir? Yes, yes. Okay. So the way you solved it in Replication Manager is with ZFS snapshots. Okay, so the gentleman says they already tried this with Replication Manager, which is also a failover mechanism, and there was feedback saying, yeah, DDL doesn't work, and you need ROW binlog format, and so they're using filesystem snapshots to fix the problem — ZFS. Okay, ZFS. Cool. Yeah, we use ROW binlog format, and we're willing to give this a try. So, yeah. Last — no, last question, please. Is there any way you could have done all this magic without GTID? I don't know; because we do use GTID, my state of mind was, let's try and use that. I don't know. Okay, thank you very much.