I'm Jean-François Gagné. You can read my bio on the FOSDEM website. I want to spend as much time as possible on the subject and not on me, so I'll start right away.

The title of the talk is "The consequences of sync_binlog != 1". Actually, the full title should be "The consequences of sync_binlog != 1 and of innodb_flush_log_at_trx_commit != 1", sorry, because one without the other doesn't really make sense: if you still sync InnoDB and you don't sync the binlogs, you're still paying the cost of a sync. There might be edge cases, but that's what we're talking about. In the talk I will use trx_commit for short, because the full name is just too long to write on slides.

In the abstract of the talk I say that it's faster, but we'll see by how much. Then we'll talk about what replication is and what those two parameters are. I'll talk about avoiding setting sync_binlog to a value different than one, then the consequences, mitigating them, and closing comments. This talk is mostly about MySQL 5.6 and 5.7; most of it also applies to 8.0. Some things are specific to MariaDB, and I will explicitly mention them. And don't waste time taking pictures of the slides: they are online, and you can follow them right now on your phone if you're not able to see the lowest part of the screen.

So, faster by how much? We're in benchmark land here. I used a sysbench single-threaded insert benchmark without secondary indexes, because I really wanted to test the syncing part of MySQL, on Google Cloud with SSD persistent disks and MySQL 5.7.26. With sync_binlog = 1 and trx_commit = 1: 200 transactions per second, single-threaded. This is not a lot. On a slave it's a little more, because here my client is remote from the master, so there is a round-trip time between the client and the database; on a slave there's no such thing. If I reduce durability, so sync_binlog = 0 and trx_commit = 2: 3.7 thousand transactions per second on the master, 18 times faster, and 7 thousand transactions per second on a slave, 30 times faster. So obviously, for people with a replication lag problem, this is a very tempting configuration to solve that problem, but it has consequences, and we'll see about them. The whole benchmark setup is discussed on one of my blog posts. So this is the speed difference we're talking about.

Quick overview of replication: a transaction commits on the master, and it goes into InnoDB and into the binary logs. Those binary logs are downloaded by the IO thread (sorry, it's the lowest part of the slide), which connects to the master, downloads the binary logs, and puts them in the relay logs. The SQL thread executes the relay logs and puts the data in InnoDB, and eventually the slave produces binary logs if log_slave_updates is enabled. sync_binlog and trx_commit sit right there: they are parameters controlling what happens in InnoDB on commit and what happens to the binary logs on commit, on the master and also on the slaves.

So sync_binlog can be set to anything from zero to, I guess, the size of an int. With sync_binlog = N, the binlogs are flushed to disk every N transactions. So if sync_binlog = 1, the binary logs are flushed to disk after each transaction, which means the data is actually on persistent disk, not in an operating system buffer.
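To make the two settings concrete, here is a minimal sketch of inspecting and changing them. These are the stock MySQL variable names, and both are dynamic globals; the reduced-durability values are the ones benchmarked above:

```sql
-- Inspect the current durability settings:
SHOW GLOBAL VARIABLES WHERE Variable_name IN
  ('sync_binlog', 'innodb_flush_log_at_trx_commit');

-- Full durability (the safe configuration discussed in this talk):
SET GLOBAL sync_binlog = 1;
SET GLOBAL innodb_flush_log_at_trx_commit = 1;

-- Reduced durability (the tempting, much faster configuration):
SET GLOBAL sync_binlog = 0;
SET GLOBAL innodb_flush_log_at_trx_commit = 2;
```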
If you set sync_binlog to zero, the binlogs are written to the operating system, but you don't know if they are on disk. They will actually be flushed at every binlog rotation; there are some problems with this, but I won't go into the details. So this parameter controls whether the binlogs are persisted to disk or not.

trx_commit can be equal to one, two or zero. With one, the redo log, so the log of InnoDB, is flushed to disk after each transaction. With two, the redo log is written to the operating system, so the data sits in an operating system buffer, but it is not flushed to disk at that point; it will be flushed to disk every second as InnoDB housekeeping. And with zero, which we're not really talking about, we even avoid the system call at each transaction: the data stays in an InnoDB buffer inside MySQL and is sent to the operating system once in a while, every second. So that's what those two settings control.

I mentioned "flushed to disk". This means that the data in an operating system buffer is actually persisted to disk, and this is not fast. On old spinning-rust disks, a flush took 18 to 25 milliseconds, which means we could do only about 40 flushes per second; 40 transactions per second is not a lot. On consumer-grade SSDs, so what you probably have in your laptop right now, it is very variable: a flush can take up to 10 milliseconds, and on some SSDs 0.5 millisecond; just run ioping on your laptop and you'll know how much time it takes. Enterprise SSDs, still local disks, are around one tenth of a millisecond, and if you have a RAID controller cache with battery-backed RAM, it's around 0.04 millisecond. But we don't only have local disks. With network disks, Fibre Channel, iSCSI, any type of SAN, you add the network latency: 0.5 to 1 millisecond depending on the efficiency of your network. And in cloud environments disks are remote, so on GCP or AWS it's between 0.5 and 1 millisecond to reach the storage. On AWS you can have local SSDs, which are efficient and fast. So flushing to disk is not fast.

And what does this mean? It means that with sync_binlog = 1 and trx_commit = 1, every transaction is durable: after each commit, the transaction is on disk, so we don't lose anything. But if those are set to 0 or 2, then if mysqld crashes the data is not lost, because the operating system is still alive and has the data in its buffers; but if the operating system crashes, we lose transactions, and InnoDB and the binary logs may also end up out of sync. So here, if we run transactions A, B, C, D and so on, maybe the binary logs are synced up to E and InnoDB is synced up to K; then we run more transactions and we crash, and at recovery the binary logs contain up to E and InnoDB contains up to K. So we have a consistency problem here. There is also the other scenario, where the binlogs are ahead of InnoDB; it's less probable, but it can also happen. I'll skip this slide.

MySQL's defaults in 5.7 and 8.0 are now safe. MariaDB's defaults are not safe; the choice here is theirs, and I'm not sure I agree with it. So we would like to avoid setting sync_binlog to something different than one. The solutions here are either to get faster disks, or to run things in parallel so that a single sync persists many transactions at the same time. That is binary log group commit on the master, or parallel replication.
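The talk doesn't name specific knobs at this point, but for reference, stock MySQL 5.7 exposes a pair of group-commit tuning variables that let a single binlog sync cover more concurrent transactions. A hedged sketch, with purely illustrative values:

```sql
-- Delay the binlog sync slightly so that more concurrent transactions
-- can share a single flush (values are illustrative, not a recommendation):
SET GLOBAL binlog_group_commit_sync_delay = 10000;         -- wait up to 10 ms (unit is microseconds)
SET GLOBAL binlog_group_commit_sync_no_delay_count = 100;  -- or until 100 transactions are waiting
```

Note that this only helps when there is concurrency: a single-threaded workload still pays one sync per commit, which is exactly what the benchmark above shows.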
So if you run more than one transaction at the same time on slaves, maybe you'll be able to commit many things at the same time. Let's explore this a little. Remember the numbers: very low transaction throughput with sync_binlog = 1. As we use more and more threads, we get a better transaction throughput, and with sync_binlog = 0, even single-threaded, we have a more decent transaction throughput. This is not with replication; this is with a client doing round-trips, so around 3000. But this curve grows much faster than the other: even if we're able to get a better transaction throughput multi-threaded on the master, we need one or two hundred threads to beat sync_binlog = 0. So when you're looking for the best transaction throughput, sync_binlog = 0 is still better. This is with four vCPUs on Google Cloud, and this with 16, so it scales a little.

With replication, MariaDB has something called slave group commit, which is pretty exciting: it basically gets the transaction throughput of sync_binlog = 0 with many threads, without the consequences. You can look it up; I have a link down there. And parallel replication, on the insert benchmark I'm running, which is the worst benchmark for parallel replication because there's a lot of contention: with sync_binlog = 1 we scale, but sync_binlog = 0 is still better. So we're still tempted: if we want the best transactions per second, sync_binlog = 0 is still our best choice. Those are all the graphs on the same page.

So now, the consequences; now we get into the real subject of the talk. This is a copy of the previous slide, just as a reminder: transactions A, B, C, D; the binlogs are synced up to E, InnoDB is synced up to K, and then we crash. Maybe it's time for one question here before I move forward. Is everything clear so far? Okay, so let's see about the consequences.

If the master crashes (and we're only interested in operating system crashes: if mysqld crashes we don't lose anything, everything is in an OS buffer) and we're using legacy replication, so file and position, then the slave is pointing into a binary log that will vanish. After the crash, mysqld creates a new binary log, and if we keep writing, data is appended to the new binlogs, but the slaves are pointing to something that is now gone. During the crash the slaves are not able to connect to the master, but after recovery the slaves reconnect, they point at something that doesn't exist, and replication breaks. That is obviously not cool, because now you lose all your slaves. You can still write to your master, which is kind of inconsistent: InnoDB and the binary logs are not in sync, you lost some data, and the slaves actually have more data than the master, because they didn't lose the last transactions; the master did. So you are not in sync, and with broken replication this is really, really not cool. And if you have lagging slaves that are still pointing into an old binary log, they will keep going, but they will have corrupted data. So everything is out of sync; pretty bad.
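For context, "file and position" replication means each slave literally names a binary log file and offset on the master. A minimal sketch, where the host, file name and offset are all hypothetical:

```sql
-- Legacy file/position replication: the slave points at one specific
-- binary log file on the master (all values here are hypothetical):
CHANGE MASTER TO
  MASTER_HOST = 'master.example.com',
  MASTER_LOG_FILE = 'binlog.000042',  -- this exact file is what vanishes
  MASTER_LOG_POS = 107;               -- when the master's OS crashes un-synced
START SLAVE;
```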
If you're running with GTIDs, let's say at the moment of the crash the master is at GTID 60 and the slave has executed up to 58, and then we crash. Now the master comes back up, and because the GTID state is recovered from the binary logs on the master, whose tail we just lost, the master restarts in the past, say at GTID 50, and it will write 50, 51, 52. But the slave already has 1 to 58; those are the old transactions, and now we have new transactions that have the same IDs. So two things can happen here. Either the master is able to write up to GTID 58 fast enough, before the slaves reconnect, and then everything will appear to work but we will have corruption; or the slave is faster, it reconnects while GTID 58 doesn't exist yet on the master, and then replication breaks, because the slave has something that doesn't exist on the master. So that's another scenario where things break, and that's what I just explained. That's the case of an operating system crash on the master.

If a slave crashes and we're using file and position replication: if we're using crash-safe replication, the replication position is in InnoDB, so InnoDB will do crash recovery, that information will be consistent, and the slave is able to restart and reconnect to the master. It still has binary logs that we cannot trust, because they're now out of sync, but there is no data corruption in InnoDB in this case.

If you're running with GTIDs and binary logs disabled on the slave, which is a feature of 5.7, the GTID position is also stored in InnoDB, so there's no problem here: the data is safe. The binary logs on the slave are obviously out of sync with the data, but the slave will reconnect to the master and its data will be consistent. But if you're running a slave with GTIDs and binary logs enabled, in 5.6 or 5.7, the GTID position of the slave is in the binary logs; we crash, we lose binlogs, we're out of sync between InnoDB and the binary logs, and we start replicating at the wrong place. So again, either replication breaks or we get silent data corruption. Those are the consequences of running with these parameters.
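For reference, the crash-safe replication setup mentioned above, with replication positions stored in InnoDB tables rather than in files, corresponds to stock MySQL 5.6/5.7 options. A minimal sketch of checking them (relay_log_recovery can only be set at server startup, for example in my.cnf):

```sql
-- Check whether this slave is configured crash-safe:
SHOW GLOBAL VARIABLES WHERE Variable_name IN
  ('master_info_repository',     -- TABLE => IO thread state in mysql.slave_master_info
   'relay_log_info_repository',  -- TABLE => SQL thread position in mysql.slave_relay_log_info
   'relay_log_recovery');        -- ON    => relay logs are discarded and refetched after a crash
```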
Something changed in MySQL 8.0: the GTID position goes into InnoDB. The details are unclear to me; there is some hardcore reading about this if you want more details. And now we are at mitigating all this.

If you're running with sync_binlog not equal to one, there is one thing you need to remember: either on the master or on a slave, after an operating system crash, you cannot trust the binlogs. Obviously, you didn't sync them, so you cannot trust them. So what can you do in a situation where you cannot trust your binlogs?

On a master, after restarting MySQL after an operating system crash, what I do is make sure the master restarts with offline_mode = ON, so no clients connect to it, no slaves, no clients, and then I decide as the DBA what to do. The best thing is to fail over to a slave, but if I do not automate failover, if this is manual, I do not want clients to write to my master and I do not want slaves to connect to it. So I arrive as the DBA and I check: if this was an operating system crash, I need to fail over; if it was just a mysqld crash, then I set offline_mode = OFF and continue, because at that moment I didn't lose anything.

On slaves, if MySQL restarts after such a crash and you have binary logs on the slave, you need to purge them, completely delete the binary logs, because there is now potentially a hole in them. If the slave will eventually be a candidate master, you cannot trust its binary logs, so wipe the binary logs on a slave. If you're running with GTIDs, a RESET MASTER will erase your GTID position, so you need to restore it afterwards. Intermediate masters are both master and slave, so you have to do both things.

When you're running with GTIDs on a slave with sync_binlog not equal to one, GTID replication is not crash safe, so normally you would have to restore a backup, and restoring a backup is very annoying for a DBA. There is a way to avoid this, because the file and position stored in a table can be trusted at that moment. So if you're running crash-safe replication, not with GTIDs but with file and position, and either single-threaded or multi-threaded with slave_preserve_commit_order, there is some voodoo you can do to avoid restoring the backup. The idea is this: when the slave restarts, make sure that replication doesn't start; it cannot start at this moment, because the GTID position cannot be trusted, but the file and position can be trusted. You note the gtid_executed of that slave, then you wipe the binary logs with RESET MASTER, because the binary logs on that slave cannot be trusted, then you restart replication with file and position, which is trusted, and then you need to restore the GTID position. That part you'll have to figure out by yourself; it's left as an exercise, and if you're not able to figure it out by yourself, you shouldn't do this voodoo. That's how I run my slaves with sync_binlog not equal to one.

So, in conclusion: we saw the consequences, we understood them, we understood how to avoid them and how to mitigate them. I have a guess, an educated guess, that more and more people will run with sync_binlog not equal to one because of the cloud: syncing in the cloud is very slow, so you need to run with sync_binlog not equal to one if you want decent TPS in the cloud. And so, to be fully cloud-ready, MySQL should make it easier for us DBAs to run this way in the cloud.
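To make the shape of that voodoo concrete, and only its shape, since the exact GTID value to restore is deliberately left as an exercise, here is a rough, hypothetical sketch under the stated assumptions (crash-safe file/position stored in tables, replication not auto-started, single-threaded applier or slave_preserve_commit_order):

```sql
-- 1. Before touching anything, note the slave's GTID state:
SELECT @@GLOBAL.gtid_executed;

-- 2. Wipe the untrusted binary logs; note that this also erases the
--    binlog-based GTID state:
RESET MASTER;

-- 3. Make sure replication resumes from the trusted file/position
--    stored in InnoDB, not from GTID auto-positioning:
CHANGE MASTER TO MASTER_AUTO_POSITION = 0;

-- 4. Restore the GTID state (figuring out the exact value to use here
--    is the part the talk leaves as an exercise):
-- SET GLOBAL gtid_purged = '...';

-- 5. Restart replication:
START SLAVE;
```

Treat this purely as an outline; as the speaker says, if any step here isn't obvious to you, you shouldn't do the voodoo.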
So basically, I would like automatic offline_mode after an operating system crash, and automatic skip_slave_start after an operating system crash. And I have two rants, three rants, about my favorite feature: GTIDs make things very complicated. There is a GTID state in a table, there is a GTID state in the binary logs, and now there is a GTID state in InnoDB, not in the redo log, in InnoDB, so there is cleanup to do here. And one last thing about GTIDs (somebody just left, but still): this GTID state is just horrible, and we need a way to clean it up. Sorry, yes. So, I have links in my slides, and the classic things: my employer is hiring, please rate my talk, and I have time for questions, three minutes. Yes, Marco?

So the question is: did I do tests about not syncing InnoDB? The context of the question is that in some next versions of MySQL and MariaDB there will be features for keeping consistency without syncing InnoDB, so removing one sync instead of having two. That would roughly double my results: 200 transactions per second would become 400, which is still very far from 3000. I haven't formally tested it, but that's my educated guess. So it's a nice improvement to remove a sync in InnoDB, but personally I think it's not a game changer; we're just multiplying by a constant, by two. If I had to choose between this feature, not syncing InnoDB while keeping things consistent, and making MySQL or MariaDB easier to run with sync_binlog = 0, I would choose the latter. But different people work on different parts of the server, so it's still an improvement; I just don't think it's a game changer. Other questions? Yes?

So the question is about keeping sync_binlog equal to one: currently there would still be inconsistencies between the binlogs and InnoDB, and this would remove one sync, doing one sync on transaction commit instead of two, which in the end is not a game changer either, because we're still waiting for that sync. I haven't done tests or thought much about it, but my intuition is that in the current context it's not super useful. Any other questions? If I have time, one last question, yes?

(Question: would you go for maximum persistence everywhere, or, if you don't have the throughput, accept that it's a crash and fail over?) So the question here is: instead of taking each server and being very careful about the consistency of each server, think about MySQL as a full distributed system, and if the master crashes, fail over to a slave. Yes, that is the solution in the case where you have an operating system crash on a master: if you're running with sync_binlog = 0, you need to fail over to a slave, because you potentially lost data and your slaves are corrupted. The DBA needs to think not about one server but about the whole distributed system, because replication is a distributed system, so we have to fail over. I'm not sure I have time for another question, but I will be just outside the room if you have questions for me, ready to answer them. Thank you very much.