and how you can prevent lag on MySQL. How many people are familiar with MySQL? Okay, great. All right, so I'm going to do a quick intro, then I'm going to talk about the origin of the idea and the technical challenges I faced building this little script. Then we'll go through the logic flow of the script, I'll do a brief demo, and we'll finish with possible improvements to the script.

So I'm a senior MySQL DBA. I've been a MySQL DBA for about 10 years, and I have some experience in database modeling and database architecture. I've worked for a few large companies, such as Vita, Rubicon Project, Shukar, and ZipRecruiter, and now I work for Wag. Wag is hiring, so if you're interested, go to wagwalking.com, scroll all the way down to the bottom of the page to the careers section; they're hiring because we're growing. I just started working there about a week ago, and I was hired to scale their MySQL stack.

Now, the origin of the idea. When I was working at Rubicon Project, we had masters and slaves, and oftentimes we had issues: the slaves would stop, and we had to figure out what went wrong. So I started using mysqlbinlog to look at the binlogs. Sometimes you need to find out exactly where the error occurred, so I started using that tool. It was very much for replication debugging: going into the binlogs, looking at potential issues, where to restart the slave, et cetera (there's a sketch of those commands below).

People are familiar with the binlogs? The binlogs are used to replicate from a master server to a slave, and they contain all the transactions: every insert, update, and delete that occurs on a table gets stored in the binlog. There are different formats: statement, row, or mixed. The most common one is row format, because it's more efficient and more precise; there are no issues with things like timestamps, et cetera, and when you do large updates it's much faster. It doesn't generate as much lag; it helps prevent lag on the slaves.

So I started doing that binlog analysis, and I looked at it and said, well, maybe I can summarize the data and find out how many inserts, updates, and deletes that server is producing and going through each day. That allowed me to detect surges. Sometimes, for no apparent reason, a slave would start lagging. There were no specific queries running, but there would be a short spike on the slave, and it was not easy to find out why. Using the binlogs, I was able to analyze it and see that there was a big update going through, because when you replicate in row format, you don't see anything on the slave: with statement replication you see the statement running on the slave, but with row format you don't see anything, so it's a little more difficult. That's why, by going into the binlogs, you really know what's going on.

So it really helped me find the replication lag, pinpoint the table, and focus on which table to reduce the number of writes on at a specific time of day, because when your traffic goes up and you have batch jobs running that insert a lot of data by joining other tables, you can easily insert thousands of rows into your database. And it helped plan for scalability. We'll go through this toward the end; I'll explain how being able to look at the binlogs can help you scale MySQL.
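To give a rough idea of that kind of binlog debugging, here is a minimal sketch of the mysqlbinlog invocations involved; the binlog file name and the date range are placeholders, not values from the talk.

```bash
# Sketch only: decode a row-format binlog into readable pseudo-SQL.
# The file name and time window below are placeholders.
mysqlbinlog --base64-output=decode-rows --verbose mysql-bin.000123 | less

# Narrow the output to a suspect window, e.g. around a replication error.
mysqlbinlog --base64-output=decode-rows --verbose \
  --start-datetime="2018-07-10 14:00:00" \
  --stop-datetime="2018-07-10 14:05:00" \
  mysql-bin.000123 > suspect_window.sql
```

With statement-format binlogs the events already appear as plain SQL; the decode options are only needed to make row-format events readable.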
So I had questions sometimes from developers or from my supervisors, asking me: okay, how many transactions do we have on MySQL? You can use Performance Schema, which will give you some metrics, but across time it's a little more difficult, because Performance Schema is limited: you cannot store much history. You can record it, but it's a little cumbersome. Performance Schema also adds some load on MySQL, whereas the binlog is already being written for replication, so analyzing it has very little impact on the server.

Now, how many inserts on one table between this time and this time? That's a little more difficult, because now we're looking at a precise time range: I want to know, between 2 o'clock and 3 o'clock, how many inserts we've been doing on that table. Do we delete data on that table? No need to look at the code: you don't need to go into your PHP or Java or whatever script and grep it to see if there's any delete on that table. Just looking at the binlog, you'll know which table is the busiest and which table is going to grow the fastest over time, because if you have a lot of inserts on a table, that table is going to grow and it's going to need a lot of space and memory, so you're going to have issues scaling. And the trend: trends are very important. When you want to scale, when you're building an application, you want to make sure your database stays stable, that it grows slowly, that you don't have spikes or big changes that affect your application.

For example, at Wag, when I started working, they had lag, and when you have lag you cannot really use the slave to run your select statements, because there's a delay between the master and the slave. Slaves are meant to scale your reads; you move all your select statements to the readers. By looking at the binlogs, you can detect this; you can see quickly what's happening.

This is an example: once you start analyzing the binlogs and storing the results into a table, you can start graphing and you can see quickly where you need to focus. On the left side, I blanked it out because those were real table names, but on the left side you would have the list of tables. Here we have TPS, transactions per second; really it's the number of rows that have been written per second. This is where you need to focus. Quickly: the blue is delete, the orange is insert, and the red is update. That table is mostly updates. This one, look at this: three quarters of it is updates and one quarter is inserts, et cetera. At a glance you can say: I'm going to focus on this block, because these are probably the tables whose transactions are generating a lot of lag.

Then you can look at it across time: zero to 24 hours during the day. How do our transactions behave? Where are our peaks, and what's happening during that time? Same thing: updates and inserts around 9 to 10, you see it climb here and then it goes down, and during the night, nothing. So again, it allows you to monitor, and by doing trending you can quickly see changes across time.

Now, something I wanted to point out is that I use a technique I call row weight. I take the data size divided by the table rows to get the row size, then the index size per row, add them together, and multiply by the transactions per second. That gives me the weighted transactions, because some tables are small, a few columns, and some tables have large columns.
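As an illustration of that row-weight idea, a per-row weight can be approximated from information_schema; this is my sketch of the calculation, not the speaker's exact query, and the schema filter is just an example.

```sql
-- Approximate bytes per row (data + index) for each table.
-- Sketch only; the real script's query may differ.
SELECT
    TABLE_SCHEMA,
    TABLE_NAME,
    (DATA_LENGTH + INDEX_LENGTH) / NULLIF(TABLE_ROWS, 0) AS row_weight_bytes
FROM information_schema.TABLES
WHERE TABLE_SCHEMA NOT IN ('mysql', 'sys', 'performance_schema', 'information_schema')
ORDER BY row_weight_bytes DESC;
```

Multiplying that per-row weight by each table's rows per second from the binlog summary gives the weighted view used in the later graphs.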
If you're doing a lot of updates on a small table, just a few rows of data each time, that's fine; it's going to be fast because you're not shipping a lot of data to the slave. But if your table is wide, if it has TEXT fields, VARCHARs, et cetera, and you're doing a lot of updates and inserts, all of that has to be shipped to the slave. That uses network, memory, processing; all of it counts, and that's where the slaves start lagging.

Another view: by schema. Which schema is doing the most transactions? Now, if you have only one schema, that won't help you much, but if you're hosting websites and you have hundreds of databases and you want to see which schema is consuming the most bandwidth or CPU, this will help. And if you add a new database, you see here nothing happens at first, and then slowly it starts going up, so you can also monitor how those new schemas are going to scale. Across time you can monitor this and already plan the growth. This one is the same thing, but using the weight, so it's a little different: now it's half inserts and half updates, because I applied the row weight to the tables.

So as I said earlier, the binlogs allow you to detect surges by using trends. The script analyzes the binlogs and summarizes them into a summary table. Then you can plug any tool into it: you could plug in Periscope or any graphing tool, or download your data and plug it into Tableau or Excel. Then you can start graphing and see how your database is behaving, and you can detect changes across time.

Prod releases: you do a prod release, you push some code into production, and suddenly your CPU goes up. Everything looks normal, but there's more CPU usage, more network usage. By looking at the binlogs, you can see if any additional inserts, updates, or deletes have been added that are impacting the database. Am I doing a lot of updates on that table? Did I put a script into production that is not right, that is updating too many rows? Batch jobs, same thing: you deploy a batch job on production and you can affect production, you can affect the master and your slaves, because you have a script that runs and writes too many rows. Large ALTERs: when you alter a large table, there are techniques where you create a temporary table with the proper schema and migrate the data into the new table. This will show up in the binlogs, because you're going to be inserting a lot of data.

This is another type of graph you can do: you can compare. Today, this is July, how are we doing? And here we can see that we have more transactions per hour than one week ago. And you can compare at the table level, something you cannot do with Performance Schema or other tools (see the query sketch below). Here it's per table: you can see that table where we have a big problem. Huge. In one week, we went from here to here; something happened. So what you can do is query the binlog data, see which statements caused this, and then fix it, because everything is stored: all the inserts and all the updates against a table are stored and summarized, and then you can query this.

The row replication format is the most common one. When you use GTID, it allows you to do multi-threaded replication, so you can increase the speed of replication, and when you use the logical clock, it's a little better.
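To make that per-table comparison concrete, a query along these lines could do it; the binlog_summary table and its columns (binlog_start, table_name, total_rows) are assumed names for illustration, not the speaker's actual schema.

```sql
-- Compare today's per-table row counts against the same day one week ago.
-- Sketch only, against an assumed summary-table schema.
SELECT
    table_name,
    SUM(CASE WHEN binlog_start >= CURDATE()
             THEN total_rows ELSE 0 END) AS rows_today,
    SUM(CASE WHEN binlog_start >= CURDATE() - INTERVAL 7 DAY
              AND binlog_start <  CURDATE() - INTERVAL 6 DAY
             THEN total_rows ELSE 0 END) AS rows_same_day_last_week
FROM binlog_summary
GROUP BY table_name
ORDER BY rows_today DESC;
```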
Aurora is getting a lot of attention on the market. They just released 5.7; I'm not sure if they support that yet. Aurora is not a silver bullet; there are issues with it too. You cannot scale indefinitely, you cannot just forward everything to it; you still need to optimize, there are some limitations. So by using the binlog, you can look at the problematic queries or transactions that you're running against Aurora.

So: large updates and deletes on the master will generate lag on the slave. There are a lot of transactions per second; which one is it? Which one is generating all those issues? And when many rows are affected, it becomes even more difficult, because with row replication, if you run one SQL statement, one update that updates a thousand rows, you will have a thousand rows in the binlog. That's why, by using the row-format binlog, you really see what's going on at the table level: how many rows are being shipped. With statement replication, running one statement on the master runs one statement on the slave; with row, it's really different.

On the slave, you can also turn on the binlogs, and if you do that analysis there too, you can see: okay, maybe there's some table I don't want to replicate to the slave. For example, a log table; I just need to have it on the master. So you'll save some bandwidth, and this will help you remove the lag. I mean, log tables should not be in MySQL, period; move that to S3. But maybe for some particular reason you need to have a certain table that is large and you don't need it on the slave: you can filter it. Or you can also scale by partitioning certain slaves: you can say those slaves are only going to replicate those tables.

So, planning scalability. We saw that we have a precise DML monitoring tool; we can see everything that's going on in the database, and we can monitor the surges that occur during the day. And scaling is not always about adding more machines; you can scale by reducing. That's something we did when I was working at ZipRecruiter. We had to scale 10x, that was the goal. But if you want to scale 10x, you can start by reducing 5x. Say you have a million transactions per day; if you divide that by 5, you have room, you immediately create room for growth, and you don't need to buy new hardware.

Now, the logic flow of the script (sketched below). First, download the binlogs. Then get the start and end date of each binlog, so you know which time frame that binlog corresponds to, because they get generated by size: usually you set a size, for example 100 megs, and your binlog gets created. Then you parse it, then you store the parsed DML, and then you keep track of which binlogs have been processed, so that you can rerun it every day at certain hours. So it's pretty simple, nothing complicated: download binlogs, get the start date, parse, store; last binlog, yes or no; go back. And it works. We're going to look at the script very quickly to see how this is done.

Now, the technical challenge was grabbing the binlog start and end date. That was not easy; I had to spend a lot of time figuring it out, but with some grepping it was possible. Parsing the binlog: I found a script online and then made it better. And if you have a long hostname, if you use Amazon, AWS or RDS, the endpoint is usually very long, and that's not very elegant.
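Here is a minimal sketch of the flow described above, assuming a placeholder master address, destination directory, and tracking table (binlog_stats.binlog_process) that are my names, not necessarily the speaker's:

```bash
#!/bin/bash
# Sketch of the logic flow only; not the speaker's actual script.
# Credentials are assumed to come from ~/.my.cnf or a login path.
MASTER="10.0.0.10"                 # placeholder master host
DEST_DIR="/var/tmp/binlogs"        # placeholder download directory

mysql -h "$MASTER" -N -e "SHOW BINARY LOGS" | awk '{print $1}' |
while read -r BINLOG; do
    # Skip binlogs already processed (tracked in a small MySQL table).
    DONE=$(mysql -N -e "SELECT COUNT(*) FROM binlog_stats.binlog_process
                        WHERE binlog='$BINLOG' AND hostname='$MASTER'")
    [ "$DONE" -gt 0 ] && continue

    # Download the binlog locally in raw format.
    mysqlbinlog --read-from-remote-server --raw \
        --host="$MASTER" --result-file="$DEST_DIR/" "$BINLOG"

    # Next steps: extract start/end dates, parse the DML counts,
    # load them into the summary table, mark the binlog as processed.
done
```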
You need to parse it. If you use an IP address, you also need to parse that, because you're going to have dots. So there's some parsing that needs to occur. I'm going to look at the code very quickly to give you an overview.

All right. So it's a very simple script, nothing fancy; it's in Bash. The first thing I do is create the table that's going to store the counts, if it doesn't exist. I clean up the binlogs, because sometimes a binlog hasn't been closed yet, and when that happens the time range comes out negative, so I have to get rid of it and reprocess it. Then I have a loop: I do a SHOW BINARY LOGS, that's the MySQL command to list all the binary logs on the master. I get the list and I do a count to check whether the binlog has been processed: this is against the binlog-process table, where the binlog equals this binlog and the hostname equals this hostname. And then I start the processing.

You can download the binlog from a remote server using mysqlbinlog with --read-from-remote-server. So you do that, it connects, the binlog gets downloaded, and here I save it locally. Because it's in raw format, you have to use --raw, and then you store it into a directory. Then I extract the start date by doing some grepping, and the binlog end date. And then this is where the magic happens: this is the part of the script that actually parses (see the sketch below). You see here update, insert, delete, replace, create, alter, drop; you can add more commands if you want to, it's not limited. Then I load the data into a count table: I truncate it first and load it. Then I store it into a summary table, and after I update the processed table, I clean up my files.

What you can also do is, as soon as you download your binlog, copy it to S3 for backup. This way you have a copy of it; you can do a lot with this. Otherwise the binlog eventually gets purged on the server. So it's very simple, and I didn't spend too much time developing it; it's very basic, but it works. I schedule it, and it downloads and processes the binlogs.

So let's get to the demo, then. Here I run it and pass the IP address. It checks the processed list: this binlog has already been processed, so it goes to the next one, and the next one, and it finds the one that hasn't. Now it's processing: it's getting the start date of the binlog, the end date, then it's going to run the parsing and store it into the table.

Those binlogs are stored like this. First you have the binlog-process table, which tells you the start date and end date of the binlog and the binlog name. That's also useful: think of it six months from now. You have backed up your binlogs and you say, okay, I want to find some data from six months ago. You can look at this table and see what was happening and what we were inserting at that time; you can refer to the binlog that you stored in S3, bring it back, and reprocess it. So it's very useful.

And this is where you summarize all your data: the hostname (I removed the dots from the IP address), the binlog start and end date, total transactions, so the total rows that have been processed, the table name, so schema plus table, and the type of query. And on the right is the binlog time range, basically the end date minus the start date. This is useful because you can quickly see that the shorter the time range, the more transactions you had, because you're storing more transactions in a very short amount of time: your binlog is limited to 100 megs.
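As an illustration of the grepping and parsing steps mentioned above, here is a hedged sketch of how the dates and per-table DML counts could be pulled out of a decoded row-format binlog; the file path is a placeholder and the real script's exact grep/awk is not reproduced here.

```bash
# Sketch only: extract the binlog's time range and count rows per table/DML type.
FILE="/var/tmp/binlogs/mysql-bin.000123"   # placeholder path

# Binlog event timestamps look like "#YYMMDD HH:MM:SS"; take the first and last.
START=$(mysqlbinlog "$FILE" | grep -m1 '^#[0-9]\{6\} ' | awk '{print $1, $2}')
END=$(mysqlbinlog "$FILE" | grep '^#[0-9]\{6\} ' | tail -1 | awk '{print $1, $2}')

# Decode the row events and count rows per table and per DML type.
mysqlbinlog --base64-output=decode-rows --verbose "$FILE" |
awk '$1 == "###" && ($2 == "INSERT" || $2 == "UPDATE" || $2 == "DELETE") {
        tbl = ($2 == "UPDATE") ? $3 : $4   # `schema`.`table`
        count[$2 "\t" tbl]++
     }
     END { for (k in count) print k "\t" count[k] }' > /tmp/binlog_counts.tsv

echo "Binlog range: $START -> $END"
```

The resulting file is the kind of thing the count table then gets loaded from before being rolled up into the summary table.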
So when you insert a lot of data, the binlog gets filled very quickly, and that's what happened here. I mean, look at this: here I had a huge insert on this table, and it occurred between this date and that date, et cetera. Using that, you can write some very simple queries and start summarizing it. Here, by date and by hour, these are the total transactions that my database is going through. Same thing here: date, hour, the type of query, the actual query, and the transactions, so a little more detailed. Once you have this stored, you can slice it any way you want; it becomes very useful. And the last one is a little different: you summarize by hour, you see the total transactions, and then the number of rows per second (see the query sketch below).

Now, the possible improvements. Hardening the script: like I was saying, it's a very simple one; it could be developed further to do more. For example, splitting the binlog at the top of the hour would be useful, because right now the binlogs don't always rotate on the hour; they rotate any time they're full, when they reach the maximum size. So because it's a range, the range doesn't always stop on the hour, and you could have a binlog spanning from one day to another, so it becomes a little fuzzy. Over time it's okay, because it's a trend, but cutting it at the top of the hour would guarantee that your transactions line up per hour. Even better, if you process those binlogs by the minute, you increase the granularity and you have more detail on your transactions. And a graphical representation, like Anemometer: Anemometer is a tool that was built by Box, I think, for the slow query logs. Something similar could be done where you actually graph your binlogs, and when you zoom in, it queries the summary table and gives you more detail. Are there any questions?
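As mentioned above, an hourly roll-up with rows per second could look something like this; the binlog_summary table and its columns (binlog_start, binlog_end, total_rows) are assumed names for illustration, not the speaker's actual schema.

```sql
-- Hourly roll-up: total rows and approximate rows per second.
-- Sketch only, against an assumed summary-table schema.
SELECT
    DATE(binlog_start) AS day,
    HOUR(binlog_start) AS hr,
    SUM(total_rows)    AS total_rows,
    ROUND(SUM(total_rows) /
          NULLIF(SUM(TIMESTAMPDIFF(SECOND, binlog_start, binlog_end)), 0),
          2)           AS rows_per_second
FROM binlog_summary
GROUP BY DATE(binlog_start), HOUR(binlog_start)
ORDER BY day, hr;
```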
[Audience question about using pt-query-digest instead.] Yes, pt-query-digest is usually meant for slow queries. This is the binlogs; the slow query log is usually for selects. I know it converts the inserts and updates into select statements, but the slow query log is mostly for the slow queries, and since there is a binlog tool available, I decided to use that. You can also use a start date and end date, which allows you to look at a certain time frame within your binlog. The summary gives you a pretty good idea, but sometimes you really need to go down to a two-second window, because you have thousands of transactions per second and sometimes you need to analyze just those two seconds. The mysqlbinlog tool will do that; it allows you to really focus on that particular moment and see what happened there.

[Audience question about open-sourcing the script.] I haven't done that yet, but that's something I would like to do. I'm not too familiar with Git and how to make it open source, but yes, definitely; if you're interested and want to work with me to make the code better and open source it, great, I'm totally open. Yes, definitely.

[Audience question about Aurora.] Yeah, I think I didn't title the presentation well; I should probably have titled it around MySQL lag or scalability, because that tool really helped, I mean at ZipRecruiter, to scale Aurora. On Aurora, when you have massive inserts and updates, it affects the readers, the replicas, so being able to identify those culprits very quickly is very useful. There's also a bandwidth limit on Aurora; it's all network related, and replication is costly, it's not free. So sometimes you come on board, you have tables using TEXT fields or VARCHAR, and you're inserting a lot of data and JSON; this will impact your MySQL.

[Audience question about detecting spikes.] Yeah, some spikes: at ZipRecruiter we had spikes that would occur at a certain time, and we could not pin them down just using monitoring tools; we just saw the spike and didn't know what it was. By using the binlogs, we were able to focus on that particular moment and go, oh, it's this table during that time. Now you have the table, now you have the time; you can go grep your cron, and then you find the batch job that's doing it, or you can look at the logs, your web logs. Most of the time that's what you need: the more information you give to the developers, the easier they can find it. Here's which table it is, between these times; tell me what happened.

And you can also correlate it with Logstash, if you're storing your transactions in Logstash. Also, because you have the table name, you can join with information_schema and get the size of the table, the data and index sizes; you have the reads as well, so you can start building a more precise monitoring tool the same way. And the good thing about it is: Performance Schema is great, but it consumes memory and it does use resources, because it has to store data in those tables, whereas here it's totally transparent; nothing extra happens on the server.

So if you want to reach me, that's my email address; feel free to email me if you have any questions. I'm going to look into putting the code online and making it open source, because it has saved my life many, many times at work. Many times I had to use it, and it's very quick: once you have those queries built, you just run them and boom, you have the answer immediately. All right, thank you.