Hi everyone, I hope you can hear me in the back. Welcome. I promise I'm not going to make you suffer too much in this talk; I just want to present the idea of analyzing burning CPUs. First, a little bit about Percona — you've probably already heard of it out there, so I'm not going to dwell on it — and a little bit about me. I'm not going to keep bragging about myself here either: I'm a support engineer at Percona. A few days ago, an interesting case came in to support. A common one — most of you who deal with software have seen it: your CPU is burning, something has completely taken it over. And in the case of a database, you can't just do kill -9, of course; you want to keep your services running. So the idea of this talk is to present how we in support investigated this problem, how we attacked and identified the issue using flame graphs. At the end, I'm also going to talk a little bit about cold graphs, which are another approach, and explain them in a bit of detail.

First, this slide is an homage to Brendan Gregg. I'm not sure if all of you have heard of him: he is the guru, the mentor, of flame graphs. Me, I'm just a padawan here, just summarizing things; his content is much broader than what I'm going to show here.

So what are flame graphs? The best way to define flame graphs is with the words of their creator: basically, flame graphs are a way to visualize profiled software, so that we can quickly identify which threads are running and which code paths are running on the CPU. That said, let's be a little more specific: what is a code path? Most of you are aware of this. Here is a really simple program, the kind most of you saw at university or in your work: a main function, and some functions that are called during the process. And this leads us to call chains. So what are call chains? Call chains are the possible paths the functions can take, based on the behavior of the user or the behavior of the database.
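To make the idea concrete, here is a minimal sketch (the function names are invented for illustration, echoing the library example coming up next) of how nested calls form a call chain: each call pushes a frame, and inspecting the stack at any moment shows the currently running function plus all of its ancestors, which is exactly what a profiler samples.

```python
# Illustrative only: a tiny call chain, main -> get_book -> get_author,
# captured by inspecting the Python stack at the deepest point.
import traceback

def get_author(book):
    # Capture the call chain active right now, outermost frame first.
    return [frame.name for frame in traceback.extract_stack()]

def get_book(title):
    return get_author(title)

def main():
    return get_book("Dune")

chain = main()
print(chain)  # the last entries show the chain of callers
```

A profiler like perf takes this same snapshot, but of native stacks, many times per second.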
So in the example I showed before — something like library software — what we're doing is getting the book list, getting the book and the author, or the editorial, or getting a print, whatever. These are the possible call chains that can happen in your tree. Of course, we're talking about the software running fine, not about it hitting a bug and crashing in the middle.

That said, we have the possible call chains; what do we do? We profile them. So what is profiling? Profiling is a way of capturing metrics for a fixed period of time, so that we can aggregate, combine, and analyze them later. There are several ways of profiling, and most of you already use one without knowing the name. Raise your hand if you have never had a developer come to you saying, "Look, my query is slow." What do you do? You enable the slow log, set the long_query_time you want, and the queries get profiled into the slow log. That is profiling as well. Later you use whatever tool you like — pt-query-digest, grep, whatever — to aggregate the queries and identify the ones examining more rows or consuming more time, with whichever metric you want. So that is profiling: SQL profiling.

In this case, I want to profile stack traces and code paths, and for that, to analyze the burning case, we're going to use perf. Perf is widely used on Linux — I think most installations come with it by default, and if not, it's a really simple install. With three simple commands, you can create the SVG file. What does perf record do? -a means we're going to profile all the CPUs, and -g means we capture call graphs, the stack traces. -F 99 is the frequency: we're sampling the CPUs at 99 hertz. And -p is the little trick so you don't have to run ps and grab the PID every time — if you're running a single instance, of course; if you have multiple ones, then this isn't going to work.
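Conceptually, what happens between perf and the SVG is simple aggregation. Here is a small sketch (sample data and frame names invented) of what Brendan Gregg's stackcollapse step does: identical stacks from many samples get folded into one `frame1;frame2;frame3 count` line, which is the input format the flamegraph.pl tool expects.

```python
# Illustrative only: fold repeated stack samples into "stack count" lines,
# the same shape as stackcollapse-perf.pl output.
from collections import Counter

samples = [
    ("main", "handle_query", "row_search"),
    ("main", "handle_query", "row_search"),
    ("main", "handle_query", "lock_wait"),
]

folded = Counter(";".join(stack) for stack in samples)
for stack, count in sorted(folded.items()):
    print(f"{stack} {count}")
```

In a real run there are thousands of samples, but the principle is the same: the count on each folded line becomes the width of a column in the flame graph.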
And we're going to profile this for one minute — create data for one minute. After that, we make it machine-readable using perf script, sending it to an output file. And finally, using the tools provided by Brendan, we create the flame graph.

Now, let me show you the flame graph itself. This is the data we profiled using perf, after creating the SVG. It's interactive, so you can zoom in and zoom out; I'll show you during the analysis. So how do we identify things? How do we work with this bunch of colors and data? You have the y-axis and the x-axis. On the vertical axis, you're looking at a stack trace. We have more above, but I'm not going into detail; let's suppose this is the last one. The last one is the one currently on the CPU, and the ones below it are its ancestors, so the best way to read it is bottom-up. Because I profiled the mysqld process — sorry, guys, you may not be able to see it — we can see that 100% of the CPU time was spent in mysqld. Why? Because we were profiling only mysqld.

Now, when we look at the x-axis, the order from left to right doesn't matter: it has no logical time ordering, it's simply sorted alphabetically. And why? To maximize aggregation. That's it; there is no further meaning from left to right. But the size, the width of these boxes — this is what matters and what we're going to analyze. The wider the box, the more CPU was spent in that function.

And here comes the good part: this was based on a real case. If we start at the bottom, mysqld is using 100%, but when we come to this ALTER TABLE, we can see that 54% of the samples we got were spent on the ALTER TABLE, which is interesting. So when we go deeper and zoom in, we now have just that stack trace, with the functions MySQL is calling.
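The width rule above can be sketched in a few lines. Assuming folded stacks like the ones perf produces (the frame names and counts here are invented to mirror the real case), a frame's width is the share of all samples whose stacks contain that frame:

```python
# Illustrative only: compute each frame's flame-graph "width" as a
# percentage of total samples, from folded "stack count" data.
from collections import Counter

folded = {
    "mysqld;do_command;mysql_alter_table": 54,
    "mysqld;do_command;dispatch_sql": 46,
}

total = sum(folded.values())
frame_samples = Counter()
for stack, count in folded.items():
    for frame in stack.split(";"):
        frame_samples[frame] += count

pct = {frame: 100 * n / total for frame, n in frame_samples.items()}
print(pct)
```

The root frame ends up at 100% because every sampled stack contains it, which is exactly why mysqld fills the whole bottom row of the graph.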
And these bars start to be constant. We see copying data, insert on a secondary index, and now we see we're spending a lot of time checking foreign key constraints. After that, we see a freeze on the data dictionary. This requires a little bit of understanding of the code, but this function means we tried to acquire a lock here and couldn't, so the process is spinning — it's waiting. After investigating, we collected some interesting data and identified that, because we were checking foreign key constraints and this customer's database had a small table_open_cache, we saw several tables being opened and closed, which has a high cost for MySQL. Our proposal was to increase table_open_cache, and with that we could reduce the width of this box. It's not that the CPU "stopped burning" — that's just a figure of speech — but we made this whole process friendlier for the CPU, and the database could go on more smoothly. This is the real value we see in flame graphs.

Moving on, I'm going to talk a bit about cold graphs. What is a cold graph? Basically, it's what we call off-CPU analysis. If the previous flame graph was analyzing and identifying what was running on the CPU, off-CPU analysis and cold graphs identify what is making a thread wait to be executed. So we have both sides, and at the end I'm going to talk about how we marry them. Off-CPU analysis is about where we're spending time waiting for things. As we can see in this graph, flame graphs are responsible for the on-CPU side, and with off-CPU analysis we can identify why your thread is waiting: whether the thread is waiting for disk, waiting for the network, or, in the case of MySQL, for one of its several kinds of locks — with the default isolation level, REPEATABLE READ, if you're inserting data you can hit the infamous gap lock, and that can make the thread wait. This is part of off-CPU analysis. How do we do it?
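Before the real tooling, the quantity off-CPU analysis measures can be sketched in a toy form: wall-clock time minus CPU time is the time a thread spent blocked (sleeping, waiting on I/O, waiting on a lock), and that is what the off-CPU tools attribute back to stack traces. This is just an illustration of the concept, not how the kernel-based tools work.

```python
# Illustrative only: off-CPU time is wall time not accounted for by CPU time.
import time

wall_start = time.monotonic()
cpu_start = time.process_time()

time.sleep(0.2)  # stand-in for blocking on disk, network, or a lock

wall = time.monotonic() - wall_start
cpu = time.process_time() - cpu_start
off_cpu = wall - cpu
print(f"off-CPU time: {off_cpu:.2f}s")
```

The BCC tools do the equivalent measurement per thread and per stack, by hooking the kernel scheduler instead of sleeping on purpose.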
This is done with the BCC tools. For kernels newer than 4.1, you can just install them; they come packaged for the operating system. I've pasted what are, in my opinion, the most common cases: on Red Hat and CentOS it's just a yum install, and on Ubuntu and Debian an apt-get install of the BCC tools, and that installs them for you. One way to validate is just to check that this directory has the tools. If you're more curious, you'll see hundreds of scripts there: with the BCC tools you can profile disk latency, the network, all sorts of things. It would take a lifetime to learn, and there are people much better at this than me. But the one I want to focus on now is offcputime.

With offcputime, the -d option inserts a delimiter — this is for parsing the stack traces — -f makes the stack traces folded, and -p, again, identifies the PID of the process. And we're going to collect for 30 seconds. Same idea: we produce the stacks, and here we can use Brendan's tools again to create the SVG files.

So let me quickly show you. This is the cold graph, extracted from a MySQL instance I had just brought up; it was idle, with no queries running on it. Then I ran a test with the classic sysbench, of course, and this is what happens when the system starts to get saturated. As we can see here, we're spending a lot of time handling connections, waiting for a read or write on the I/O socket. But the curious thing — and I did this on purpose; I gave MySQL very limited resources, a single-core virtual machine — is that I was opening and closing connections for every single operation during my workload. One proposal here, after analyzing it, would be to use a connection pool in the application, or thread pooling on the MySQL side, to reduce the amount of time spent opening connections. And to sum up: these numbers — and not only these, but in the flame graphs too — are absolute numbers.
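The folded output of offcputime -f has the same shape as the perf folded stacks, except the trailing number is the total microseconds the stack spent blocked rather than a sample count — which is why these are absolute numbers. Here is a small sketch of reading that format (the stacks and values below are invented for illustration):

```python
# Illustrative only: parse offcputime-style folded lines, where the last
# field is microseconds spent off-CPU for that stack.
lines = [
    "mysqld;poll_fd;vio_io_wait 812345",
    "mysqld;os_event_wait;srv_worker_thread 52000",
]

waits = {}
for line in lines:
    stack, usecs = line.rsplit(" ", 1)
    waits[stack] = int(usecs)

busiest = max(waits, key=waits.get)
print(busiest, waits[busiest] / 1e6, "seconds blocked")
```

Feeding these lines to flamegraph.pl produces the cold graph, with widths proportional to wait time instead of CPU samples.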
So it's interesting when you collect both. That way, you have the 100% picture of what the threads are doing and what they're waiting on, and then you can correlate everything. Just looking at one of them, you can't infer whether your MySQL is suffering or not, so it's good to have both sides of the story: the running one and the blocked one.

That's it. If you have any questions, I'm available. Sorry for being short; I hope you got the idea. We're going to talk more about this at Percona Live, or you can catch me around here having a beer — I'll be really glad to answer all your questions. Thank you.

Okay, so if anyone has questions, I'm available. Yes?

[Audience] It looks very impressive. I was surprised that it was readable — that the calls were in the stack. It doesn't look like it needed any special compile?

No. Can you repeat the question? Ah, okay, sorry. The question was whether I did anything special to MySQL to analyze the code paths. For the MySQL Community version, there is a debug RPM package, which installs the symbols. If you do the bundle installation, everything comes with it, so you don't need to compile anything special to analyze it. Yeah — no magic, just another RPM from the bundle. Another thing I usually do is go to the MySQL repository on GitHub — I have it in my own repository — then I use the code search and open the function, and the functions are usually really well documented in the header, so it's easy to follow.

[Audience] That's what I was saying. But my question was more, actually — most of my time I'm working with Oracle and Postgres, and I know Oracle deliberately obscures their functions so people can't unpick what they're doing. That's something I really dislike about Oracle, but that's how it works. A lot of the time I'm also working with Postgres, and with MariaDB — do they follow the same naming conventions? Is that right?
I'm not sure, because recent versions of MariaDB have been diverging from MySQL, going their own way. In the past the functions matched almost entirely, but nowadays they're really starting to differ. I can check for you; for Postgres, I don't know offhand — sorry.

[Audience] So Postgres is running, and then it does something that makes it block, and that becomes visible. But what you showed, I think, is only the function names. Is there some way to actually see which line or area of the function is responsible?

No, for that you have to investigate the source code. To get to a specific line you can use GDB: GDB attaches to the process, and then you can literally dive into the code, every single line. But that brings a lot of overhead, so we don't use it in production. It's an option, though.

[Audience] If you don't have shell access to the server, is it possible to do anything?

No, because you need the libraries behind it.

Well, I think that's it. Thank you, guys.