How many of you here have experience with databases? Oh, that's wonderful. So then you probably know how much trouble databases can be when it comes to performance. So many times I would come to a customer whose website or app was slow or down, and I would ask them, what's the problem? They'd say, oh, that's the database. I would ask, why do you think so? And they'd tell me, well, because it's always the database.
So why are databases really painful to scale and to deal with in terms of performance? I think there are a few reasons. One is that, generally, you don't get linear scalability. You put more traffic on the database and it does not scale linearly. As your data grows, the queries typically don't scale linearly either. The other thing is that databases are typically very complicated pieces of technology, and developers often don't quite understand them. Right? You have some query, and why it would be slow or fast, how the optimizer works: for many developers, that is all really a mystery.
So now, if you look at the performance work we typically do with a database, it comes down to a few things. Troubleshooting, right? A problem happens and you need to make fast what was slow before. Often you also have to do capacity planning. That means understanding how much traffic, how much load the database can handle. And increasingly now we have to do cost and efficiency optimization, which is especially important in the cloud, right? If you're wasting money, you can scale the instances down and start saving right away. And also change management, right? By change management I mean that if you go ahead and upgrade to the next database version, chances are something will get faster, something will get slower, and you need to do the work to figure it out.
Now, when you look at databases, you can pretty much look at them from two points of view.
One is black box, and that's how an application developer looks at it. And then there's also white box, where we understand a lot more about the internals; that is how SREs and other operations folks often have to look at it.
Now, if you look from a developer's point of view, look at the database as a black box, it looks very simple, right? I have my queries, I throw them at the database, and it responds to me, right? Very simple. And developers often think, oh my gosh, why can't this stupid, nasty database just work, right? Why can't it just handle the queries I throw at it, right? Now, what is interesting is that with the cloud and a lot of database-as-a-service offerings, often, even if you're in operations, that is as much visibility as you get, and that's the mode of operation you would have, right? Because you may not have complete insight into the internals of what happens inside the database as a service.
So if you think about it from the developer's standpoint, the criteria are very simple. What makes a good, well-behaved, well-performing database, right? You want it to be up. Of course, when you send your queries to the database, you care about response time. You want it to respond fast, and you obviously want a correct response, right? Correct responses, and data being updated and stored correctly. And obviously you also want that not to cost you a lot, right? To be efficient. Again, especially in the cloud, systems can sometimes, let's say, scale automatically for you, right? But if you are doing stupid things, it's going to cost you stupid money.
Now, from an ops point of view, we often have to look at a lot more of the internals, right? You care about the load on the systems, the resource utilization, and if, let's say, some operating system or hardware problems happen, right?
You have to deal with that, as well as scaling and capacity planning to handle the needs of applications. So that is the overview, right? A very brief one.
Now, what we'll spend more time on here is the methodologies for troubleshooting and analysis. The approach I see many people use is what I would call troubleshooting by random Googling. Right? I mean, you have some problem you don't understand, you enter it into Google, and hopefully you get the answer on the first page, right? Now, the problem with that typical approach is that it's pretty much hard to assure the outcome, right? Because as wonderful and smart as Google is, search engine ranking doesn't necessarily have anything to do with how relevant the answer is to your particular problem, right? It's also hard to train people, right? Because that is way too general. And it's also hard to automate, or even to turn into something like a checklist, right? To go through.
And that is why there are a number of methodologies for troubleshooting. I will very briefly cover a few of them. None of them is really database-specific, but because we deal so much with the database component, I will talk about how they apply to databases.
The first one is the USE method by Brendan Gregg. Anybody familiar with his work? Oh, well, fantastic. Many hands. What the USE method was developed by Brendan for is pretty much to troubleshoot server performance, first and foremost. And the idea is, how can we resolve 80% of the problems with 5% of the effort, right? So it doesn't claim that by following those simple steps, simple checklists, we can solve everything, but we can solve a lot of problems in a relatively simple way. The great thing about this method is also that a lot of great checklists are available, which tell you, hey, if I want to follow this method for this environment, this is exactly what I should do.
Now, if you describe that method in one sentence, it would be: for every resource, check utilization, saturation, and errors.
Now, there are a few definitions this method uses. First, we talk about a resource. Resources are all the physical server components, like CPUs, disks, memory, and so on and so forth, right? These are your basic foundational resources, which have a certain capacity, right? And, you know, if you try to make your disk do more IOPS than it's designed to handle, well, it just won't do it.
Now, utilization, as a measure, tells us the average time the resource was busy servicing work, right? It's pretty simple: busy or not. But another good measure of resource usage is saturation, which tells us to what degree the resource has extra work that it can't service, which is often queued. Right? So for a disk, you may be looking at the disk utilization but also at the queue depth, right? The number of requests waiting in the queue would be a measure of saturation.
Now, another thing which is important to look at is the count of error events, right? Because error events can cause a resource to slow down, right, and perform poorly, and in some cases, when you oversubscribe a resource, right, you start getting errors. So for example, think about the network, right? At first an overloaded network becomes slow, but sooner or later you simply can't, let's say, establish a connection, right? Because of timeouts, you just get an error.
So here are the resources we often have to deal with. CPUs, right, which would be your CPU cores and hardware threads. Memory, right, where memory capacity is very important. Network interfaces. Storage devices; again, for storage devices you would look both at I/O usage and at space usage, because obviously if the disk runs full, it tends to cause a problem.
You can also throw in storage and network controllers and various interconnects, right, where the interconnects between different devices can become overloaded and cause a problem.
Now, this all talks about applying the USE method to hardware, but how does that work with software? Can you really use it there, right? And the idea is, yes, we can use it for software as well, and the same basic principles apply, because if your hardware is oversaturated, right, by the demands the software puts on it, you'll have performance problems. But in software there are some other resources which apply. So for example, mutex locks, right? In many cases you may have hardware that is not highly utilized, but you have mutex contention, right? And you can think about that as a resource with its own utilization and saturation. Another good example is file descriptors, right? I mean, if software runs out of file descriptors, it will probably fail. Or in the case of a database such as MySQL, one of the common problems in capacity planning is not having max_connections set high enough, right? You run out of connections and your application starts failing, right? Not because of any hardware resource usage, but because a software resource gets completely used up.
Now, the USE method's benefits: it has a proven track record, it has pretty broad applicability, and there are a lot of detailed checklists available. The drawbacks here would be that it really requires you to have a good understanding of the system architecture, right? Which components are responsible for which resource usage? It also really requires access to low-level resource monitoring. So for example, as I mentioned, if you're using a database-as-a-service offering, you may not have as much detail about how that all maps to different resources and what's happening out there, right?
And in general, it really assumes you have a lot of that white-box understanding and insight.
So here comes the RED method, right? You can think about it as another look at essentially the same things, but applicable to more of a wider system. The RED method focuses a lot on microservices, right? In that world we are not treating our system components as pets. They can be very transient. They can go up and down, right? And linking microservices in containers to resources can be hard, because it's kind of loose, right? You can have a container running here one moment and somewhere else the next. Now, to be honest, databases are still often closer to pets than to cattle, right? Because of their state and their relationship to the data. So keep that in mind too.
So the RED method says this: for each service, right (remember, in this case it's the service, not the resource), check that these items are within SLO, which stands for service level objective: rate, error rate, right, and duration. Duration here refers to the response time distribution. So that means if I have, let's say, a certain function at the service level, I should be able to say, hey, this service should be able to handle up to a thousand requests a second, there should be no more than one error per million, and, right, for duration, I should get 99% of the responses within 10 milliseconds, right? Something like that.
When I apply that to databases, of course, we can look at it at different levels. It's a good idea to apply that at the service level.
For example, you may have some sort of load balancer in front of the database, but you also often want to look at the individual database servers, because often you would have only one of them malfunctioning, because it's overloaded or maybe because the database has picked some stupid execution plan on that particular box, because that can happen. It can also be applied to components and resources inside the database server, right? Like, for example, you can think about the disks or other components. And also, which I think is wonderful, it can be applied to the individual types of queries in your application, right? Because different queries tend to have different goals. For a search query, you may need a response within 100 milliseconds, while for a simple lookup query you may need a much faster response, right, or something like that.
Now, in terms of benefits, I think what this method really covers is pretty much the same things we talked about, right, that a developer who works with databases cares about. So that, I think, is wonderful. It also doesn't require as deep an understanding of the system's architecture as the USE method does. And it does not require as much access to low-level information, right? You can really see which services are giving you trouble pretty much from the outside, if you will, right?
Now, the drawbacks: there aren't as many detailed checklists for it, or well-developed, detailed tool sets built for it yet. And also, it focuses more on finding what is having a problem than on why, right? So I can identify that this microservice, for example, is malfunctioning because it's not really working well, right, it's too slow or throwing too many errors, and then you can go to that microservice's developer, right, or the responsible person, and say, hey, in this case your service is malfunctioning, go and fix it.
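The RED check described above (rate, errors, and duration compared against a service level objective) can be sketched in a few lines of Python. This is only an illustration, not a tool mentioned in the talk: the `Slo` class, the `red_check` function, and their field names are hypothetical, and the thresholds in the usage note are placeholder numbers.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class Slo:
    """Hypothetical service level objective for one service or query type."""
    max_rate: float          # requests per second the service is expected to handle
    max_error_ratio: float   # allowed fraction of failed requests
    percentile: float        # e.g. 0.99 for "99% of responses"
    max_duration_ms: float   # latency bound at that percentile


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, ceil(p * len(ordered)))
    return ordered[rank - 1]


def red_check(durations_ms, errors, window_s, slo):
    """Compare one observation window against the SLO.

    durations_ms: latency of every request seen in the window
    errors:       how many of those requests failed
    window_s:     length of the window in seconds
    Returns a dict of booleans, True meaning "within SLO".
    """
    total = len(durations_ms)
    rate = total / window_s
    error_ratio = errors / total if total else 0.0
    duration = percentile(durations_ms, slo.percentile) if total else 0.0
    return {
        "rate": rate <= slo.max_rate,
        "errors": error_ratio <= slo.max_error_ratio,
        "duration": duration <= slo.max_duration_ms,
    }
```

For example, `red_check([2.0] * 95 + [50.0] * 5, errors=1, window_s=1.0, slo=Slo(1000, 1e-6, 0.99, 10.0))` reports the rate as within SLO but flags both the error ratio and the 99th percentile duration. The same function could be pointed at a whole service, a single database server, or one type of query, which is the layering described in the talk.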
Now, the third and last one is the four golden signals, and that comes from a chapter, very well known by now, "Monitoring Distributed Systems", in the SRE book, Site Reliability Engineering, which is based on lessons learned at Google. It is not just focused on monitoring; it can also be used for other things, such as alerting and trend analysis, right? But I think it's very applicable in this context as well.
So what are those four golden signals? The first one is latency, right? We are looking at the latency of requests, and in this case we want to look at the latency distribution, not just averages. The average is generally a bad measure when it comes to performance management. We also want to make sure we look at latency separately for successful requests and for errors. Because in many cases errors can be thrown very quickly, right? Let's say I'm trying to connect to MySQL to run some expensive query and the connection fails. I will get an error, and I will get it back very quickly, right?
We also want to look at traffic, right? This is similar to the measure of rate, right? How many requests the system is getting, right? How much demand is being placed on the system? Because obviously, if there is, let's say, too much traffic, too much demand, hey, there we have it: that is why the system is not performing.
Then we look at errors, right? And for systems in general there are two kinds of errors, right? There are errors which are based on error codes, and these are typically easy, and there are also errors where the content is wrong, which is a lot harder, right? So for example, especially for a database, if I run a SELECT query and, because of a bug in the database, I get the wrong content, it's very hard to find out, right? Typically it takes a user to complain, right? Or some automated test may be failing, but it's hard to integrate that with performance management practices.
Thankfully, it doesn't happen that often. And saturation, which in this case is about more high-level saturation, tells us how full the system's capacity is, right? So if you have done capacity planning and you know, for example, oh, my system can handle, let's say, 10,000 per second, right, and you are at 5,000 right now, right, then you can talk about a certain percentage of capacity used.
So for databases, here is what you can do. For latency, look at the response time distribution, and it is especially good to do that by query digest, right? Just to clarify, a digest is essentially a type of query: the query with all the parameters stripped out of it. For traffic, that could be the number of queries. And again, you can look at specific kinds, by digest. So that's a very good measure for MySQL. For errors, that can be error codes but also wrong query responses, which, again, is particularly hard. I would point out that for MySQL, a lot of those wrong responses come from inappropriate use of replication, right? MySQL replication is asynchronous, and if you don't have good monitoring in place, you may have a slave lagging by, let's say, half an hour; you go to it and you get very old, stale data. If your application does not account for that, that is the most typical case where a developer thinks they are getting wrong responses, right? Just from not taking MySQL replication specifics into account. And saturation. Here we can look at the basic resource usage, CPU, disk, whatever, but also at things like connections and disk space, right? That's also very important, because problems often happen when you run out of disk space.
Now, for the four golden signals, this is a methodology which has been working well for Google and some others, and I think there's a good resource, the book, available about it and how it fits into the general concept.
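To make the idea of a query digest concrete, here is a minimal Python sketch that strips literal values out of query text and then reports traffic and latency per digest. It is a deliberate simplification and not the digest algorithm that MySQL or any existing tool actually uses: the regular expressions only handle single-quoted strings and plain numbers, and the `digest` and `summarize` names are hypothetical.

```python
import re
from collections import defaultdict
from math import ceil

# Simplified literal patterns; a real SQL normalizer handles far more cases.
_STRING = re.compile(r"'(?:[^'\\]|\\.)*'")   # single-quoted string literals
_NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")   # standalone integers and decimals
_SPACE = re.compile(r"\s+")


def digest(query: str) -> str:
    """Reduce a query to its 'type' by replacing literal values with '?'."""
    text = _STRING.sub("?", query)
    text = _NUMBER.sub("?", text)
    return _SPACE.sub(" ", text).strip().lower()


def summarize(samples):
    """Group (query_text, latency_ms) samples by digest.

    Returns, per digest, the number of queries (traffic) plus the average
    and the nearest-rank 99th percentile latency (the distribution tail).
    """
    by_digest = defaultdict(list)
    for query, latency_ms in samples:
        by_digest[digest(query)].append(latency_ms)

    summary = {}
    for key, latencies in by_digest.items():
        ordered = sorted(latencies)
        rank = max(1, ceil(0.99 * len(ordered)))
        summary[key] = {
            "count": len(ordered),
            "avg_ms": sum(ordered) / len(ordered),
            "p99_ms": ordered[rank - 1],
        }
    return summary
```

With this sketch, `SELECT * FROM users WHERE id = 1` and `SELECT * FROM users WHERE id = 2` fall under the same digest, so their latencies are judged together. Reporting the average next to the 99th percentile shows why the average alone is a poor measure: a few slow executions barely move it but dominate the tail.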
Now the drawbacks: again, it is aimed at monitoring in general, and because of that there are no very specific checklists available.
So now, the takeaways for me. If you look at all those methods, they are all relatively recent, right? I think the first publication about the USE method was probably, you know, four or five years ago, right? I think the other two came somewhere in the last two or three years, and we have a lot of methods being developed for this in our industry now, right? And what is also interesting is that you can compare those methods very easily, and they overlap significantly, right? They're all looking at relatively the same things, just from slightly different angles.
Now, for us at Percona, we are working on figuring out how to integrate those methods into the open source monitoring tool we are developing, which is called Percona Monitoring and Management. Right now it's mostly ideas; we don't really have too much insight or too much specific implementation of these methods out there yet.
And the final thing I should say: if you guys are into open source databases, we are running a conference in April, so we would very much welcome you to come and attend. And that's it. Thank you. I would be happy to take some questions outside, and we also have a table here, right? Thank you very much.