Hello everybody. How are you doing? Okay. So pretty much everything has been said about me thus far. As David mentioned, I do monitoring for a living and we use Postgres, so I figured the one thing I wanted to convey or share is the overall lessons learned in monitoring Postgres, and hopefully present some way to do it sanely without going crazy. My Twitter handle is there; if you have comments afterwards, you can also talk to me in real life, as it were. So at this conference last year, Datadog presented a talk called "Monitoring 40,000 servers with a Postgres database." This year I didn't want to present the same thing with just updated numbers. We've obviously grown as a company and as a service, and our database cluster has grown comparatively little, which I think is a testament to the solidity of Postgres and its viability as a solid SQL database. Who was at PGConf 2014, out of curiosity? Okay, maybe 20 to 30 percent of the room. Great. So this is what I look at every day. It sits on my desk, immediately to my right, and it's basically how I see Datadog: a lot of pretty graphs and so on. What I've circled in red on the right are the key metrics I watch that pertain to our Postgres cluster. There's a lot more on it, obviously, because we run a platform that is more than just Postgres, but it still takes a fair amount of real estate; it's a central piece of the architecture. Without rehashing last year's presentation: basically we use Postgres at Datadog as a catalog, if you will. We're a monitoring company and we receive a ton of time series data. We don't store the time series data in Postgres because that would be a little unwieldy; there'd be too much storage, and it's just not optimized for that. What we store in Postgres is all the metadata. Is anybody here a customer or a user? Okay, so all your tags: yeah, you're in there, somewhere. Now, there are a lot of people here, and I'm assuming you're interested in monitoring, though you may not be very excited by it. Quick show of hands: who really loves monitoring? Oh wow, you guys should come and work for us. Who hates monitoring with a passion? Oh, that's good too. You should be able to say "I hate it"; that's entirely possible. So if I zoom in, this is what I care about, handcrafted to our use case. I care about whether we're getting transactions: that's this one, commits and rollbacks. I care about whether we're going to run out of disk space, because that's game over; I'll speak to that in more detail. I care about our apps timing out because, for instance, a query is not returning. I care about locks on some very specific tables. I care about replication lag, because we use replication for reads. I care a little about CPU, and I'll get into that at some point: we ran into some interesting issues with the kernel, and I want to make sure we don't have a repetition of that, or at least that I have a shortcut to the answer if it ever happens again, but I think we're in good shape. After watching this stuff all day long, that's pretty much as much as I can consume for Postgres. There are eight graphs here, which is probably a few too many; three or four at most is what I really care about. I'll show you how many metrics you can collect for Postgres; there are many more, but most of them are not interesting at a high level.
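For the first of those circled metrics, transaction throughput, a minimal sketch of the kind of query a monitoring agent might run is below. It is an illustration rather than the exact query used at Datadog, and it assumes plain SQL access; the pg_stat_database counters are cumulative, so an agent would diff successive samples to turn them into a rate.

    -- Cumulative commit/rollback counters per database; sample periodically
    -- and diff successive readings to get commits and rollbacks per second.
    SELECT datname,
           xact_commit,
           xact_rollback,
           deadlocks
    FROM pg_stat_database
    WHERE datname NOT IN ('template0', 'template1');

A commit rate that suddenly drops to zero, or a jump in rollbacks, is exactly the kind of symptom-level signal discussed later in the talk.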
I'm interested in them in some specific situations, but from a pure day-to-day monitoring standpoint, that's enough for me, and the premise of this presentation is that that should be the case. You should not have to watch a screen full of 50 metrics, because that means you're watching stuff you shouldn't be watching. Everybody with me so far? Okay, cool. I've got to keep pressing on this stupid thing. All right, so it's not always been happy times in Postgres land at Datadog, although in comparison with any other data store we use (we use Cassandra, we use Kafka, we use Redis, we have Elasticsearch and so on), Postgres is probably the one that's given us the least amount of grief; the only exception is probably S3. As the service scales we commonly hit certain bottlenecks with every single piece of our infrastructure, Postgres being one. So once in a while we'll hit a bottleneck, I'll go through the methodology I'm going to describe, we'll find the issue, we'll tweak something, and then things will be better. For instance, we're running on a 32-core machine, probably one of the biggest machines available to us, more than we strictly need, but we've had cases, like this one over the course of two hours, where we basically bottom out. The thin line is the amount of idle CPU left and the thick line is a moving average, so it's a little easier to track, and you can see that at some point, for a good 20 minutes, we're basically running out of steam. We've had episodes like that, and interestingly enough, some of them look the same on a graph but, as I'll explain, are totally different in nature: the same picture where we're running out of CPU, but a different cause. Now, running out of CPU is not necessarily the kind of problem you're going to have, because you may be a lot more IO bound than we are, but you could project the same approach onto IO metrics, onto IO stats. Another kind of problem we've had is what we call "Lockness": there's so much contention on the box that nothing moves, and then at some point somebody figures out a way to fix it and everything goes back to normal. We've had that. If you're a customer or a user, maybe you've noticed; hopefully you haven't, and if you did, we should have told you via our status page that we had a problem, so it doesn't come as a surprise. We've also had this kind of stuff: this is a stacked CPU graph (the colors are admittedly awful), and this peak here is kernel time, the percentage of CPU time spent in the kernel, versus this, which is roughly what Postgres itself does. So we've had issues where there's a huge imbalance there, and I'll get into that as well. So, a bunch of battle scars with Postgres, and I imagine everybody in the room has had similar experiences. Out of that, when I was thinking about what I could talk about this year that would be useful, I figured I could present how we approach dealing with these kinds of problems, because they turn out to be very diverse.
First, I just want to define what monitoring is, since the title of the talk is monitoring from the ground up. For me it's really understanding the performance of the system as it is now. To some extent you could argue, and I would definitely argue, that your monitoring system, whatever it is, is how you understand what's going on. The other way to do that is to ask a user how it's going; there's just no other interface. Even if you're on-prem, you can't walk to the cage, look at the servers, and understand what's going on; you just see the lights blinking. That's why, for me, monitoring is crucial. The crucial word here is understanding, and you have to understand how it is now, not how it should be or how it will be in a month or something like that; there's a strong real-time aspect to it. Monitoring is also dealing with partial degradation or total failure, and that's really the battle scars I was mentioning earlier, so that's pretty obvious. And monitoring, importantly enough, and that's how I got to these eight graphs that I watch, is an iterative process. Monitoring is about understanding, and you don't come up with an architecture and understand all of its performance from the get-go. You discover things as they occur, as the input changes: maybe you have more traffic, or less traffic, or different patterns, or different queries. You refine your understanding, and your monitoring system ends up encompassing, embodying, your understanding of what matters in your architecture, in particular what matters with respect to the performance of Postgres. So it's an iterative process, and because you all probably have somewhat different situations in terms of how you use Postgres, what size, what hosting, what kinds of queries you have, you'll have different outcomes. What I want to talk about is how you get to a reasonably fast, good iterative process that gets you to having just the key metrics to watch, so that you can ignore the rest for the most part. Is everybody good with this definition? Anybody vehemently disagree? If so, please do say so. What I don't consider monitoring is capacity planning. You can use the same data, the same metrics that we'll see, for capacity planning, but for me it's something different. Capacity planning is obviously all about forecasts: extracting some kind of model from the data, with assumptions baked in, and cleaning up the data so your model can be used to predict what your performance will be, either at some point in the future or if you double, say, the amount of traffic to your platform. That's a different class of problem, and thus the method I'm going to talk about doesn't quite apply there. Monitoring, I would argue, is relatively hard because the systems we use are complex. Anybody new to the Postgres world would probably go through these four questions, which I'll detail in just a second, but the reality is that if you start using any new tool, any new data store, new application runtime, new anything, you're going to have these four questions pretty much in mind.
The first one is: what can I measure about the performance of my system, in our case Postgres, and we'll get into detail. The second, important one is: what do I want to be alerted on? What should I watch, and when do I allow myself to be woken up in the middle of the night because something's wrong with Postgres, or with component X if you're talking about something else? The third one is critical: what is normal? With all these metrics you can collect, what do they look like when things are normal? And fourth: when something bad happens, what do I use? Ultimately, once you have a proper monitoring system in place, which really means once you understand how the system works, you can answer all of these, particularly number four. So my goal, using Datadog as an example, is to present a sort of rational method for designing a monitoring system, if you will, to apply that to Postgres in particular, and overall to share what we've learned. That's really why I'm here. In particular, I'm not here to sell Datadog; if you're interested in buying, they're downstairs. There'll be some graphs from Datadog, but that's it. Okay, so let's talk about some prior art. By and large, the ideas I'm presenting are not my own, and I've attributed them. There are three fairly recent documents or ideas that have been presented which I think apply very well and are the backbone of this method. They're all, luckily, available online. The first one is called "My Philosophy on Alerting" by Rob Ewaschuk, who is an SRE at Google. If you are on call, or if you have people on call in your company, I very much recommend reading it. Who has read it? Okay, definitely go and read it; it's totally worth it. Number two is "What metrics should I pay attention to?" by Baron Schwartz; he's actually downstairs, I saw him earlier. That's a video you can watch, and it takes about 20 minutes. And number three is the USE Method by Brendan Gregg, who's now at Netflix; it's on his website. I'm going to go through the Cliff Notes of each, so I won't spend too much time on them, but really to articulate how to go about this. So, Cliff Notes on "My Philosophy on Alerting," as it were. The idea is to answer the fundamental question if you're on call. Actually, who's on call in the room? Yeah, maybe 30 to 50 percent. The question is: what should I alert on? What can I get woken up for in the middle of the night? And the idea is to answer this question with the intent to maximize robustness, so you don't miss alerts that should have woken you up and you don't get woken up for nothing, and conversely to minimize burnout. For those of you who carry a pager, as it were, you know what burnout means; for those of you who don't, be glad you don't have to face it. I've been through phases like this: you're on call, let's say for a week, and every night somewhere between two and four a.m., basically in the deepest phase of sleep, your phone rings like something's on fire and you have to react. It kills you, it kills your family, it kills everything, basically. So what Rob says is that there are basically two things.
When you look at everything you can alert on, everything your monitoring system is currently collecting, there are two kinds of things: there are symptoms, and there are what he calls causes. Symptoms are things like the application timing out, or returning 500s, some kind of problem for the end user. It's like a disease: the symptoms don't necessarily explain why you're sick, but they're how you can tell something's wrong. Symptoms sit very close to the end user. The causes are more the classic monitoring stuff, like high load or low free disk space, all the stuff you can get very easily out of any machine; those he classifies as causes. The central message, and one of the key messages of his document and, I guess, of my presentation too, is that you want to alert on symptoms. The only thing you want to alert on is symptoms. So you alert if, for instance, the database is not answering any queries, or if all the queries, or a bunch of queries, return errors, or, if you look at a wider system, if no web traffic gets past your load balancer or something like that. You want to be pretty high level. The causes are more like how you explain the symptoms. Maybe the queries going to Postgres return errors, and the cause is that there's no disk space left, so the thing is basically stalled. Or no queries are returning from my database before the timeout, and the cause may be that the CPU is pegged or the machine is misconfigured. I'll explain why you want to go in this order and not in reverse. Now, the tragedy of out-of-the-box monitoring, in most cases, is that it takes the exact opposite approach, and the reason is that what are called causes here, like CPU, load, memory, and so on, are very readily available and easy to collect. So by default that's what gets configured for alerting, and that's bad, because it's very wasteful, and I'll show you why. Okay, so those are the Cliff Notes; I saved you maybe 10 minutes. I shouldn't really do Cliff Notes for this one because it's really worth reading, but if you don't have time or don't want to bother, that's what I extracted from it. Cliff Notes on the second document I pointed out, "What metrics should I pay attention to?": it's basically the same idea with slightly different terminology, and it answers kind of the same question, what should I monitor, as in, what should I allow myself to be woken up by? The message there is that there are two things, work and resources, and you should alert on work and not on resources. Work is what the database is doing: at the most basic level it's storing data, updating it, and returning the results of queries. That's the basic work of the database; the rest is just details. Resources are things like CPU and memory; the shared buffer hit percentage of your queries is also considered a resource. It doesn't really do work. You don't buy a database, you don't deploy a database, so that your queries can hit shared buffers; that's just not terribly useful by itself. You use it to store data.
One way to think about it is as a factory. It's very 19th century, but that's how I see it at a high level: Postgres is your factory, your machine. You feed it CPU, memory, network, IO, and data, of course, which comes in over the network, and you get results out of your queries. The work is in the queries; the work is not in consuming those resources. Now, the interesting thing with this work-and-resources approach is that you can apply it to the subcomponents of the database as well. The work of the database as a whole is to return query results, return them in time, and so on. But each component is its own little factory. A backend process uses CPU and so on, and returns some data. The WAL sender's goal is to send WAL; that's it. It needs network for that, and it needs to read WAL from disk, so it needs some access to disk, but that's all it does. So you can recursively apply this decomposition, work versus resources, to any metric you're collecting from any of these subsystems. Does that make sense? Who's not with me? All right, I'll take that as a yes. So again, the key message, very similar to the first one: alert on work, explain with resources. Conversely, do not alert on resources. So why not alert on resources? Here's the graph I showed you. This is idle CPU: 100 means the machine is either off or nobody's doing anything with it; zero means the CPU is completely pegged. Now, I could set an alert for 10% CPU left, great, but it could be that there's a bad query trashing the box, or it could be that somebody is, I don't know, mining bitcoins on the box, and I'd get basically the exact same profile. Setting an alert on a resource like CPU just tells me it's used. Great, okay, it's used. It's more a capacity planning problem than a monitoring problem, because when the alert fires I still don't know why. Maybe it's a good thing, maybe it's a bad thing. Maybe it's a batch job crunching some report for the end of the month, or maybe it's something bad. It's just not actionable. So a lot of the burnout, when you alert on resources like CPU (load is a classic), is just wasted time, because that's not what should wake you up. What should wake you up is: the database is not responding, no queries are being returned, there are a lot of errors, everything's timing out. That's the stuff you want to alert on, because ultimately that's the stuff people care about, be it your boss, your customers, or your fellow teammates. Is everybody clear on why not to alert on resources? I think we had a question: aren't you then waiting until something happens, as opposed to catching it early? You can understand not looking at the resources of the server in general, but what about things like its file systems and when they fill up?
So yes, I have exceptions to the rule; there are two exceptions in my whole presentation, and they are file systems and connections. Tracking the consumption of resources is very important if you want to do capacity planning: you want to know when you'll run out of disk space, or whether Postgres runs out of steam if incoming traffic doubles, that kind of thing. It is not, though, monitoring per se. (Go ahead, sorry, just one minute.) I mean, I can do something with a machine for five minutes and trip the alert; it's that easy. One thing I've found is that even resource monitoring is tricky, because I don't want to be woken up in the middle of the night because the disk went to 80% plus one kilobyte and then stays there all night. So for a resource, the rate of change is also an important thing: if it goes from 80 to 90 in 10 minutes, you need to know that; if it goes from 79 to 80 over the course of the night, you don't need to know that, you can fix it in the morning. Even so, my two exceptions are connections and disk. Look, if your CPU is busier than usual, maybe somebody pushed some bad code, but if the queries return in roughly the same amount of time, it can wait. At some point, more for capacity planning, I'll want to see what the performance looks like today: did we change dramatically from last month to this month? You should totally look at that, but not as a wake-me-up because it's now at 81%. Because what happens with that is: it hits 81%, so next morning I go, okay, I'll make the threshold 85%. That's a classic for disk space. And then, well, just make it 95%, and then it's game over. I've played this game and been burned a couple of times. Okay, here's another example. This is a CPU graph: the lighter blue is idle, so it's the headroom we have from a CPU perspective. It looks kind of bad on the right, it's maxing out. It turns out it had very limited impact; there were a few more errors, but in the context of this particular application it didn't matter. We get that kind of error once in a while anyway, and it's fairly decorrelated from CPU. So I wouldn't want to set an alert at 80% on this CPU, because it would have triggered and, frankly, there was nothing to report; the service was operating properly. I can look at it on a monthly basis and see where we're going to need a little bit of extra juice if we want to keep adding customers, but that's about it. It's not a real-time, wake-me-up-because-something-is-wrong signal. Did that clarify my point a little? Okay, cool. So the two exceptions. In the case of Postgres, the first one for me is disk space. It is a resource, but the problem with disk space is that with one byte left everything works, and with zero bytes left everything stops all of a sudden, and it's fairly easy to predict. You don't want to be getting calls and saying, "well, yeah, we didn't monitor disk space." Just do it; that one's okay.
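Free-space alerting itself has to come from the OS (df or the agent's filesystem check), but a quick look from inside Postgres at where the space is going can be done with something like the sketch below; again, an illustration, not a query from the talk.

    -- Per-database footprint, largest first. Actual free-space alerts
    -- should still be driven by OS-level filesystem metrics.
    SELECT datname,
           pg_size_pretty(pg_database_size(datname)) AS size
    FROM pg_database
    ORDER BY pg_database_size(datname) DESC;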
The other one is the number of connections, because there's a maximum in your Postgres configuration and you don't want to run out: it causes back pressure, and your application will degrade fairly quickly once you do. So these are shortcuts, but in general I'm not a huge fan of alerting on resources, specifically alerting on resources. Okay, the last set of Cliff Notes: the USE Method. Who's heard of that? Okay, cool, a few people. So you have the difference between work and resources: work is what's useful, and resources are what's consumed to produce that work. You alert on work, say the database is not returning any queries. Now you know you need to look at it; the question is how you go about it. The USE Method is a systematic review of all the resources to identify the bottlenecks and the errors. USE stands for utilization, saturation, errors. It turns out you can map basically every metric you collect onto one of these three, and having this taxonomy is very useful because it lets you do two things: make sure you've covered everything, and, when you get a new metric, categorize it: what does this really measure, utilization, saturation, or errors? You can apply USE to any resource: CPU, IO, memory, the classic resources, but also the checkpointer or a backend process; you can consider those resources too. For a backend process you can ask what its utilization is: how often is it actually not idle, how often is the connection not idle? Same for the WAL sender. You can look at errors: how many errors does the backend process generate? The checkpointer has its own set of metrics, and so on. The nice thing about this method is that the suggested interpretation is relatively simple. You're okay when you have low utilization, no saturation, and few errors. It's usually bad when utilization is over 70%, because with sampling that can mean you're actually hitting 100% utilization, the resource entirely consumed, in between samples. It's usually bad when saturation is greater than zero; saturation being things queued up, waiting for a resource that doesn't have the bandwidth to process them right now. If things pile up for a resource, if WAL piles up because it can't all be sent across the wire to your replicas, if PgBouncer client connections pile up because there's no backend process to serve them, that saturation is usually bad. Errors are bad too, but there it's the change that matters more than any specific value. If you usually have zero errors and the occasional one shows up, that's usually fine. If you have a couple of errors but the count isn't increasing, you're usually fine as well; in this universe there's always going to be some noise, so you'll see some errors sometimes. It's bad when you have a lot more errors than usual: you have a symptom, or you've been alerted on some work metric, and the errors for a particular resource are growing. That's usually a good way to interpret it: this particular resource has an issue.
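To make the USE vocabulary concrete for one resource, connections, here is a rough sketch: utilization as backends in use versus max_connections, and a crude saturation signal as backends blocked waiting. It assumes a pre-9.6 pg_stat_activity with the boolean waiting column; on 9.6 and later the wait_event columns replace it.

    -- Utilization: backends in use vs. the configured ceiling.
    SELECT count(*) AS backends,
           current_setting('max_connections')::int AS max_connections,
           round(100.0 * count(*)
                 / current_setting('max_connections')::int, 1) AS pct_used
    FROM pg_stat_activity;

    -- Rough saturation signal: backends blocked waiting (e.g. on locks).
    SELECT count(*) AS waiting_backends
    FROM pg_stat_activity
    WHERE waiting;   -- on 9.6+ use: WHERE wait_event IS NOT NULL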
Okay, so here is USE applied to idle CPU. Idle is essentially the inverse of utilization: 100% minus idle is utilization. Load would be a measure of saturation of the CPU: if your load per core is greater than one, that means there are more tasks waiting to be executed than there are cores available. So, putting it all together before we get into the more concrete part: you alert on high-level symptoms, on work. When you have a symptom, you identify which resource is causing you trouble with the USE method. If the resource you've identified is not just CPU but something at a higher level, like one of the Postgres processes, you can look at that component in turn: which of its metrics are symptoms, which point to work, and which point to resources? You have this recursive approach to finding exactly what the root cause of your problem is. Does that make sense? Okay, cool. And then you just stop when the high-level symptoms disappear. That's that: then you can do postmortems and fixes and so on, but in general you just go through this again and again and again. So what I'm basically positing is that this little recipe takes care of the last three of these four questions; what remains is number one. In practice, the sources of things you can use for monitoring Postgres are the pg_stat_* views, a bunch of them; slow queries, if you enable log_min_duration_statement (auto_explain helps there); the OS, because at some point there are resources that are measured by the OS, not by Postgres; and you can add things like DTrace or SystemTap if you want to go fairly deep into the details. Okay, so the pg_stat_* views. I actually went through the documentation page and summarized it; you can go there and you'll see a spreadsheet. There are 83 of what I'll call metrics, things that will yield a quantity, a value, and I've classified them based on this approach: as symptoms (work, if you will) or resources. I found 30 for symptoms, at different levels. There are symptom-level metrics for the database, like whether there are any transactions going on: transactions at zero means the database is either not working, not doing anything (maybe it's in maintenance), or there's something bad because transactions just don't complete. Then there are metrics you can use to detect symptoms at individual components, like the WAL machinery, the checkpointer, and so on. And then there are a bunch of other metrics for utilization, saturation, and errors. If you go there you'll see it's just a spreadsheet; you can actually add comments, so please do if you care. Very quickly, what does it look like? For me, if you look at pg_stat_database, the stuff I care about is: are my clients okay? At the database level I care about whether I get errors in my queries and whether my queries time out (this may or may not show up in pg_stat_database per se). That's number one, a very high-level, symptomatic view of Postgres. Are there any transactions going on, and is query activity normal? By that I mean: am I getting selects hitting the box, inserts, updates, deletes, in line with what I expect? There could be any number of reasons why this goes out of whack, but at least this is the stuff I want to be woken up for, not whether we use a lot of CPU. Replication is also something I care about because of the way we run things. The two metrics I would call symptomatic there, as a way to detect problems, are the state of replication, whether we're streaming or not, and the replay lag, how far back am I? This is one of the queries we use to generate that metric; it's nothing crazy, just how far back we are.
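The exact query isn't shown here; a minimal sketch of the usual approach, assuming streaming replication and the pre-10 function names (Postgres 10 renamed them to pg_last_wal_receive_lsn and friends), looks like this:

    -- On the standby: seconds of replay lag, guarded against the idle case
    -- where no new WAL has arrived and the timestamp simply stops advancing.
    SELECT CASE
             WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location()
               THEN 0
             ELSE EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())
           END AS replay_lag_seconds;

    -- On the primary: is each standby still connected and streaming?
    SELECT client_addr, state, sent_location, replay_location
    FROM pg_stat_replication;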
Then there are the exceptions to the rule, the two resources I do want to be alerted on. I've set up Datadog monitors that will wake me up if we max out the number of connections, because I know problems will hit us hard immediately, or if we're about to run out of disk space, which even in 2015 is a surprisingly common cause of crashes and problems. So these are the metrics you can get; for me, the key ones are in the pg_stat_* views. The rest is in the spreadsheet I presented earlier, with all the details. I'll use those when I'm investigating a problem deep down, but I don't alert on them, because frankly, as long as the database is working, doing its work within the bounds I care about, I'm fine. Slow queries are another way to look at performance, and in my experience it's a little bit different. When things are okay, what I usually do is keep a baseline of pg_stat_statements to see what the execution of queries looks like: how many there are, how long they take overall. It's almost another measure of utilization of my database. For logging slow queries, I'll set the threshold to either the SLO that I have for my database or some kind of timeout, just under the client timeout, because I want to know when I'm basically not meeting my SLO or when the client is about to time out. I don't really like to set it much lower, because there's always going to be something slow: if you set it to one second, you're going to have a query that takes two seconds and starts popping up, and the log fills with stuff that may actually have no impact on the quality of service. That's why I don't want to set it too low. And then I turn on auto_explain so I don't have to run EXPLAIN by hand; everybody knows EXPLAIN. Slow query logging gets harder for me to use in some cases, but it's still useful when you have a bad plan: somebody rewrote a query, somebody forgot to add an index, something like that. I still find it usable, but only up to a point. If it's just one bad query, sometimes pg_stat_statements will flag it, because I have the baseline: oh, this query's total time just shot up even though the number of calls per hour hasn't moved, so something is different there. And maybe in the slow query log I'll see that particular query pop up.
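As an illustration of pinning the slow-query threshold to the SLO and letting auto_explain capture the plans, here is a sketch with a hypothetical two-second SLO. It assumes Postgres 9.4 or later for ALTER SYSTEM and auto_explain loaded via shared_preload_libraries; on older versions you would set the same parameters in postgresql.conf.

    -- Log only statements slower than the SLO / just under the client timeout.
    ALTER SYSTEM SET log_min_duration_statement = '2s';
    -- Have auto_explain log the plan of those same slow statements.
    ALTER SYSTEM SET auto_explain.log_min_duration = '2s';
    SELECT pg_reload_conf();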
So this is a case of those big queries; the queries we run are sometimes a little bit pathological. It's a very, very spiky pattern, and once in a while there'll be an error which may or may not be caused by it, but it's under the threshold that I care about, so that's fine. That kind of occasional slow query I can catch with the slow query log. When things get really bad, though, I personally find the slow query log and pg_stat_statements to be close to useless. The reason is that when something is systemically (not systematically, systemically) broken with the database, everything slows down, so all queries appear slow, and your slow query log fills up with basically every query on the system. The only way to work around that is to focus on the minute I get alerted and look at the queries that finished just after that, because if the slowdown lasts 15 minutes, after a couple of minutes everything starts being recorded as slow, and then all the queries of my system show up in my log and I'm just wasting my time. We've had cases like that, where everything slows down for 10 minutes, say between 10:35 and 10:45, and the slow query log just explodes; everything is trying to be logged, basically. The harder case, and this is true of monitoring overall, is when the symptoms are not obvious. We've had cases where the throughput is still fairly reasonable, everything is just a little bit slower, but not crazy slow. Then you look at utilization and it's like, huh, that's weird: things are fine, but something doesn't feel right. That's the kind of thing I'm not going to look at in the middle of the night. Waking me up to tell me that everything looks fine but resource utilization is high is not something my wife is going to like, because in the middle of the night I'll wake up, go "well, yes, something is not quite right," and I'm tired, I'm not at 100%. So even in this case, I don't want to be alerted on resources; I will try to understand why the system is behaving like that with the resources, but don't wake me up for it. What I did there is the USE method: look at everything, resource by resource, utilization, saturation, errors, and just pull the thread. One of the resources that stood out was locks: the locks were way out of whack. Okay, that's not great, and it explains why the queries are slower even though the quality of service is still fine, but it's only an indication, so I keep going: all right, what uses locks? You pull the thread, and you end up at the CPU; at some point everything ends up there, memory, CPU, IO, network. And what I saw in this case were these huge spikes in kernel time, really off the chart: on a 32-core machine we'd have 30% of the CPU cycles running Postgres code and 70% running the kernel, the Linux kernel. So okay, now the kernel is my resource, and what can I measure, what are its key components?
You start diving into the kernel layers, and finally we found two things that were kernel related, not surprisingly. I put them on the slide because if you're ever in a situation with high kernel time, where the percent of CPU spent in the kernel is high and locks are high but nothing else looks weird, maybe it's this, and I can save you time: we spent, I think, three weeks chasing these, so in case it's helpful, there it is. The first is receive side scaling: the network stack in the kernel by default executes on core zero, and this lets you distribute that work across multiple cores. The second is the scheduler migration cost setting: the reason the kernel was so busy is that it spends time moving a Postgres process from one core to another core to another core, and the more cores you have, the worse it gets. So that's an interesting piece. With that, I almost want to say that monitoring is not hard after all; it's still hard, but I think we've covered most of it. We know which metrics are available, that's in the spreadsheet; we know which ones to alert on; we have a sense of what's normal, or at least that normal is what things look like when there are no problems; and when something bad happens, we know how to go at it, and that's the USE Method. So the takeaways I have for you: monitoring is hard, but you can be rational about it. It's not an art; I hate it when people say it's an art. No, it should be more science than art. You want to alert on symptoms, and maybe you have to alert on symptoms that are outside of Postgres. You don't want to alert on resources, except disk space and connections. You want to use the USE Method to find which of the resources that power Postgres are causing the issues. And the last one is that you want to iterate, iterate, iterate. As you go through this, you build a solid understanding of what your Postgres is doing and what the application at large is doing, and that leads you to decrease the number of problems, and then you can love monitoring. Maybe that's a sixth takeaway, yeah. Any questions? Okay, go ahead. The question is about how to handle parsing the log file. Yeah, it ends up basically being parsing a log file. What we end up doing is loading it into another database; I don't want to load my slow query log into the database itself. Then we run some SQL against it, filtering around the window when the proverbial feces hit the fan, and just look at which queries were running and which ones were running slow. You could automate it; I can share some ideas offline. Another comment: that replay lag metric doesn't tell you if the slave disappears entirely. No, that's why the other metric is the replication state, but true, absolutely. Yes? Back to my earlier question about rate of change: for resources like disk, I want to alert when it gets to 80%, but only if it gets there quickly; otherwise I can deal with it in the morning. Do you have anything to handle that, how do you handle that scenario? So that's more the forecasting aspect, because essentially you want to know: do I have until the morning to fix the problem, or not?
In Datadog per se we don't have a way to do that at this point, other than doing a linear regression on what the trend looks like. The one sort of hack we have is that if you set a threshold at 80%, we'll send you a little snapshot of what the consumption looks like. You can actually alert on rate of change if you want; the thing is, you'd want to alert on the rate of change combined with the fact that it's at 80%. Because if it increases fast but it's just somebody doing a pg_dump, and there are still four terabytes of disk left and your database is 500 gigs, then you're probably okay. I would want both the rate of change and the threshold. Yeah, exactly. If it goes quickly from 20 to 40, I don't care, but if it goes up to 80 or 90, then I care. Yeah, that makes total sense. Okay, great, well, that's all the time we have, so thank you very much.