Well, good morning, and thank you for the introduction. So yeah, I've been doing this for a really long time, usually longer than I care to admit. And I've done all sorts of cool stuff: I've worked for government, I've worked for startups, I've worked for enterprise companies. I've touched the code behind sites that all of you go to every day. But what I realized is that all it really means is I've been fixing other people's problems for a really long time. This talk was somewhat born out of that.

But on top of that, there was another thing that played into me talking about this, and that is the cloud ruling everything. It did, it really did. Because we all got used to the whole mentality of destroy and rebuild, right? And that's fantastic, but it's vaguely reminiscent of the "have you tried rebooting it?" mentality we all talked about twenty years ago, the one we tried to get away from. The whole idea of servers being our cattle: everybody knows the pets-versus-cattle analogy, right? Yes, good. The cattle analogy is great. You have a whole bunch of servers, a whole bunch of cows, and if one of them gets sick, you take it behind the barn, you shoot it, you buy a new cow. That's awesome. Except when you get mad cow disease, and all your cows get sick and you don't know why, and then you end up with no cows, no farm, and a huge mortgage on both of them. So in order to prevent that, you actually need to understand why your cows are dying, right? At some point you may still shoot the cow afterwards, but at least you've figured out what the symptom is.

So troubleshooting, as defined by Wikipedia (and we all trust Wikipedia), is a form of problem solving: you try to solve the problem. And in my definition, problem solving is the ability to fix things you know absolutely nothing about.

So why is this important? Why should we even care? Mainly because the systems we work with become more and more complex every day. We're not really building the Enterprise yet, but we're getting there in some ways. And those systems break. They break in really spectacular ways, and what's worse, they break in spectacular ways when you expect it the least. And what happens then? How many of you here have the luxury of not having every one of your mess-ups instantly seen by half the world? In other words, who doesn't work with the web? There's one lucky guy in the back who doesn't have that problem. For the rest of us, any mistake we make is made available to hundreds, thousands, millions of people, depending on what you work on. And those people are generally not happy.

Probably most important, though: I know some of you may think your job is to write the coolest software, or whatever it may be, but in reality your job is to fix code. Whether it's somebody else's code, or code you yourself wrote a week ago, which is more likely. More likely than not, you're trying to fix what's broken. You're trying to keep things stable.

One quick disclaimer before we go forward. When I say troubleshooting, I don't mean picking up a ticket from a backlog, going and fixing it, and pushing it to production. I know we all wish every bug were like that: sitting there with a latte in the mid-afternoon, trying to figure out why the page loads slower than usual.
In reality, when I talk about troubleshooting, I mean everything is on fire, it's three in the morning, you're on fire, and the water buckets are full of gasoline. I'm talking about that kind of troubleshooting.

So when you're in that situation, where do you begin? What is the first step when you get woken up by a page, or by an angry customer, which is even worse? Where do you start? Ironically enough, you start with replicating the problem. A lot of people skip that part, but replicating the problem is really, really important, because the input you get about the problem is not necessarily the right one. You cannot imagine how many times I've had a conversation like this: I'm reviewing the code on a push, and I'm thinking, well, this doesn't make any sense. I'm looking at the code, I'm looking at the problem description, and there is no way this fixes the problem; there is no way this fixes any problem, really. So I go talk to the developer. Hey, are you sure this fixes the problem? Absolutely. How do you know? Well, it runs after this. Did it run before? What do you mean? Let's talk.

So once you've identified the problem, you need to isolate it. As I mentioned, these systems are really complex: there are a lot of components, a lot of moving parts, and you need to figure out which part is broken. And it's not necessarily the part that warned you, not necessarily the part that wrote the error out or sent the page; it may be something completely different.

Let me give you an example. By the way, all the examples I use here, no matter how ridiculous they sound, are real; they happened to me. Keep that in mind. This is an actual email I got from a customer as a bug report: logins aren't working 100% of the time, alerts are going off periodically, and today the system didn't send out the scheduled emails. Pretty common, or something along those lines. As engineers, we parse it down to the information that seems important to us: logins aren't working, and the system didn't send the scheduled emails. So you get this email, you validate that it is indeed a problem, you glance at the errors, and you go look into the login system, you look at SSO, you look at the email scheduler, and you try to figure out what's wrong. The problem with that is, you're looking at the wrong part of the statement. The parts you should be looking at are "aren't working 100% of the time," "periodically," and "today." Those indicate that the problem is not in the systems the customer noticed, but rather that there's a transient error somewhere that breaks different parts of the system periodically. In this particular case, one of the nodes was out of disk space. Whenever a process landed on that node, it obviously didn't work, and that's why it happened periodically. But if you don't think about it that way, if you don't parse the report to really get to the root of the problem, you're going to go down the rabbit hole trying to fix SSO when there's nothing wrong with it, really.
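As an aside, a transient culprit like a full disk is cheap to rule out early, before you dive into SSO internals. Here is a minimal sketch of the kind of check you could run on each node; the mount points and the 90% threshold are placeholder assumptions, not anything from the actual incident:

```python
import shutil

# Mount points to check are placeholders; substitute your data and log mounts.
MOUNTS = ["/", "/var/log", "/data"]
THRESHOLD = 0.90  # flag any filesystem more than 90% full

def check_disk(mounts=MOUNTS, threshold=THRESHOLD):
    """Print any filesystem above the usage threshold."""
    for mount in mounts:
        usage = shutil.disk_usage(mount)
        used_fraction = (usage.total - usage.free) / usage.total
        if used_fraction > threshold:
            print(f"{mount}: {used_fraction:.0%} full, possible culprit")

if __name__ == "__main__":
    check_disk()
```

Run across every node, a check like this would have pointed at the real problem in seconds instead of hours of SSO archaeology.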
And once you get through all that, the third step, of course, is to fix the problem. Well, if it were that easy, this would be the shortest talk ever. Because at that point, even when you've isolated the problem and identified the problem, you're still left with a simple question. You know what the problem is, you've confirmed that it's broken, but you still don't know why. You need to really understand what is broken. For the next ten minutes or so I'll be talking about this word, understanding, because you really need to understand a lot of things when you're troubleshooting systems.

The first thing, as we talked about, is understanding the actual problem. I can't stress enough how important it is to understand the problem you're trying to solve before trying to solve it. So many times I hear something along these lines: we can't support 100 requests a minute, we need to scale better. Sounds reasonable? I mean, yes, scaling should solve that. Except 100 requests a minute is less than two requests a second. If you can't serve less than two requests a second, you don't need to scale; you need to improve performance. Those of you who don't know the difference between scale and performance, see me after this talk. Because if you try to scale that, you're just throwing money at the problem, literally throwing money at the problem, without actually fixing it.

For the more visual among you, let's look at this graph. There is a problem in there somewhere. Can anybody point it out? The obvious one: somebody pushed code, somebody did something that spiked the latency on load times. Except that's not the problem. If you look at the numbers (maybe they're too small to read from there), performance degraded from about 200 milliseconds to 600 milliseconds, a 200% increase, so it's pretty bad. But 600 milliseconds is still reasonable, right? It may be an acceptable risk. This is the actual problem: the performance continued to degrade over time and got to over a second at some point, which means that if it isn't throwing timeouts at users yet, it will be eventually. So what you actually need to fix is not necessarily the push that degraded performance, though reverting that may be a viable change, but the problem that keeps growing and growing and growing. Because that, at the end of the day, will affect the business itself; it will affect the user experience.

Which brings me to my second point: you need to understand the business. There's a quote I heard from a customer of mine many years ago that is still, by far, the best quote I've ever heard, and I use it in most of my talks because it applies to literally everything: business people don't care about technology at all; they care about technology supporting their business needs. Honestly, I could talk about understanding the business for hours, but I want to use the simplest example everybody is familiar with: the 404 error. Everybody's had those, everybody's suffered from those. What does it actually mean to you? Page not found, something broke, can't load the page, whatever it may be. But what does it mean to your business? If Amazon's homepage goes down, what does it mean to them? What does it affect the most? Anybody? Money, money, sales, right?
I mean, the widget that shows recommended products on Amazon generates something like 27% of their revenue, which is ridiculous considering how much money they make. So if the homepage goes down, the first priority should be to bring back that widget, because that is what generates the most revenue. But that's a simple example; let's take something a little more complicated. If the Wall Street Journal goes down, what does it affect? What is the main thing? Advertising. It is not content, even though they're a content company. It is the advertisement, the ad revenue. They serve 32 ads on the homepage alone. Thirty-two. So if their homepage goes down, perhaps their first goal is not to show users the freshest content; it is to bring the page back up, maybe with stale content, but with the advertisements up, because every second it's down they're losing money as visitors come to the site. So whenever you build something, whatever it may be, and even in a troubleshooting situation, keep in mind that every technical decision powers some sort of business. You need to understand the need in order to come up with the best solution, or even to know what the solution should be, or to prioritize solutions against that need.

Which brings me to my third point about understanding: you need to understand the impact. Every time you touch something, you need to understand the impact of the change. And again, the more complex the systems are, the more difficult it is to know what the impact is going to be. But speaking of impact and prioritizing, you need to be able to map the changes you're making to that business need. And sometimes that means breaking stuff. Legitimately breaking stuff. I had a situation, and this is an actual graph from it, where a company was getting hammered during the holidays, completely hammered: unexpected levels of traffic from a great promotion. Great for the company, but the load times were ridiculously high. At some point it became unbearable; people were dropping off, users were getting timeouts. So if you have a performance problem that you can't fix, what do you do? You hide it. That's what caching was made for. So I said, fine, we're going to put a layer of caching in front of it, we're going to hide the problem, we can deal with the latency later, let's do it. The problem was that, for one reason or another, the way their frontend proxy (Apache Traffic Server) was configured, it could not serve non-SSL pages properly. There were some rerouting rules; it was extremely complicated. And about 12% of the website was non-SSL, so effectively I couldn't just put the site behind the cache. But you know what, I put it behind the cache anyway, because I was willing to break 12% of the site in order for the other 88% to work and actually serve customers properly. Because time is literally money. On big projects I used to run a query showing how much money the company makes every minute, to show the developers what downtime would cost every time they broke something. It's an awful way to learn, but at least it gives you an actual perspective. It's not a theoretical "we're going to lose some money"; it is physical money that you're losing every minute the page is down.
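That revenue-per-minute query doesn't need to be anything fancy. Here is a minimal sketch of the kind of thing I mean, against a hypothetical orders table; all the table and column names are made up for illustration, and the real one ran against the company's production schema:

```python
import sqlite3  # stand-in here; the real query ran on the production database

# Table and column names are hypothetical, for illustration only.
QUERY = """
SELECT strftime('%Y-%m-%d %H:%M', created_at) AS minute,
       SUM(amount)                            AS revenue
FROM orders
WHERE created_at >= datetime('now', '-1 hour')
GROUP BY minute
ORDER BY minute;
"""

def revenue_per_minute(db_path="shop.db"):
    """Return (minute, revenue) rows for the last hour."""
    with sqlite3.connect(db_path) as conn:
        return conn.execute(QUERY).fetchall()

for minute, revenue in revenue_per_minute():
    print(f"{minute}  ${revenue:,.2f}")
```

Put a number like that on a dashboard next to the error graphs and "the site is down" stops being theoretical.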
And the funny thing is, marketers and business people knew this way before us. "80% now is better than 100% tomorrow": who's heard that mantra from marketers before? Just a few people, surprising. What it actually means is that marketers are usually willing to push something today, knowing full well it will be broken for 20% of users, and then incrementally fix it by tomorrow to reach 100%. The reason is that this way they get 80% of the revenue today and 100% tomorrow, versus fixing it until tomorrow and only getting 100% of the revenue starting tomorrow. It's a numbers game, which is very difficult for a lot of developers, and a lot of ops people, to accept.

And what I just described is incremental improvement, which is one of the other keys to troubleshooting and fixing problems. It is not black and white. More often than not, there are a lot of shades of gray between things being broken and things working optimally. If you look at the anatomy of a problem, it's a pretty standard curve for when something goes wrong. You have a norm, a standard threshold for how things should be, which is great. Then something happens, whether you pushed something to production that you shouldn't have, or more people came to the site, or there's a marketing campaign, and now you have a problem. You get paged in the middle of the night, you start working on it, you fix the problem, everything gets back to normal. What a lot of people forget, even though it's pretty common knowledge, is that there is also an acceptable threshold, which is usually quite a bit higher than the norm. Going back to the example I gave: 200 milliseconds is a fantastic load time, but is 400 milliseconds acceptable for a short period of time? Or 600? The answer is probably yes. As long as it doesn't majorly impact the user experience, it's probably acceptable. In that particular example, four different fixes went into production before latency actually got back to the norm. But after the first fix, the user experience was back at an acceptable level and the business was still operating as usual. So consider this: it doesn't have to be all or nothing. If you have a short-term fix or an incremental fix, ship it. That's why we talk about continuous deployment all the time: so you can push small incremental changes.

All right, so what have we learned so far? We need to understand what's important. We need to understand cause and effect, and as such, we need to understand what impacts the business most and prioritize based on that. And we need to take acceptable risks once in a while; we always have to assess the risk of pushing to production, of making the change, whatever it may be.

So that's all great. Now let's talk about something a little more fun: what not to do when you're troubleshooting, which is going to be significantly more ranty than what came before. First of all, like I said, don't assume. I mentioned it before and I'll probably mention it again in the next few minutes: don't assume you know what the problem is and what the solution is. Correlation does not equal causation. If you want to prevent people from drowning, don't try to stop Nicolas Cage from making movies. By the way, that's a fantastic website; if you want to go check it out, they've got a lot of awesome charts. But yeah, correlation does not necessarily mean causation. So think about what you're actually trying to solve.
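Going back to the norm-versus-acceptable-threshold curve for a moment, that distinction is easy to write down. Here is a minimal sketch using the 200 and 600 millisecond numbers from earlier as stand-ins; the triage buckets are my assumption for illustration, not a rule from the talk:

```python
NORM_MS = 200        # where the system usually sits
ACCEPTABLE_MS = 600  # users still get a tolerable experience below this
                     # (both values are illustrative, tune to your business)

def triage(p95_latency_ms: float) -> str:
    """Decide how urgent a latency reading actually is."""
    if p95_latency_ms <= NORM_MS:
        return "normal: nothing to do"
    if p95_latency_ms <= ACCEPTABLE_MS:
        return "degraded but acceptable: file a ticket, fix during the day"
    return "unacceptable: page someone now"

for reading in (180, 420, 950):
    print(reading, "->", triage(reading))
```

The point of encoding it is that the pager fires on the acceptable threshold, not on every wobble above the norm.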
Funny enough: don't trust the errors. Errors were written by humans, and humans are faulty. I had a situation, which is funny in retrospect. I got a call from my bank saying, there is something wrong with your account, please come in. Okay. So I come into the bank, and a really nice gentleman says, oh, there is a problem with your birth date. That's funny; I'm pretty sure I haven't changed it in at least forty years. He logs into the system, pulls up my profile, and goes, oh, I see it: your birth date doesn't match the one on the credit report. And again, that is really weird, because I'm pretty sure I haven't changed it. He keeps looking, and five minutes later: oh, I see what the problem is, we don't have your birth date on file. Now, as a developer who has written a lot of crappy code over many years, I understand how "null does not equal something" gets rendered as "doesn't match." But it didn't solve my problem; it didn't give the person reading the error the right information to give a proper answer. So don't blindly trust what the error tells you. I mean, don't.

On the same token, since we're talking about humans: troubleshooting is a stressful time, and it doesn't necessarily bring out the best in people, to say the least. When you troubleshoot with people, there are always one or two (hopefully none, but usually one or two) who tend to complain more than they try to fix things, and there are a couple of classic excuses that come up. This would be one of my favorites: "it's not documented." You know what, at three in the morning when everything is on fire, I really don't care. You know what you should do? Create a ticket, assign it to yourself, and work on the documentation first thing in the morning. Right now, read the damn code. Even better: "I didn't build it." Again, I don't care. I don't care who built it at this point; I only care about who's going to fix it. "It passed all the tests." How many people have heard that one? Literally everything is broken, and the answer is, I don't know why it's broken, it passed all the tests. Well, I've got bad news for you: perhaps you should revisit your test suite. And by far, by far my favorite one: "everything looks right." You know what? Clearly it's not. It's the same as "it works on my laptop."

Generally that answer comes from the fact that people give up quickly. People go in, they look at something, they check the logs and the web page, and they say, everything works, I don't know what's wrong with it. The pagers are blowing up, the graphs show all kinds of problems, but "everything looks fine." So don't stop troubleshooting just because everything looks fine at first glance. And on the same token, conversely, don't cling to a mistake just because you spent a lot of time making it. A lot of people tend to be very protective of their own code, for example. If they spent time on the code, pushed it out, and it broke horribly, they will go out of their way to try to prove that it's not their code that's broken, which is awful, because you're wasting cycles, you're wasting time, and in the meantime the whole business is down.
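Back to that bank error for a second, because the fix is cheap in code: distinguish "we don't have the value" from "the value doesn't match." A minimal sketch; the function and field names here are hypothetical, not from any real banking system:

```python
from typing import Optional

def compare_birth_date(on_file: Optional[str], from_bureau: str) -> str:
    """Produce an error that tells the operator what is actually wrong."""
    if on_file is None:
        # The lazy version collapses this case into "doesn't match"
        # and costs somebody five confused minutes at the bank.
        return "birth date is missing from the customer profile"
    if on_file != from_bureau:
        return (f"birth date on file ({on_file}) does not match "
                f"credit report ({from_bureau})")
    return "birth date OK"

print(compare_birth_date(None, "1975-03-02"))
print(compare_birth_date("1975-03-02", "1975-03-02"))
```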
On the flip side of this, people sometimes think they have the solution, and they keep working on it even after the evidence says it might not be the solution; they're very hesitant to just drop it and go in a different direction. So don't be afraid to change direction. You're there to solve the problem, not to feed your ego. That's the whole goal: you're there to solve the problem. You can argue about it later, argue about the documentation, whose fault it is; have a post-mortem, have a retrospective. Three in the morning is not the time to do it. And one thing people always forget: don't be a hero. It's a team sport; ask for help. If you're out of your element, or you think you need a second pair of eyes, ask for help. Generally people are very receptive to that kind of thing.

All right, so one of the things I want to talk about a little bit is tools. I'm not going to talk about particular products or whatever it may be; I'd rather talk about the types of tools you should be familiar with that generally help with troubleshooting. There are three of them: logging, monitoring, and profiling, which can probably all be wrapped up under the term observability.

So, logging. Logging should be actionable, it should be concise, and it should be parsable. Who's familiar with log levels? Awesome. Who has all of those levels enabled in production right now? Fewer people, but still: stop doing that. The only levels you should print in production are error and fatal. There are a couple of reasons for that. The first, of course, is readability. When you need to go through the logs (not machine parsing, but actually reading them), you need to be able to find the information you're looking for quickly. As an example, this is a log entry for a single request from one of the running processes. And it's fantastic, it's awesome: it has the UIDs, it has all the data we're getting, it has every step that completed, it has the timing. If I read through it, I get a full picture; I can make sure every step completed properly. Except when I'm troubleshooting something, when I'm trying to find a root cause, this is all I care about: the ID and the error. Everything else I can pull from other places. That's much easier to digest and search; chatty logs are never easy to parse and read. And another thing a lot of people don't think about: verbosity is expensive. Logs are expensive. That one chatty log entry is 2K. Do the math: 2K per request, at 100 requests a second, over 60 seconds, multiplied across however many web servers you run, and you're accumulating megs upon megs of logs every minute. So don't do that. Be selective with your logs, and make sure you can extract the information you need when you need it.

Monitoring is the opposite: you should monitor everything. You really should collect data on everything. And by everything I mean business, I mean marketing, I mean systems, I mean APIs, anything you can imagine collecting information on. I'm a huge, huge proponent of business-first monitoring, where the most important metrics are business-related and everything else exists to support them. And as such, you need to be able to correlate everything back to the business metrics.
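To make the logging point concrete, here is a minimal sketch of "only error and fatal in production" plus short, parsable entries. The environment-variable convention and field names are my assumptions for illustration, not the setup from the talk:

```python
import json
import logging
import os

# In production, suppress everything below ERROR; elsewhere, log it all.
# The ENV variable convention here is an assumption for illustration.
level = logging.ERROR if os.getenv("ENV") == "production" else logging.DEBUG
logging.basicConfig(level=level, format="%(message)s")
log = logging.getLogger("app")

def log_failure(request_id: str, error: str) -> None:
    """One short, parsable line: the ID and the error, nothing else."""
    log.error(json.dumps({"request_id": request_id, "error": error}))

log.debug("step 3 of 7 completed")           # dropped in production
log_failure("req-8c41", "upstream timeout")  # the line you grep for at 3 a.m.
```

And the cost math from a moment ago: at 2K per chatty entry and 100 requests a second, that's roughly 12 megabytes of logs a minute per server; the one-line version above is a small fraction of that.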
So, on the monitoring side, let me give you an example. I had a customer call (it always starts with a call) telling me, we have a problem. Okay, what's the problem? Well, our revenue is down. Okay. Something is wrong with the system. All right, let's take a look. Luckily we had a metric for revenue, and if you look at the graph, there's clearly a dip in it. So I start looking for anything wrong with the system. We look at user-facing performance: maybe there's latency, maybe users can't check out, whatever it may be. Looks pretty normal. We go through a couple of other things. The database is generally a performance factor; we look at the database. Again, pretty normal, everything. And those are just a few; you go through dozens and dozens of different metrics trying to figure out if anything is wrong. Everything looks fine. I forgot to say: the company itself is a marketing company, in online entertainment. They had about 100 million users and sent about 70 million emails a day, because as a marketing company they run different campaigns, promotions, all that stuff. So we keep going down the chain, and everything looks fine. At some point I'm saying, maybe you should go talk to your sales people, because from where I'm sitting everything seems to be in order. But we keep digging a little more, and finally we get to the email metrics, which we also happened to be collecting. And it turns out that one of the major email providers (Yahoo, or whoever it was, I forget) had accidentally blacklisted them. Which means fewer people got the email, fewer people clicked on the link, fewer people went to the website, fewer people bought stuff: less revenue. So remember how I said correlation does not equal causation? Don't underestimate correlation either. We had a whole conversation about this in the open spaces yesterday: make sure you're able to correlate the business metrics, the ones that are important, with the metrics you collect that support them, like system metrics, user metrics, performance metrics.

And finally, I want to talk a little bit about profiling. Profiling is for when you have the what, but you still have no idea why it's broken. You know what's broken, you know where it's broken, but you still don't know why. There are a number of tools for this: your prstats and tops, your Valgrind and Cachegrind, and a number of different tracing tools like DTrace and strace. So let me give you one last example. I had a situation where a service was running out of CPU very quickly; the cycles were spinning, it was really, really bad. Just looking at the process distribution, it was clearly Apache eating the CPU. The problem was, I had no idea which request was responsible: was it a particular page, a particular process, what was it? So a colleague of mine wrote a quick DTrace script that gave me this list: it parsed out all the requests and gave me CPU consumption per URL, which is amazing. Because looking at it, there was clearly one request, the "get all" URL, eating significantly more CPU than everything else, by a factor of 10 if not 100 over normal pages.
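I can't reproduce the DTrace script here (it's linked from the slides), but the shape of the idea is easy to sketch: aggregate CPU per URL and sort. In this Python sketch, the records and URL names are made up for illustration; in the real case, the per-request CPU numbers came from DTrace probes attached to the Apache processes:

```python
from collections import defaultdict

# Hypothetical per-request records: (url, cpu_ms). In the real incident these
# came out of a DTrace script attached to the HTTP server processes.
requests = [
    ("/get_all", 412.0), ("/home", 3.1), ("/get_all", 398.5),
    ("/login", 5.6), ("/home", 2.9), ("/get_all", 405.2),
]

cpu_by_url = defaultdict(float)
for url, cpu_ms in requests:
    cpu_by_url[url] += cpu_ms

# Worst offenders first: the factor-of-100 page jumps right out.
for url, total in sorted(cpu_by_url.items(), key=lambda kv: -kv[1]):
    print(f"{total:10.1f} ms  {url}")
```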
I mean, fantastic may be the wrong word, but I was excited to get this information, because with it I could go through the code of that particular page, troubleshoot it directly, and provide immediate help. The other problem is that once you get this information, you go down the rabbit hole of troubleshooting the page itself, but that's a different problem.

All right, so: troubleshooting is a required skill. I think we'll agree on this. I hope we'll agree on this, because otherwise we're in trouble. It is educational: you learn a lot of things as you troubleshoot systems. A lot. Make it iterative: you can't fix everything at the same time, but you can make continuous improvements to the system. It is extremely frustrating: every time you have to troubleshoot something like any of these examples, it is annoying. But it is also extremely rewarding, because at the end of it you can say, hey, I solved the problem in a really creative way. All right, that is all from me. Hopefully it was helpful. Thank you.

Q: I'm really intrigued by that CPU-per-URL example. Can you talk a little more about that?

I can. It's a DTrace script that attaches to the probes. We were running on illumos-based systems, so we had full DTrace support. The code is actually on the slides; I'll publish the slides, and there's a GitHub link. It's a little script that attaches to the HTTP processes, pulls out all the information, and then we post-process it and show it in a digestible format. It's really simple, actually. It's pretty awesome.

Q: Are there such things as design patterns for debugging? Things that might be common from case to case that you find you can reuse?

I guess the short answer is no, because you never know what the problem is. I think there are best practices: what to do, some of the things I talked about. The general approach is what I described: validate, replicate, isolate, and then try to fix it. And when you get to the part of actually fixing it, it's going to be different every time, because that's troubleshooting: you can never estimate it, you can never predict it, you don't know what it's going to be. Like I said, up to the point where I knew it was a CPU problem, you could outline what needed to be done. Once we got to the point where the CPU problem was coming from Apache, it was anybody's game, because you have to dig much deeper to figure out what it is.

Q: You mentioned not giving up when you're troubleshooting, and I wonder, when is a good time to give up and say, oh well, we have the next version coming, or this might be fixed in a week? Do you look at the business or do you look at the software?

Those are actually two different questions. The question of whether you can look at it in a week or look at it in the morning should be asked regardless, because there's a whole conversation about alerting and monitoring here: you should never alert on something you don't have an immediately actionable response for. If you're waking up in the middle of the night and you can't do anything about it, why did you get woken up in the middle of the night? The question of whether you can fix it later, or pass it on to somebody, is generally a business decision.
So you go in, and if something is completely broken, you have to come up with a solution. What that solution is depends on your business needs. If you're a media site, for example, like the one I talked about: is showing stale content an acceptable short-term solution? Does it meet the need? Maybe the answer is yes; then you bring up the older content somehow and go figure out what's wrong. Maybe the answer is no. If the answer is no, then again you try to fix it as soon as possible, but when you realize it's going to take you, let's say, hours if not days, you go back to the business and say: you no longer have option A; here are options B, C, and D to choose from. So it's case by case, but it is based on the business. It really is based on the business. And like I said, sometimes fixing is breaking. Sometimes you need to break something, or disable a portion of the functionality, which is actually really common when a particular piece of functionality breaks everything. In my experience, it's very common to just turn it off. And that functionality can even be critical: it could be ads, it could be the CTAs on the page that surface the new content, it could be login. Many sites nowadays, even when they do planned maintenance, turn off anything write-related, like logins. So it depends on the situation, but it should come from the business side of things.

So the question is: when do you decide what's an acceptable risk and what's an acceptable norm when you're troubleshooting? When do you decide you've incrementally improved things enough that they're acceptable, versus pushing all the way back to the optimized baseline? It actually goes back to the previous question: it's based on the business, on what the business considers acceptable risk. Ideally you have those thresholds defined beforehand. Ideally you understand the business well enough to either make the determination yourself or have somebody from the business make it for you. Basically, you lay out the options and say: hey, we've brought it down to an acceptable level. The page doesn't time out for your customers anymore; it loads slowly, some images don't load, but it does load. We've brought it to a stable state. To get it to a 100% stable state, because of X, Y, and Z, will take us probably 24 hours, and I need my people to sleep now, because otherwise it's going to be 48 hours; nobody works well without sleep. And you present those arguments. Ideally, though, you have enough information about the business to make a lot of those decisions yourself. Obviously not all of them: you can't decide what's more impactful, turning off ads or turning off content, unless you know the business. So some of those decisions have to be made by the business, but some may be purely technical. If you see database load or CPU at 80%, that's probably wrong; but if you can get it down to 65-ish, do you think it can survive through the morning, purely technically? If the answer is yes, you can call it a night, get some sleep, and fix it properly in the morning, that kind of thing. So in that example where I showed the CPU stuff: what if I fixed just the biggest offender, the one that was a factor of 100 over the norm?
If I fixed that, would it bring CPU consumption down to a level where it might not be what it was before, but pages would no longer time out or take 12 seconds to load? If the answer is yes, then fine: I can go through the rest of the pages and optimize them tomorrow, next week, whenever, and get them done over time. So it depends on what the problem is.

Q: You've done a really great job talking about troubleshooting as an individual: things that make an individual more efficient and effective, ways to do logging that help an individual. Can you speak a little about what can be done around team troubleshooting? Sometimes you're in a room with a bunch of other people. What are effective things you can do as a team to improve your troubleshooting?

Actually, the same concepts apply. When I troubleshoot with a team (as long as nobody's whining, like I talked about), I generally try to distribute the areas of search. It also depends on the skill sets of the team. I have a development background, so I'm most comfortable troubleshooting top to bottom, not bottom to top. So if I'm working with somebody from the operations side, or with a heavy operations background, I'll have them research the CPU stuff while I look at the code. If I'm working with somebody with a similar skill set, or a group of people with similar skill sets, I'll distribute the areas: say, I think we have potential problem vectors in these three areas; let's each investigate our own, but keep communicating. Chat, sit in the same room, whatever it may be, but communicate continuously, because every time somebody finds something, it may affect what the other people are doing. That continuous communication really helps; it helps ridiculously well, actually. Because somebody says, hey, look at this, it's something that might affect you, and you look at your graphs and say, oh, things just improved by 20%, I'm not really sure what you did, but thank you. So yeah, pretty standard things, nothing extraordinary I can recommend: just talk to each other, be nice to each other.

Awesome. Leon, thanks so much, this was great.