Hey, good afternoon. Really happy to have you all here, especially at the last session of the conference. I think the DevOps track saved the best for last. So my name's Amin. I'm here to present Infrastructure Troubleshooting: Secrets Revealed. So let's get started. So who am I and why am I qualified to talk about this stuff? So yeah, I'm Amin Astaneh. I'm the Senior Manager of Site Reliability Engineering at Acquia. I've been at the company for seven years. For the first five of those years, I served on the operations team, which is responsible for all the infrastructure that keeps our customer sites alive. I was on call on that team countless times. I've been paged countless times. And after getting paged, I would end up contributing to incident response tooling and process to make the pain significantly less. And then over the past two years, I built Site Reliability Engineering. In Vienna, I did a talk about building out that team. And over the past two years as well, I've been building out an internal DevOps process and initiative to improve quality of life for the company.

So we have an agenda. There's a lot of content to get through, but this is pretty much what we're going to talk about. We're going to have a brief icebreaker. We're going to talk about a process called the USE Method, which is pretty cool. We're going to talk about hardware and software and their relationship to the performance of Drupal websites. We're going to do something very cryptic called process introspection, which is actually pretty cool when we get to it. And then finally, we're going to take all of this information, wrap it up, and talk about outage scenarios. So we're going to apply what we just learned.

So it's 3 AM. You're sleeping. All is well with the world. And then you get a phone call from your boss. The website is down, the boss says. I don't know what to do.
So you wake up, you wipe the sleep out of your eyes, you get in front of the computer, you SSH in to one of your servers, and you're like, where do I start? So raise your hand if you can identify with this poor dog in space right now. Wonderful. This talk is for you. So here are the goals. Together, we're going to gain a basic understanding of the infrastructure, very basic stuff. We're going to learn a very simple set of processes and tooling to gather information about what's going on under the covers, and then we're going to learn how you can use those tools to identify pain points in your infrastructure in terms of performance. So we are taking the five years of experience I had in operations, and I'm cramming it into one hour. So I understand there's a lot of content here. I will provide you the slides; I'm going to upload them after the presentation, so you don't have to take copious amounts of notes. Another thing: if you're kind of like me and myopic, you might want to sit towards the front, because there's some command output that you might want to see.

OK. So let's begin. I'm sure all of you know people that are system administrators, or systems engineers, or DevOps professionals, or whatever they call them. And you might actually have some misconceptions about these types of people. Some of you might think of them as wizards, like they have some magical, esoteric, arcane knowledge about how this stuff works, and they do one thing and leave, and then everything just works like magic. Or you might think that they are more machine than human, and they somehow are able to talk to the computer in its native language and fix things and understand things. This is not the truth at all. The first secret, the big secret, that I wanted to share with you is that these types of people are human. And what's really happening behind the scenes is that they have tools that they know about. They have processes that they use.
And their experience has allowed them to decipher patterns, which allow them to create heuristics, things that they do to quickly troubleshoot issues. And these are all things you can learn as well. So, some housekeeping. We're assuming a few things: these are GNU/Linux systems; this is a LAMP stack; the audience member has some basics in CLI, they know what a command shell is, they've run commands in it before; and you have SSH access to your infrastructure, or at least access to your VM for those that are playing along in the audience.

So to really get into things, let's talk about something called the USE Method. Well, what is the USE Method? Brendan Gregg is a performance engineer at Netflix. I'm a big fanboy of this person. He is very, very smart, very knowledgeable about systems. He wrote a gigantic tome this thick about performance, and he created the USE Method to teach others how to solve performance problems quickly. He wanted it to be simple, straightforward, complete, and fast. And that's the link to his website that you can look at later. Well, the USE Method is simply this: for every resource, check utilization, saturation, and errors. Now, let's go over the definitions of each of those terms so we completely understand what the USE Method means.

So what is a resource? Well, a resource can be several things. It can be the physical hardware components that are in a computer, like CPU, memory, storage, and network. It can be software components: for example, the PHP process pool for FPM, or the InnoDB buffer pool for MySQL, or the Varnish cache. These are things that are consumable, so those are resources that we would need to care about. And then finally, there are also OS components, like the maximum number of processes the system can have, the maximum number of open files the system can have, maximum connections.
We're not going to go into OS components, because when you get into those things, the issues you encounter are usually pretty esoteric. We're going to stick to physical server and software components today. So what is utilization? Utilization is the average time that a resource is busy doing work, actually doing its purpose. And the way that we tend to speak about utilization is as a percentage over an interval of time. So for example: 75% of memory was being used on a server over the last five seconds. You also have saturation, which is where the resource is all used up, there's extra work, and the extra work is being queued. For example, the queue wait value in the Drupal request log starts to increase because all of the PHP processes are handling requests. So you can observe these things in tools, as well as in logs and error messages. So when you're at the supermarket checkout and you're waiting to get checked out, if there are more people than cashiers, you can say the resource of the cashiers is saturated, and now you have to wait. That's what saturation pretty much is in the real world. And then finally, errors, which is the total number of events showing that a component or a resource is not functioning as it's supposed to: error events. So for example, if I tried to access a file, I could get an input/output error because something is wrong with my hard disk. You can also observe these types of things sometimes through commands, but usually through logs and error messages displayed to you in your application.

So that's the USE Method. Now let's talk about hardware resources as they pertain to the USE Method. There are four main ones we're gonna talk about today. You're gonna find these on almost any system, any computer, and really, when you're dealing with performance problems, these are the four basic areas where you're gonna encounter issues:
CPU, memory, storage (how much capacity you can store, as well as how fast you can read and write), and network I/O, how fast you can read and write to the network. So, how many of you know about top? How many of you are experts on top and everything that it says, ever? Okay, it's good that you admit that, because I don't know 100% of it either. So I think it's really important, when you're starting out and you're getting to understand some of the tools that allow you to gather metrics on infrastructure, that you start with the simple, basic tools that display the key metrics first, before you bring up the big old dashboards. Because then, when you start using those tools, you have a firmer grasp on what you're seeing.

So let's start with CPU. Now, there are actually many ways that the CPU can be utilized on a system, but we're only gonna talk about four, because those are the big four that you most commonly see. The first one is CPU user. That's the normal one that you would usually see, which is the time spent executing applications, crunching numbers. Drupal would be an example of this. The second one is system. This is time spent in the kernel. And in general, from my experience at Acquia doing operations, big spikes in system time tend to relate to talking to the network devices. You also have iowait, which is time spent by the CPU just waiting for a storage device to return back from a request, like a massive read or a massive write: I'm waiting for that request to come back, I'm stuck in iowait. And then finally idle, which is the time the processor spends doing nothing. So it is reasonable to say that when idle is at 0%, the CPU is saturated. You can observe these metrics, excuse me, in aggregate, so across all of the cores on your system, or on a per-core basis, which is important when considering single-threaded applications, but that isn't very common. So how do you measure CPU?
Well, there are a lot of tools, but the favorite one that I tend to go to is dstat. In particular for CPU, dstat -c. It's recent, it outputs in color, which is really nice. There's also mpstat, which is an older tool that has similar output, but it's not colorized. When you start getting into the dashboard-type tools, there's htop, also colorized, very pretty, nice output. top is the classic; many of us know about that. And then there's atop, which I like to at least mention because it supports process accounting: you can gather what processes are running over time. But we're gonna talk about dstat for now. So, can you guys read this okay? Excellent. So, a little exercise, not really a quiz or anything, but just take a look at these and let's think about what could be happening on this system when these things are happening. So you can see on the top left there's a lot going on in iowait. Okay, I wonder what might be happening on the disks. Top right, you can see spikes in system time. Hmm, that might have something to do with network. You can see on the bottom left nothing's happening; this machine is completely idle. And then finally, on the bottom right, CPU user has almost used up the entire system. It looks like it's crunching a lot of numbers. So what's happening? Well, this is what was taking place during those measurements. So it's really cool that you can look at CPU metrics and guess, with just that data, what could be happening, which is pretty powerful.

So discussions of CPU would not be complete without talking about load averages. Tools like uptime and top display load averages, which are usually three values. It's in essence the number of processes, the number of programs, that are gonna be competing for time to run on the CPU, and we look at it over one-minute, five-minute, and 15-minute averages.
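If you want those three values without a full-screen tool, they live in a plain file, /proc/loadavg; a quick sketch of reading them, along with the core count you'll want for comparison:

```shell
# The three load averages come straight from /proc/loadavg
# (the first three fields); nproc, from coreutils, gives the
# core count to compare them against.
cut -d' ' -f1-3 /proc/loadavg
nproc --all
```

This is exactly what uptime and top are reading on your behalf.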
So a general rule of thumb that I use when I'm looking at performance issues and capacity planning is: if the load average is greater than the number of cores on the system, then you're usually dealing with a saturation issue. And you can easily find the number of cores with a command called nproc, run as nproc --all. So you can see my load average on my laptop when I made this was 1.62. The number of cores I have is four. So I'm not dealing with saturation on this system at this time.

So that's CPU. Let's move on to memory. Memory, pretty straightforward, otherwise known as RAM. You use it to store temporary data for running applications. You can check its utilization with a command called free -m, which is the go-to when figuring out how much memory is in use. Now, the big metric that you're gonna care about, on the right, is available. That's how much you have available for new programs to run. But there are some other attributes in the output that are interesting as well. Used is how much is actually being used by programs. Shared is the memory shared between processes, like IPC and things like that. Buffers are used for writes and reads to devices, so that those devices are not starved for data. And then cache. Cache is very interesting. When you read files in Linux, Linux will cache them in RAM for you, so subsequent reads are fast. It's not taking that RAM away from the pool for new processes, however. Linux will happily free up that memory back to you if you're going to run something that uses a considerable amount of memory. But yeah, in this output, what is my utilization, and how much do I have available for new processes? Available, on the right, is what you need. Now, on older systems you do see output that looks like this, and you don't see the available column there on the right-hand side. So what you want is the free column in the -/+ buffers/cache row.
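Either way, the number you want can be pulled out with a one-liner. A sketch, assuming the modern procps-ng layout where "available" is the seventh field; the sample output below is made up, and on a real box you'd pipe free -m itself:

```shell
# Sketch: grab the 'available' column from free -m output. The sample is
# fabricated; field position assumes modern procps-ng free. On older
# versions you'd read the '-/+ buffers/cache' row instead.
sample="              total        used        free      shared  buff/cache   available
Mem:           7977        2214         412         330        5350        5432
Swap:          8191           0        8191"

echo "$sample" | awk '/^Mem:/ {print "available for new processes: " $7 " MB"}'
# prints: available for new processes: 5432 MB
```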
Because, again, buffers and cache can be freed up by Linux at any time in order to run new applications. And there is a hilarious website called linuxatemyram.com which explains all of this for you. That way, when someone comes up and says, oh my goodness, free memory is almost at zero, what's happening? You can tell them it's okay. There's this thing called cached memory. It's awesome.

So what happens when you start to run out of memory on your system? The first big thing is something called swapping. When configured, you can set aside a portion of your hard disk to kind of expand the RAM capacity a bit, with either a partition or a file. And what happens is, as you start to use up all your RAM, the Linux kernel will happily copy the contents of RAM that aren't being used onto the disk, onto the swap, which is great. The only problem with this is that hard disks are orders of magnitude slower than RAM. So that process of copying stuff back and forth, as you continue to eat up all of the available memory and the swap, they call that thrashing, and it's really bad for performance. You can check if swap is being used, again, with free -m, as in the previous example. And you can see right now that I have eight gigs of swap, but I'm not using any at the moment, which is a good thing.

Now, what happens when you completely run out of memory? Like, you can't give any more. There is a subsystem of the Linux kernel called the OOM killer, and it will run when you run out of memory. It has its own little algorithm, with scoring and weighting depending on how it's configured, and it will start killing off stuff to free up memory. Now, you can check to see if this took place by looking at the kernel log, whose location can depend on the distribution of Linux you're running; but you can also just run dmesg, which gives you access to the kernel ring buffer, which is pretty much a log.
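A hypothetical check for OOM-killer activity might look like the following. The log line is paraphrased (exact kernel wording varies by version), and it stands in for real dmesg output so the example is self-contained:

```shell
# Sketch: scan kernel messages for the OOM killer. The sample line is
# made up; on a live system you'd run:  dmesg | grep -i "out of memory"
printf '%s\n' \
  "[8675309.042] Out of memory: Killed process 4321 (mysqld)" |
grep -i "out of memory"
```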
And you can see here in this example, the kernel decided it was gonna kill off MySQL, which is a really bad thing. So this is why out-of-memory conditions on your system need to be avoided at all costs, because you do not want MySQL taken out. So that's memory, pretty straightforward, right?

So now we're gonna talk about storage, disk storage. This is pretty straightforward; I think a lot of us probably know this tool, because we don't wanna run out of storage on our laptops and computers. So in order to measure utilization, you run df. I like df -m because it prints in megabytes, and that's easy to reason about. So you can see on the right here, 19% and 24% and 19%, these are actually the same thing, but whatever. So I'm not full. That's great, right? And when this is 100%, you're saturated. That is also pretty straightforward. Or is it? So there is a little, not-so-well-known secret that I will share with you now. You don't wanna just measure how much data you're storing on your storage. You also need to know the number of inodes that are on your file system. Inodes are basically a data structure that stores information about a file, and when you format a disk, that determines how many inodes you can store. So in this example here, I try to create a file and it says "no space left on device", and I'm asking the system, well, how much storage do I have available? Well, I have quite a bit here, but I have no inodes left, and this is what happens. And you can't change that number once you format the disk or the partition. So when you do get those types of messages, check the number of inodes; that might be your problem.

So we talked about disk storage, very straightforward. Let's talk about I/O. We're talking about read and write operations to disk, which is actually pretty important. In order for you to get information about disk I/O, there's only one command you need.
At least there's only one command I've ever used in the seven years at Acquia, which is iostat. And I use these flags, -mxt 1, which says: in megabytes, extended statistics, with a timestamp, once a second. So let's talk about the output here, because there's quite a bit going on, but I'll just show you the ones that you care about. So you have the read and write megabytes per second. You can see that I'm writing 154 megs per second in that instant to the disk, so we're doing quite a bit. You can also see the await value, and then the r_await and w_await values. So this is the amount of time, in milliseconds, it takes on average for operations against the disk to complete, which is really useful. You also have %util, the utilization, on the right, but don't trust it all the way; especially for things like SSDs, that number gets skewed. So, rule of thumb, at least on my infrastructure: when this value, the await, starts shooting up through the roof, that's a saturation metric. And at least from my experience, if you're seeing values greater than 1,000 milliseconds, one full second, that's a problem, because you don't wanna wait an entire second for an operation to complete.

So that's disk. We're actually plowing through this pretty fast, which is awesome. Let's talk about network. So most systems, most standard computers, have gigabit on them. And you can check on Linux what the theoretical maximum is by running ethtool and checking for the speed. And you can see that, okay, this is a gigabit card, 1,000 megabits per second, great. Now, there's a tool where you can actually observe per-second data rates from all of the network interfaces. The tool that we use in operations is Bandwidth Monitor NG, or bwm-ng, and you can see the command we run. It runs once a second, and it will happily display the total throughput, and then the receives and the transmits.
So you can see that at that moment it was taking 1.1 megabits per second... no, 11 megabits per second, which is only 1.1% of a gigabit. So it's sleeping. So that was hardware, not too bad, right? You with me so far, anyone sleeping? Okay, good.

So now we're gonna talk about software resources. Now, you're like, well, how does that work? Believe it or not, all the software services that you would use to construct a Drupal website, the LAMP stack, maybe things like Varnish, have some form of tunable resources that introduce a constraint, an upper bound, on what they can do. Which makes sense, because you don't have an infinite budget, therefore you don't have a computer with infinite processing power, capacity, storage, memory, what have you. So usually they come in the form of a few things. You have process pools, like PHP-FPM, where only so many processes are running or can run; you have connection limits, so Apache has connection limits; and then memory allocations, like PHP's memory_limit. So let's go over the common ones, and we can talk about how to detect saturation of each, because it's actually really important.

So PHP's memory_limit is a very, very common one, and I'm sure a good portion of you in this room have seen and encountered this before. So you tune it, and it defines the amount of memory that a PHP process can use. If you saturate it, use it up, the program exits uncleanly and you usually get a 500 error. So you can actually see that in the logs. So you can see: oh yeah, I had an error, allowed memory size exhausted. I guess this is 128 megs; I ran out here. So that's how you detect memory saturation. Moving on to PHP-FPM, there is a parameter called pm.max_children, which defines, okay, this is the number of concurrent PHP processes that I will run at a given time. So that equates to the number of client backend requests that can be handled at once.
For those that are still running mod_fcgid, or remember the times when they ran it, the max processes per class setting is the equivalent parameter. And you can detect saturation by messages that look like this for FPM: server reached pm.max_children setting, consider raising it. So when you're hitting that, what's happening is, just like in the supermarket checkout, all the PHP processes are handling requests, there are more requests coming in than it can handle, it's used all of them, so you're hitting saturation. You also have MySQL's max_connections, which is basically the number of concurrent MySQL sessions that can happen on the system. And when you run out of those, you end up with the "too many connections" error. So, all very straightforward; I'm sure a good number of you have seen these in the past. You also have Apache's MaxRequestWorkers. Those are the number of simultaneous requests that Apache will handle, and you get similar messages about reaching the MaxRequestWorkers setting.

Okay, now we have the InnoDB buffer pool size for MySQL. So MySQL has this big cache called the InnoDB buffer pool, and I'm assuming, hopefully, that all of you are using InnoDB rather than MyISAM for Drupal. So the buffer pool is this big cache, and as you read things, as you do selects, the buffer pool gets filled up, so that subsequent reads are served from memory, which is a wonderful and lovely thing. Now, if you're reading a whole lot of stuff in the database and the cache is too small, well, objects have to be evicted from the cache for the new stuff to come in. So there is a way to check for that, which is Innodb_buffer_pool_wait_free. That's a counter that counts the number of times MySQL had to wait for pages in the buffer pool to be flushed to disk to make room for new reads. So if you run this command a bunch of times and you see that value just increase and increase and increase, that means your buffer pool is too small and you need to increase it.
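Since Innodb_buffer_pool_wait_free is a counter, what matters is whether it climbs between samples, not its absolute value. A sketch of that sample-twice-and-compare pattern, with made-up values; on a real server you'd get them from SHOW GLOBAL STATUS a few seconds apart:

```shell
# Sketch of the sample-twice pattern for any saturation counter.
# The values are made up; live, you'd run something like
#   mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_wait_free'"
# twice, a few seconds apart, and compare.
first=1042
second=1167
if [ "$second" -gt "$first" ]; then
    echo "counter rising ($first -> $second): buffer pool likely too small"
else
    echo "counter flat: buffer pool size looks okay"
fi
```

The same pattern works for any of the counters in this talk that only ever go up.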
Same thing for Varnish. How many of you use Varnish? Look at that, a number of you, excellent. So for those that aren't aware of Varnish, it's really cool. You put it in front of your Drupal infrastructure and it will cache the pages that Drupal renders for you, based on the headers and how long you want the HTTP responses to last in the cache. And that's great, because the next time someone comes around and wants to load your page, you're serving it from memory rather than doing an entire Drupal bootstrap. So the way that you check for saturation of Varnish is, when you run varnishstat, there is a metric called n_lru_nuked, which is very similar to what we just talked about with the InnoDB buffer pool. So when you have a full cache and you have to store another object, Varnish uses the LRU algorithm, least recently used, and it nukes an object, throws it out, so it has room for other things to be stored. So if that counter continues to increase and increase, it means you're continuing to evict objects from memory, which is an indication that your Varnish cache is too small.

Now, I've been talking quite a bit about various tunables and various things that you can change. Resist the temptation to just go and blindly change those settings, and there's a couple of concrete reasons why. So for example, pm.max_children, the maximum number of processes that PHP-FPM can execute: if you just go and bump it up, like double it, what you've done is you've doubled the amount of memory that you have to have in order for all of those processes to run. And if you don't have that available memory on the system, you will run out of memory, you'll have the OOM condition, and then things get killed off, and then your site is down. Acquia internally actually has a mechanism that figures that all out for you so you don't have to think about it.
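That pm.max_children warning comes down to simple arithmetic. A sketch with made-up numbers; measure your real per-worker footprint with ps or top, and your real available memory with free -m, before trusting anything like this:

```shell
# Back-of-the-envelope check before raising pm.max_children.
# All three numbers are illustrative assumptions, not measurements.
children=40        # proposed pm.max_children
per_proc_mb=60     # assumed resident size of one PHP-FPM worker, in MB
avail_mb=2000      # assumed 'available' memory on the box, in MB

need_mb=$((children * per_proc_mb))
echo "workers could need ${need_mb} MB; box has ${avail_mb} MB available"
if [ "$need_mb" -ge "$avail_mb" ]; then
    echo "that setting risks an OOM condition"
fi
```

With these numbers, 40 workers at 60 MB each want 2400 MB against 2000 MB available, so the doubled setting would invite the OOM killer.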
Same thing with ini_set, custom ini_set calls. That's not good, because if you go and crank something like the memory limit up to a gig with ini_set, and you're not thinking about the infrastructure it's running on, you can take your site down and run out of memory on the server. So that's the software resources.

Now we're gonna get into something really, really interesting and fun. It's called process introspection. And to kick this off, we're gonna get a little scene from one of my favorite movies, Hackers. Let me turn the volume up. Let me pause that and turn the volume up so you can actually hear what's going on. My apologies. [The clip plays: "I've narrowed the activity to terminal 23. Let's echo 23, see what's up."] So wouldn't that be cool, if you were able to watch what's going on in your system to the degree that you know who's on it, what files are being accessed, what network connections are happening? Wouldn't it be great to have that kind of power? Well, you can do this. The problem is, it doesn't look as cool as when The Plague does it, I'm sorry. But I will share with you how to do this very thing, and you can feel that thrill of doing this awesome hacker stuff.

So allow me to introduce to you one of my favorite tools. It's called strace, which is a system call tracer. What you do with it is you attach it to running programs, and it will tell you in real time what the program is doing in terms of system calls. Now, what's a system call? A system call is basically what happens when an application needs to read a file or talk to the network. It has to go through the kernel, right? So it makes requests of the kernel through a system call, and then the kernel does all the things in the background to give you your file, your network connections, and so on.
Now, do keep in mind, for those in the audience that would run these tools live in production, as we do: this does slow down execution. So strace is really good for troubleshooting and debugging, but don't just leave it running, because it will slow stuff down. Now, here's a basic example. We're gonna slowly get into the water here. This is a chunk of output from strace where we are printing out the contents of the file /dev/null. Of course, that file doesn't have anything in it. That's cool. So we're gonna go over, step by step, what's happening. On the top here, we are opening the file. The syscall open is being run with the parameter /dev/null, opening it read-only, and it spits out a number. That number is what we call a file descriptor. Now, you can see that subsequent system calls are being made against this file descriptor. So you can see fstat on 3, fadvise64 on 3, then read on 3. This is where it actually reads the contents of /dev/null, and it returned nothing. And then you can see that it closed the file, then it closes the file descriptors it uses to print to the display, and exits. So that's pretty straightforward, right? You see exactly what happened when I straced cat /dev/null.

And the cool thing is, all of these system calls have man pages, manual pages that you can read. Because, for example, I had no idea what fadvise64 was. I never used it in my seven years of being at Acquia, but I can look it up with man 2 and the syscall name. You have to specify the 2 because that's the section of the man pages that refers to just system calls. It's been like that since the 70s. So use man 2 with the syscall name.

Now we're gonna get into something really interesting. So being that I work at Acquia, and being that Acquia is responsible for hosting Dries's blog, yesterday I went on a server and I decided, okay, I wanna see what Dries's site is doing.
So I attached strace to his PHP process and started watching what was happening. So let's see what's going on here. You can see on the top a sendto, so we're sending something, and we have a file descriptor of 11, okay: SELECT something-something FROM cache_container. Okay, we're doing SQL. So we are intercepting, in real time, the SQL statements that are being executed by this PHP process, which is really cool. And then you can see that it's waiting for the response back from MySQL. And then, look, it read back the results of the select. So when you attach to PHP processes, you can see what's going on, which is really cool for debugging.

Now let's break it down. So -f follows child processes. What I mean by that is, I was attaching to the PHP-FPM process manager, because the process that's actually running Drupal hasn't been spawned yet. But if you use -f, it will follow the processes that it forks and executes, which is really cool. -p is the process ID. So you have to run top or ps, display the processes, pick the one you wanna trace, and then you can trace it. -s is the length of strings that you want to display from each syscall. I usually do 1024 or 2048, so you get enough to kind of get the gist of what's going on. There are some extra flags, too, that I sometimes use. So the one here, tracing sendto and recvfrom, says: I only wanna see those syscalls. So that means, oh, I only care about the MySQL stuff or the memcache stuff. That's a great flag that you can use. When you use the exclamation point, that's a "not", right? So you can exclude syscalls. How many of you use New Relic on your site? If you ever strace your process, you're gonna see gettimeofday all day, and it's gonna be really hard for you to figure out what's going on. So if you use that flag to exclude gettimeofday, it makes it much easier to see.
And then -T, which I find very useful, measures the time spent in each system call, and it will print it out for you upon completion. This is very useful in order to figure out, hey, why did that call take an entire second? That's kind of slow. Let's look into those particular operations. So it allows you to kind of profile what's happening, which I've used many a time.

Okay, so what can you do with strace? This is all pretty interesting, but what can you do with it? A very good use of it, and if this is the only thing you ever do with it, you would gain a lot, is tracing PHP processes. So you can see what Drupal is doing in real time. You can observe MySQL, you can observe memcache, you can see the responses to requests. You can see the headers and the HTTP response code. You can see what files are being read or written to. And again, you can measure the time spent in each place to see, hey, is there a performance problem doing a particular operation, which gives you clues.

There's another tool that goes with it: if the sword is strace, the shield is lsof. They tend to be used in pairs. So what lsof is, it's a tool that lists open files. You run it against a process, or you can run it on the whole system, and it will print out everything: every single file that's being opened, as well as all network connections, because in Unix everything's a file. And you use the same -p to specify the process ID. It will also list the file descriptors. So when you're looking at strace and you see the file descriptors being used, you can go to lsof and actually see what they pertain to, which is really cool. So here's an example of lsof, and we're just gonna get right into it. It looks really daunting, right? But I promise you it's not. So you can see on the left-hand side, it's telling you what command it is that's doing this. So this is Vim. It's a text editor, cool.
You can see that the current working directory is my home directory. You can see where the program lives. All this big chunk of gobbledygook is just the libraries that Vim needs to run. And then at the bottom here, we see some devices used to talk to the console, but oh, there's my garbage file, just like in the movie, right? So this is what it looks like in real life. The movies are just movies, but it's a similar deal. So that's lsof. Now, we're actually making excellent time, which means we have a lot of time for questions, which is cool. So I wanted to go over a couple of other scenarios. We've gone over quite a bit, so now we want to bring it all together and put this stuff to use. I'd like to talk about what's going on in my mind. I've done on-call for a very long time, and over the years I've developed this process, and I noticed that it's very, very similar, if not identical, to something called PDCA, which is on the right. So we'll go over the process and you can see what's going on. My troubleshooting kata for outages that involve infrastructure is as follows. The first thing I do is the USE method: for every resource, look at utilization, saturation, and errors, across everything. With that, you're going to be able to find the points of saturation, right? Those are your constraints, your limiting factors that are going to stop Drupal from loading as fast as you want. Then you plan: okay, what's the biggest constraint? What's the thing you believe will have the biggest impact on the performance of your site? Pick it. Then you need a plan for what you want to do to address it. Then you do it: you just implement the change. Then you check, and you measure. Did you fix the problem? Is that resource still a constraint? Is the site loading faster? Is it available now? Is it fixed? And then you act. So the site's back up, awesome.
You fixed it, wonderful. If things are getting better, measurably better, but the site is still down or slow, then okay, keep going. Go back to plan, figure out what the next constraint is, and repeat the process. And if it's unchanged, or you did something wrong and it got worse, undo the change. It's really important that you only make one change at a time, because as you're troubleshooting systems, if you make more than one change at a time, it's really hard to track which change produced which result. So that is what's going on in my head when I'm troubleshooting these issues. Just to lay down the groundwork, this is the stack we're going to use for our two scenarios. I have my laptop and I'm going to be loading the site. There's a load balancer in front on port 80 running Varnish, and it's reverse-proxying back to a couple of Apache servers. They're both running PHP-FPM with 10 processes each, so 20 total. And then there's a MySQL server with some storage and a file system server with some storage. Not too bad, right? A hyper-simplistic example of a multi-tier LAMP stack. Now, here's the first scenario. Your boss calls you up and says, hey, my page is slow and sometimes I get timeouts. Okay, so we apply the USE method to the first thing we see, which is the load balancers. We go and run all of those tools, and we see there's no saturation anywhere that we can find. Okay, fine. We move to the next tier, the web servers. And what we find is that all of the pm.max_children, all of the processes in FPM, are used. All 20 of them are in use. But something's interesting: the CPU metrics don't correspond to that. The CPU is mostly idle. So the processes are occupied, requests are backing up, there's a line going out the door, but nothing's happening. So we decide, okay, I'm going to run lsof on those processes and see what's going on.
And we get this particular line of information out of every single process we run lsof against. Can you guess what's going on here? Can anyone guess what's going on here? Bingo. So in Acquia Ops, we call this an external call. The concept is that from within your Drupal code, you're making a call out to a third-party resource that you don't own. And if it's down or slow, your site, or at least the pages that depend on that third-party service, are going to be down or slow too. While in operations, I've also seen instances where your process makes calls back into the website itself, but because all of the PHP processes are in use, you're waiting for the result of a call that can never complete, since all of the workers are occupied at that moment. So that's an interesting issue. The solution to these types of problems is to remove the dependence on third-party services, or at least, when you're gathering information from third-party services, don't do it in the page load. Or you can program defensively to gracefully degrade: okay, I tried to get that for a few seconds, I couldn't, so I'm going to back off and degrade gracefully. So that's scenario one, awesome answer, thank you very much. Here is scenario two, a similar thing on the surface. Our site is loading slowly and I'm getting some timeouts. We apply the USE method to the balancers again, and we're still not finding anything. We apply the USE method to the web servers and we see, okay, all the PHP processes are in use again, all 20 of them. And we see some CPU: half utilized, but not maxed out. So we're like, okay, that's not saturation yet. We keep going further back in the stack. Let's try the database. We apply the USE method to the database server, we run iostat, and we find this set of metrics on the database volume. I know it's really, really faint. For those who can see it, what's happening? Excellent, excellent.
So for the recording, a member of the audience said: hey, we're doing massive writes, and you can see 54 megabytes per second. The await values are elevated. And %util, which, yes, you shouldn't trust completely, is close to 100%. So we're doing massive writes. Okay, so we know something is happening on the database volume, on the database server. So it's reasonable to ask, okay, what's going on in the database? There are a few ways to print out what's going on in the database, right? You can just go to the MySQL prompt and say SHOW FULL PROCESSLIST, and it'll print out everything. I personally like to run the tool mytop, because it gives you some statistics too and it refreshes every few seconds. And we see a big pile of statements that look like that. And I hear a chuckle or two, so I think some of you know what's going on. What did we find out? Correct. The answer is dblog was turned on. There might be a bug or something in the code, and PHP notices are just slamming the hard disk. So yeah, the dblog module was enabled, and if your site has some bugs or code errors, it's going to eat up the available write bandwidth on your storage, which is bad, because then you're preventing Drupal from doing its actual work, which is loading websites, updating content, and things like that. So the solution, and this is from the Acquia operations team to you, my fellow Drupalists: please don't use the dblog module, use the syslog module instead. So let's recap. Troubleshooting infrastructure, as we can see from a couple of examples here, is accessible to mortals. You can use the USE method: you measure utilization, saturation, and errors on every component, you find the resources that are saturated, and you look there. You can look at hardware metrics. You can look at software resources and metrics. You can even take a look inside processes like PHP while they're running to see what's going on.
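The database-side checks from scenario two might look like this on the command line. This is just a sketch, shown as text, since the commands assume a live database server; credentials and refresh intervals are assumptions:

```shell
# Hypothetical commands for inspecting a busy database server, captured
# in a heredoc so nothing here needs a running MySQL instance.
db_checks=$(cat <<'EOF'
# Extended per-device stats every 2 seconds: watch wMB/s (write
# throughput), await (average ms each I/O waits), and %util (how busy
# the device is):
iostat -x 2

# Live, top-style view of running SQL statements, refreshing every few
# seconds:
mytop

# One-shot equivalent from the shell, via the MySQL client:
mysql -e 'SHOW FULL PROCESSLIST;'
EOF
)
echo "$db_checks"
```

In the dblog scenario, the processlist is where the flood of INSERTs into the watchdog table would show up.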
And you can keep in mind the plan-do-check-act process, where you change one thing at a time, you do experiments, and you make inferences and hypotheses about what is wrong: if I make this change, will it fix the performance? So we actually blazed through this, which is awesome. That means you folks can leave a little early, because it's the last talk of the conference. But I want to open it up to questions. Yes, sir? Actually, before that, yeah, we have the microphone there. That's for the benefit of those watching the recording, yeah, thank you very much. In one of your earlier slides, I don't know if you can go back, you said we'd better write better code to not tie up all of the 20 processes or something. I mean, that would certainly be, yes, the ultimate solution. Yeah, maybe that's it: remove the dependency. Okay, yeah, those would be the long-term solutions. But when your web server's down, what's the remedy? Ah-ha, so you're talking about how we minimize the impact to the customer while the third-party service is down? Like right now. Right now. So there's a whole spectrum of things, from the thing that is sustainable to the thing that is dirty and hacky and we need to fix it later. In some extreme situations, we in ops have gone out to the web servers and used iptables to block connections to that IP on that port. So you get an instant "I'm not talking to that," and then the page loads will continue. Right, so it'll break, but at least your web server's back. Exactly, exactly. You're probably going to have a bad time at the customer or client presentation level, but at least you're not just getting 503 errors and timeouts and not loading the page at all. That's an excellent question. The web server's down, oh my God, what have we done? We don't have the problem yet. Right. We've got to bring it back up. Yeah, absolutely, absolutely.
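That emergency stopgap might be sketched like this. The address 203.0.113.10 is a documentation-range placeholder for the third-party endpoint, not a real service, and the commands are printed rather than run, since changing firewall rules needs root:

```shell
# Hypothetical break-glass rules. REJECT (rather than DROP) is what gives
# PHP an immediate "connection refused" instead of hanging until its own
# timeout expires.
block="iptables -I OUTPUT -d 203.0.113.10 -p tcp --dport 443 -j REJECT"
undo="iptables -D OUTPUT -d 203.0.113.10 -p tcp --dport 443 -j REJECT"

echo "$block"
# Remove the rule once the third party recovers:
echo "$undo"
```

The REJECT target is what produces the instant failure described in the answer; DROP would leave the PHP workers waiting, which is the very problem being fixed.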
In my SRE experience, we've been putting together this entire lifecycle of incident management. And yes, we do go through this process, because that's triage, that's us figuring out: do we have the ability to solve this problem? But the first step isn't "I'm going to attack the root cause." The first step is "I'm going to stabilize my patient," right? So in those situations, you're figuring out what steps you can take to stay under SLA, and you go and implement those. And then, as part of a retrospective process, you can talk about: yes, this code depends on a third-party service, and these are the recommendations and changes you can make to improve the availability of the site. Excellent question, thank you. Any other questions or comments about this talk? Hi. So, hi, if my MySQL database is slow, is there a way to tell whether it's because of the MySQL config or because of the server? Is there a way to tell, or something to look for? Sure, so it goes back to the USE method, right? There are lots of signals we can get from how the hardware is being utilized. And one of the common situations is that I/O on the disk is completely saturated, at which point there are only so many things you can do. But yeah, there are some indicators where you can do various things. For example, with performance, one of the things you would do if the InnoDB buffer pool is too small is make the machine larger so you can have a larger cache: you have fewer misses, and therefore you're not talking to the disk as much. So it's really the process of asking, what's the bottleneck, and how can I make a change to eliminate that bottleneck? But it really depends on the situation. I've seen all kinds of solutions to problems, some in software, some in hardware, and some in MySQL tunings in between.
MySQL tuning is more of an art than a science, and that's why a lot of people hire database administrators who do this type of stuff. But yeah, good question. Okay, thank you. You're very welcome. Any other questions? So I know you said SSH is kind of a prerequisite for this, but if that's not possible or difficult, maybe you have a ton of servers, do you have strategies in that case to aggregate logs somehow, or aggregate these metrics and surface them in a different way? Yes. I mean, there's a spectrum, from the instant "I can just do this right now and gather data in place" to "I have monitoring and I'm looking at a dashboard." The ideal, of course, is setting up proper monitoring and telemetry and having a dashboard like Datadog or SignalFx. Acquia uses SignalFx, and you can look at the dashboard and say, yeah, that resource is saturated. But this talk was more about getting started. If you're talking to a lot of servers at once, like a whole lot of web heads, a tool that I introduced when I started at Acquia, and it's still being actively used today, comes from LLNL, Lawrence Livermore National Laboratory. It's called pdsh, the parallel distributed shell. We wrote a little wrapper tool around it that's aware of Acquia's asset management, and I can say: for this customer, give me the CPU utilization with dstat across all of the web heads. It gives me all of the servers at once, in bulk, so I can see whether the work is distributed evenly across the machines or whether it's just one. And if it's just one, I can zero in on that host rather than thinking it's a systemic problem. But yeah, the parallel distributed shell is very useful. Having some kind of log aggregation solution, like rsyslog, shoving your logs into a centralized place where you can query them, is also a useful technique. And having monitoring infrastructure, so you have dashboards and graphs and can see what's going on at scale, is very useful too.
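A plain pdsh fan-out, without Acquia's internal wrapper, might look like this. The host range web[1-4].example.com is made up, and the commands are captured as text rather than executed, since they need a real fleet:

```shell
# Hypothetical parallel checks across a set of web heads.
fanout=$(cat <<'EOF'
# Run the same command on every host at once; pdsh prefixes each output
# line with the host it came from:
pdsh -w web[1-4].example.com 'uptime'

# A quick CPU sample per host, to spot one overloaded machine in a fleet
# of otherwise idle ones:
pdsh -w web[1-4].example.com 'dstat -c 1 5'
EOF
)
echo "$fanout"
```

The per-host prefixes are what make it easy to see whether load is spread evenly or piled onto a single machine.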
Thank you. Hey, so I was just wondering: when it comes to setting up a specific server, are there any good tools you might use to figure out what resources to set, like the Apache memory limit and MaxClients connections, or InnoDB? I know you'd use something like MySQL Tuner or similar, but is there anything for Apache that you might use to get a base configuration set up? That's a very good question. I mean, I know there's the mysqltuner.pl script, but what we've done mostly is we have secret sauce at Acquia where, for this type of server with this particular hardware configuration, and usually it's gated on the amount of memory, we have some rules: this is the percentage of memory we're going to set aside for the InnoDB buffer pool, and it's all really based on the size of the machine. But special tools that magically produce the optimal tunings? Again, it's more of an art than a science. Things like mysqltuner.pl are kind of interesting, but I wouldn't stop there. I would tweak and see, and it helps if you know what your performance should be: set a standard and then work towards that goal. Yeah, absolutely, thank you very much. So as we have just a few minutes left, I wanted to ask if you have any tips on identifying problematic traffic. Let's say that I have a witch's brew of grep statements I use to look for IPs, but something more organized might be helpful. That's an excellent question, and I'm quite grateful that I have an answer. So there are some tools. One of the things we did in the old days was use the parallel distributed shell to get all of the Apache logs and tail them all, so we'd get all the updates, and then we would run a couple of things. One of the tools we used was called GoAccess. GoAccess is a web log analyzer.
It reads Nginx, it reads Apache, it reads Varnish, any NCSA-format logs will work just fine, and it gives you statistical breakdowns of which IPs are hitting you, which user agents are hitting you, as well as the proportion of HTTP response codes you're getting. And the cool thing about GoAccess is that if you're running it against a file that's being appended to in real time, it will update its statistics, which is pretty nice. Now if you want a quick, httpd-style report of your traffic really fast, Visitors is an interesting tool you can run to get a breakdown of similar metrics. But I would start there. They're command-line based, and again, they can and do output HTML for reporting, but that's a way you can get started really quickly without setting up a service or something. Awesome, thank you very much. Pleasure. All right, so we've got a few more minutes. Any other questions? I have a bigger umbrella question that's not necessarily related to this toolset. I saw your background: you were in ops, and now it's a DevOps shop. Do you have any tips on how I should ask my company to buy into the entire DevOps paradigm? I am so happy that you asked, and I will reward you with this: I did a talk in Baltimore called Viva La Revolucion, which covers an eight-step change leadership process for introducing DevOps to the workplace. I went through this process personally. I started it when I was in ops, because the fleet was growing. We were a very successful company, and the fleet was just growing and growing and growing, and the burden on the ops team increased linearly. So I wanted a solution for that, so I started visualizing work, gathering metrics, and creating a sense of urgency in the organization: hey, there's a problem, we need to start working on it. And that was step one. So definitely check out that talk if you're interested in how to drive a DevOps change. Very welcome, cool.
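Circling back to the log-triage question from a moment ago, the GoAccess workflow might be sketched as follows. The log path is an assumption, and the commands are printed rather than run, since they need a real access log:

```shell
# Hypothetical GoAccess invocations for ad-hoc access-log triage;
# /var/log/apache2/access.log is a typical but assumed path.
log_triage=$(cat <<'EOF'
# Interactive terminal report; it keeps updating as the log grows:
goaccess /var/log/apache2/access.log --log-format=COMBINED

# Or write a static HTML report for sharing:
goaccess /var/log/apache2/access.log --log-format=COMBINED -o report.html
EOF
)
echo "$log_triage"
```

COMBINED here is the standard Apache/Nginx combined log format; pick whichever of GoAccess's predefined formats matches your logs.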
We've got time for maybe one more question. All right folks, it was a pleasure to share this information with you. I'm going to be here until Friday afternoon. Please provide feedback; I love to speak and I want to be a better speaker. There are contribution sprints tomorrow, and they would definitely appreciate you donating your time. Here are the media credits for all those animated GIFs and things I put up, and my contact information. Thank you so much.