Imagine it's July 1969 and the Apollo 11 lunar module is descending to make the historic first landing. Five minutes into the descent, at 6,000 feet, there is a problem: a 1202 program alarm. Some of you may know the details of this, but the 1202 program alarm is actually a performance issue. If this performance issue is not diagnosed quickly, Neil Armstrong must abort the descent, and as he later said, he wasn't there to practice aborts. Fortunately, one engineer had seen it before and was able to give the go-ahead for the landing.

Imagine the Apollo lunar module guidance computer is the system you are analyzing. How would you begin to understand performance on that system? What methodologies or process would you go through to get to the bottom of that alarm?

What I like to start with for any system is a functional diagram. Here's a functional diagram from the Apollo era. I've actually annotated it a little, because it was missing a few parts of the computer: the erasable memory, which was 2,048 words of core memory; the 36 kilowords of core rope memory; and there are also the VAC areas, for vector accumulators, and the core set area, registers for swapped-out tasks. It was a really interesting and pioneering computer; the lunar module guidance computer was a time-sharing system. I like to start with functional diagrams because then I can work through the blocks and figure out what makes sense. In some ways, that's the first methodology I can tell you, and from my time as a performance consultant, this always worked well: tell the customer, draw up the functional diagram for your environment. You'd be surprised, or maybe not surprised, how many times no one had done that. When they draw the functional diagram, problems are often evident, and it's also useful for getting into more detail. So I'll return to this example and we'll see how we can analyze that computer, as well as any system.

I work at Netflix, and yes, this is a map of the regions where Netflix is available. Thank you. Pretty good so far. We have Linux (Ubuntu) on the cloud, and we have FreeBSD on our CDN. The easiest way to describe this is: when you first log into Netflix and you authenticate and browse, you're on the Linux cloud, that's AWS; and when you hit play, you're now being served from the FreeBSD CDN. Sandvine reports show that we're over 33% of the internet traffic in the US at night. So, very popular.

Now, as background for performance methodologies: I've done this for a long time, and systems performance up to the 90s was for closed-source Unixes and applications. The vendor would give you manuals. A long time ago I did VAX/VMS administration, so this is 12 feet of manuals. You would read the manuals and you would assume that the vendor had come up with the best metrics possible, and it was our job to interpret the metrics the vendor had provided. Now, some problems: the vendor may not provide the best metrics. Quite often customers are running workloads that the vendor hasn't really tested or thought about in detail. There are blind spots, and so we had to infer rather than measure. I used to do performance in this era when I first started, and you could be quite successful if you could read the tea leaves from the output of vmstat and mpstat and ps and various tools, and figure out what the kernel was doing even though the tools never said that. Today it's really different, it's really exciting: almost everything we run is open source now.
And we can now do custom metrics. If it's open source, I can take the source code and write in the metrics or instrumentation I need, some advanced logging, some printf statements, and then run a custom kernel or a custom application. But we also live in the era of dynamic tracing: any software at all, I can instrument and then get metrics out. So it's now more useful than ever to think about methodologies. What do we do with these awesome powers? How do we explore a system and understand it? One of the problems of dynamic tracing is that, since you can instrument all software, and a body of software has tens of thousands of functions, and you can see the arguments, and you can look at latency, and you can do histograms, you can really drown in metrics. There are a lot of metrics. Methodologies help guide you through those metrics so that you can put your finger on the ones that are most important. I was heavily involved with the launch of DTrace, and this was a big problem we faced when DTrace launched: it was a great superpower, but what do we do with that superpower?

So it's a different type of thinking now. Instead of assuming the vendor is going to give you the best metrics, we now think of the questions we want answered. It's the opposite of being provided the answers to start with and then trying to figure them out. Now it's about the questions: what do you really want answered from the system?

Some anti-methodologies to start with, so that you can understand methodologies. The first is what I call the streetlight anti-method, and this comes from a parable about a drunk who's looking for his keys under a streetlight. A police officer finds the drunk and asks, what are you doing? And he says, I've lost my keys, I'm looking for them. The police officer asks, did you lose them under the streetlight? And the drunk says, no, but this is where the light is best. I see this quite often in performance analysis. People will run top because they always run top and it's familiar, instead of using more advanced tools, or tools that are appropriate for that subsystem. They'll run what they find on the internet, they'll run things at random. It sometimes works, but it can really waste time. You can go around in circles. You can miss things because there are blind spots.

Now, that was an observational methodology. This is an experimental one: the drunk man anti-method. This is where you tune things at random until the problem goes away. I give these methodologies names so that you can recognize them in the workplace. I once came to work and the configuration of the application I was working on had been changed quite a lot, and I was told: Brendan, we had to change the config, it was performing badly in the middle of the night, we didn't wake you up. By the way, we used the drunk man anti-method. And that was actually useful, because immediately I understood why they had picked those crazy tunables: they just guessed. And so then I could debug them myself and understand.

Another anti-methodology I'd like to mention I call the blame-someone-else anti-method. You find some component that is not your responsibility, and then you hypothesize that the problem must be that component: go talk to that team. I've seen this many times, where the network gets blamed. It must be the network. It must be retransmits, or there's something wrong with the microwave link, or maybe it's a BGP issue.
Go talk to them. Go talk to the DNS team, maybe it's DNS. That's an anti-methodology. Another one I don't like is the traffic light anti-method. Traffic lights are easy to interpret: red is bad and green is good, and so people like to create dashboards where they put colors on everything. Now, colors are good for objective metrics, for errors, where something is actually broken. If a disk has a failure, you can color it red; if it doesn't have a failure, perhaps green. But for performance analysis these are often subjective metrics, like IOPS and latency. What latency might be good for one person who's running a chat server may be different for someone who's running a high-frequency trading application. And IOPS: how do you even say what IOPS figure is good or bad? Is a thousand IOPS per disk good or bad? It really depends on your point of view. So I don't like it when dashboards put colors on subjective metrics. It's okay for objective metrics, where something is either broken or not broken.

So that's a few anti-methodologies. Now for methodologies. I've been collecting them, and I published a lot of them for the first time in my last book, Systems Performance. I hoped that people didn't find this too crazy, because no one had really thoroughly explored methodologies like this before. There had been some: Method R was published for Oracle analysis, and there are some methodologies in Raj Jain's The Art of Computer Systems Performance Analysis. But I really studied lots and lots of them, and I think they're extremely useful, and I'll pick a few to go through here. For systems engineers, these are ways to analyze an unfamiliar system or application. And if you are the developer, you can have your dashboards support the methodologies. So this is my toolbox. These methodologies also work across any operating system; I've given a talk covering methodologies before for different operating systems, and I was able to port them to BSD for this talk without too much difficulty. The hard part is knowing what to do, and that's what writing up the methodology captures.

The first one is the problem statement method. I learned about this from an engineering support team. This is the first thing they would ask a customer over the phone: What makes you think there is a performance problem? Has the system ever performed well? And so on. They were able to diagnose performance issues without even logging into the system. And it's always worth asking. When I was a system administrator, I once had a database administration team tell me: Brendan, this database is performing badly, you must log in and debug it and fix it immediately. It's like, wow, all right, let me have a look. So I run vmstat and I look at the summary since boot, and I see it's always been on fire since it booted, many, many days ago. That's strange, because they're only telling me now. Then I look at iostat and the summary since boot, and everywhere I look I don't see any changes to the system. So I'm trying to get my head around why it would be bad now, when the disks have always been this busy and the CPUs have always been this busy. And then I finally remembered: maybe I should ask, has the system ever performed well? And the answer was: no, it's always been like that. It's been like that for weeks. We only just thought of asking you.
But of course the way they asked me was as though something had changed, like something broke and I must dive in and debug it. So after that I always remembered: has it ever performed well? Has it always been like this? Can the problem be described in terms of latency? That one's very useful. People will often say, my CPU utilization is high, that's a problem. Is it a problem? It might be good; you're getting a return on investment. So come up with a metric that better reflects customer pain. Does the problem affect other people or applications? They say, well, latency for this application is bad. It's like, well, latency for every application I run is bad, because the Wi-Fi is down for the office. And also check what the environment is. Basic stuff, but always worth doing.

The functional diagram method I mentioned at the start: you draw a functional diagram, you trace the components in the data path, and then for each component you check performance, however that works. I've got a nice picture here of the ARPANET from 1969. You can imagine, if you were told the internet is slow back in 1969, you could break it up into only a handful of pieces and then go through them one by one and actually root-cause the slow part. That's what the functional diagram method is about: breaking a big problem into smaller parts.

Workload analysis, also known as top-down analysis. This is where I begin with the workload that's applied to the system and then I drill down: what's happening at the application level, at the library level, at the system calls, in the kernel, down to hardware. I try to see if there are some requests performing badly at the application level, then break it down: what library calls are happening, what system calls are happening? Is the latency in the system calls, in the library calls, or just in the application code? If it's in the system calls, then which system call? Are we doing reads to a file system? Then break that apart. So you drill down from top to bottom. It's useful because when you do this methodology, you're beginning with the application context. You may begin with: this is the request that makes customers unhappy, it has high latency. Then as you drill down, you can provide system metrics in the context of that application request. It can be difficult to dig from the application down to the resources.

Workload characterization, which is actually the top of that stack, the workload applied, is its own methodology, and I've used it to solve many, many problems. This is where you don't think about the resulting performance; you think about just the workload that's applied. I used to look after a storage appliance, and many, many times customers would benchmark it by driving a crazy microbenchmark workload and saturating the system. Just by characterizing the applied workload, I didn't really need to look at the system. Just by looking at the millions of IOPS they were trying to drive at something that was underpowered, that wasn't supposed to do that much, like half a JBOD of disks instead of our full configuration, it was clear: you've picked the wrong system. So it's always worthwhile to check the workload applied to the system before you get into the resulting performance, because the workload that's applied may be crazy.
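To make that concrete, here's a rough sketch of my own, using DTrace on FreeBSD rather than the appliance tooling from that story: who is issuing the work, and what the request sizes look like.

    # who is applying the workload: count read/write syscalls by process and syscall
    dtrace -n 'syscall::read:entry,syscall::write:entry { @[execname, probefunc] = count(); }'

    # what the workload looks like: a power-of-two histogram of read sizes, per program
    dtrace -n 'syscall::read:return /arg0 > 0/ { @[execname] = quantize(arg0); }'

A minute of output like that is often enough to show that the applied workload itself is the problem, before any tuning.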
Another example: if you've got a web server that's misbehaving, before you look at the latency, are you under a DDoS attack? That's workload characterization. As soon as you know you're under a DDoS attack, you solve the DDoS attack. You don't need to start tuning Apache tunables to make the DDoS attack go faster.

To drill into workload characterization a little more, for CPUs I can break it up into the four questions I've listed here. Who is using the CPUs? Why: what are the code paths and the context? What is it actually doing on CPU, in terms of CPU cycles and instructions? And how is that changing over time? We can answer them fairly well. Who is using the CPUs: top or similar tools. How is it changing over time: there are lots of products that plot that. Why: we can do CPU profiling and CPU flame graphs, which I'll get to in a moment. And what are the CPUs doing: we can use pmcstat on BSD and really understand at a low level what the CPUs are doing. You can ask who, why, what, and how for any component. I want to call this out because if you look at most companies and monitoring products, they don't do all of these. They will do who is using the CPUs, so you basically get top output, and of course they'll plot things over time, so you get line graphs. But fewer monitoring products will actually profile the code, even though a CPU profile is hugely useful. And barely anyone touches PMCs, even though PMCs are becoming more and more useful, because systems are getting faster and faster and the bottleneck is moving to the memory subsystem. Memory stalls are a big issue, and you need performance monitoring counters to understand them. So we can do better.

Resource analysis is another methodology. This is where the analysis starts at the resources and works its way up, and you may be familiar with it: this is how systems performance has traditionally been taught by old books, where there will be a chapter on disks and a chapter on CPUs and a chapter on networking, because they start with the hardware components, try to identify whether there's a performance problem there, and then work their way back to the application. It's okay, it's another methodology. I'll use all the methodologies; there's not just one I'll use, I'll pick everything. One of the problems with this method is false positives. Quite often you can look at the disks and say the disks are on fire, performance is really bad, and as you work your way up the stack you realize it's write-back flushing, or it's read-ahead. The application is not actually blocked on the disks; the file system is doing it asynchronously. So looking at the disks alone can be misleading.

The USE method, which is short for utilization, saturation, and errors. I came up with this a long time ago, and it's proven very useful. I draw a functional diagram, and then for each block and bus and interconnect on that diagram, I just want those three metrics: utilization, saturation, and errors. This is great because there are lots of metrics: if you look at a system, there are hundreds or thousands of them, from sysctls or wherever else you get your metrics, which is way too many. But this whittles it down to maybe a few dozen once you iterate over the devices. And it also poses questions for the system to answer.
So instead of starting with the metrics the system provides and asking what sense they make, I start with my questions. I want to see DRAM utilization, not in terms of how much is in use, but in terms of, say, stall cycles, or the cycles where I'm actually talking to main memory. It's like: oh, that's actually a difficult question, now I need to use some PMCs. You can see how this methodology will drive you to asking questions you may not normally ask, because the tools don't make them easy to answer, which we can change.

I came up with a Rosetta Stone for the USE method; it's on my homepage. It's very, very long. This is only a fraction of it, where I go through Linux, FreeBSD, and Mac OS X for CPU errors, saturation, and utilization. Those three metrics are really useful for a higher-level understanding of performance. In fact, the order I've printed them here is the order I check them in. Errors: if something's broken, it's broken, and that's easy to interpret. Saturation is the next most useful metric. CPU saturation means I have more threads to run than I have CPUs, so I have threads in the runnable state waiting their turn on CPU. That's bad; as a performance engineer, I can fix that in a number of different ways and improve performance. So saturation, after errors, is next most important. Utilization is last. The CPUs are 50% utilized or 80% utilized; it may not matter too much. Utilization is interesting over the longer term for capacity planning, because you know that once you get close to 100% you're likely to have queueing, and once you hit 100% you're likely to have saturation. But I start with errors, then saturation, then utilization. There's my FreeBSD one. Again, I had to truncate this for the slide; it's pretty long.
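As a rough illustration of a few of those checks on FreeBSD with standard tools (my sketch, not the Rosetta Stone itself; the exact columns vary by version):

    # CPU utilization: the us+sy columns; CPU saturation: the "procs r" column,
    # threads waiting for a turn on CPU (the first output line is the summary since boot)
    vmstat 1
    # Memory saturation: swap in use and paging activity
    swapinfo
    # Network interface errors: the Ierrs/Oerrs columns
    netstat -i

Errors first, then saturation, then utilization, per component, and you only need a handful of numbers from each tool.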
But I can apply that to anything, so I probably need to update it too. I applied it to Unix Seventh Edition, because that was fun, and figured out how to do all the different checks. I also applied it to the Apollo lunar module guidance computer. So say you're a performance engineer, it's 1969, and you want to understand its performance. Some boxes in the middle include the core set area, the VAC areas, erasable memory, and fixed memory. The core set area was important: that's where the threads that aren't on CPU have their information live, and it could hold several concurrent tasks. So if you started looking for metrics for the core set area: utilization would be how many of those slots are in use. Saturation would be when all seven slots are full, meaning the computer can't handle any more concurrent tasks. An error would be trying to launch a new task and being told: I'm full, I can't run it. It turns out the 1202 is that saturation condition: when the core set area is full, the computer throws a 1202 alarm, and that's the alarm Armstrong had during the descent. It's interesting to see why. The rendezvous radar was turned on and misaligned, and unlike in simulation it was adding extra CPU load to the system. They were expecting to have some headroom during the descent, but because the rendezvous radar kept interrupting the CPU to do calculations they didn't strictly need, the core set area went to 100% and they got the 1202 alarm. The rendezvous radar was tracking the command module, so that if an abort happened they could plot the trajectory back to it, and so they could turn it off and still land. I should also say that the VAC areas, the vector accumulator areas, filled up during the descent as well; that was the 1201 alarm.

But that's the USE method, and you can actually go through hardware components like this and figure it out. Could I do utilization, saturation, and errors for the rendezvous radar, or for the descent engine? In some cases yes, in some cases the metric may not make sense, but it's a useful exercise to go through because it reveals metrics. You can also do this for software. For example, mutex locks: utilization for a mutex lock would be lock hold time, saturation would be lock contention, and errors would be any errors it throws. I could do it for an entire application or a microservice on the cloud as well. And if the USE method feels a little like queueing theory, that's because queueing theory studies utilization and queue length, which is saturation, as well; this adds errors.

Tom Wilkie came up with the RED method, and I think he was inspired by the USE method. This is at a higher level, for a web service, looking at three metrics: request rate, error rate, and duration, and going through the components of a modern application environment. It's a similarly useful exercise: here's my functional diagram, these are the three metrics that are most important, now I step through each component and see how I can measure them. The result of the USE method or the RED method may be that you come up with a dashboard, so that everyone can go to the dashboard and it lists those important metrics.

Thread state analysis is a different methodology. I've drawn a generic thread state diagram here; I think I borrowed a lot of it from Bach's Unix book. You can see when a thread is on CPU, that's colored red, and it's switching between user mode and kernel mode. And when it goes off CPU, there are lots of states: it can be in the runnable state, it can be swapping, it can be waiting on resources, and so on. The thread state analysis method is about identifying application threads and seeing which of these states those threads are in, because each of these states leads to actionable items. If you find you're spending most of the time in the blocked state, well, you've got a lock problem, so let's look at the locks and the lock contention, and maybe you've got too many threads in a thread pool, or whatever the issue is. If I find I'm spending most of the time in the swapping state, then I've run out of memory and I'm being kicked out. If I find I'm spending most of the time in the runnable state but not on CPU, that means I've exhausted the CPUs. This is useful because it works for any application: these states are not specific to the application, they're specific to the operating system. So it's another methodology you can apply generally, which is great. And you see thread states everywhere you look. On OS X there are thread states in Instruments, plotted over time, so you can see whether you're waiting, suspended, running, on a run queue, and so on. Thread states have been around for a long time; one of the oldest examples I found was from RSTS. And on TENEX you could hit Control-T and it would print out the current task and tell you its job state.
But again, the kernel, or the executive program as it was often called back then, tracked these states for the running processes or tasks. I also came up with a really good methodology for doing thread state analysis on Solaris. I can't really use that anymore, but I've been getting it to work on BSD; I did this in the last two days. What about thread states on BSD? Surely there is a way to come up with similar states. DTrace has a lot of events, probes in the scheduler, so based on the DTrace probes we can see when we go on CPU and off CPU, and we can see when we're in the runnable state, waiting for a turn on CPU. We can also pull information out of the thread struct: td_state has some basic thread states, but there are also the TDI inhibitor flags, which tell us if we're sleeping, suspended, swapped out, waiting on a lock, or in IWAIT, and we can yank those out as well. So I'm going to try to demo it, because I only got this working about an hour ago, but I wanted to do it as a proof of concept; I just did it quickly.

Oh, isn't that nice? It fits on an 80 by 24 screen, so it's no more than 80 characters wide. I've printed out the command name and process ID, and then I've got the states: CPU if you're on CPU, run queue if you're waiting your turn, sleep, suspend, swap, lock, IWAIT, and yield. And if I run a workload, so there are a few of them, and I can put one in the background so it's suspended, you know, so now I've got some suspend time. So there we go: now I've got my checks at the bottom, and I can see one of them has 2.4 seconds in the suspend state. I've just printed these out in milliseconds, and you can see the other ones are competing for the CPU, so they're still running. I was hoping that would fit. So there's my CPU and run queue time. I can probably add some more states and flags to this, because there are also flags in struct thread.

What's useful about this is that it works for any application at all, and it gives you a direction to begin performance tuning. If I am mostly on CPU, I should use a CPU profiler to find out what code I'm running. If I'm spending most of my time on the run queue, it means I've exhausted the CPUs, and I should look at buying more CPUs, or killing tasks that are consuming CPU. And so on: you go through each of the columns and they each direct further actions. Maybe I'll leave that running and see if I can catch something else happening. So there's an output. That's thread state analysis.
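Here's a rough idea of the kind of DTrace that thread-state tool is built on. This is a minimal sketch of my own, not the tool itself: it only accounts for the on-CPU column using the sched provider, whereas the real tool also reads td_state and the TDI flags to split out sleep, lock, swap, and the other off-CPU states.

    #!/usr/sbin/dtrace -s
    /* sketch: on-CPU time per process, summed from scheduler events */
    sched:::on-cpu  { self->ts = timestamp; }
    sched:::off-cpu /self->ts/ {
            @oncpu[pid, execname] = sum(timestamp - self->ts);
            self->ts = 0;
    }
    tick-5s {
            normalize(@oncpu, 1000000);     /* nanoseconds to milliseconds */
            printa("  %-6d %-16s %@d ms on-CPU\n", @oncpu);
            trunc(@oncpu);
    }

Timing runnable (run queue) time the same way would use the scheduler's enqueue and dequeue events, and the remaining columns come from reading the thread struct when we go off CPU.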
I said that if you were in the CPU state, you would use profilers. So, on-CPU analysis, which I've split out. Something you can do to start with is check which state you're in, user or kernel, and that's pretty easy to find. You can check your CPU balance: if you know you're on CPU, are you hot on only one CPU, because you don't have enough threads or you're pinned to a CPU? Profile the software: CPU flame graphs are great. And to get down to the low level, there are PMCs and cycles-per-instruction flame graphs.

Flame graph analysis. So we're on CPU; let's do a flame graph. Hands up if you've used flame graphs. So that's about half the room, a bit more than half the room, great. Flame graphs are simply a visualization of sampled stack traces; they can also be traced stack traces. It's really an adjacency diagram, and it works really well. The x-axis is the passage of... no, sorry, the x-axis is alphabetical sort order, not the passage of time, and the y-axis is the stack depth. And you read them by looking at the largest towers first. Since half the room haven't used them yet: if I go to one of my standard flame graphs, I can zoom in. This is Bash running. If I'm interested in where my CPU cycles are spent, I just look for the biggest rectangles, and the biggest rectangle says Bash is in main, then reader_loop, then execute_command. Okay, I'm in execute_command, and once I'm in execute_command I then run this and this and this. You can split it up quickly: you just look for the biggest rectangles, because that's where you spend most of your time. The stack trace goes up. Don't worry about the x-axis ordering, because I sort the frames alphabetically so I can merge them quickly, and then you can zoom in. We use these at Netflix all the time, on both Linux and FreeBSD, so that we can understand where the CPU cycles are going. And of course, if frames are really thin, I don't have enough room to draw the names, and that's not a bad thing: if they're really thin, they didn't contribute much to the profile, so they're not that interesting. Just by virtue of only being able to label boxes that are big enough, the visualization leads your eyes to the areas you should care about. So this discovers issues by their CPU usage and narrows down the target to study. I've got the commands to do it using DTrace on FreeBSD; you can also create flame graphs using pmcstat and other profilers, and I've got the steps for DTrace online.

I've also been doing things with mixed-mode flame graphs. At Netflix I do this a lot for Java analysis. I color the kernel a different color, and I can see my Java code in green, C++ in yellow, and user-level native code in red. I have not tried this on BSD yet, but it should work if you use the preserve-frame-pointer option, which I worked with Oracle to get added to the JVM just so that we could do flame graphs; it means system profilers can walk the stack back traces. And then I use perf-map-agent, which does an on-demand symbol dump. So if you've run into problems profiling JIT'd runtimes, there are solutions; it just depends on the runtime. That's how I'm currently doing Java.

I mentioned a CPI flame graph: CPI, cycles per instruction. Here's an example from BSD. This example uses pmcstat so that I can see instructions and stalls, and I've colored the output so that the width on the x-axis shows how often you're on CPU, relative to cycles, and the color shows what those cycles were: were they mostly instructions retiring, or mostly stall cycles? If they're mostly stall cycles, then you need to do memory tuning: how can I do zero copy, what are my NUMA settings, can I just do less memory I/O to improve performance, or should I switch to a system that has a better memory subsystem? If it's more red, then it's cycle bound, instruction bound, and there are different actionable items: look at reducing code, using different-order algorithms, and so on. We've automated these at Netflix, so the OCA (Open Connect Appliance) team uses the CPI flame graphs for performance analysis.
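Before moving on, here's roughly what the DTrace-to-flame-graph pipeline for those on-CPU flame graphs looks like. This is my sketch rather than the exact commands from the slides, and it assumes stackcollapse.pl and flamegraph.pl from the FlameGraph repository are on hand:

    # sample kernel stacks at 99 Hz for 30 seconds (the /arg0/ predicate keeps kernel-context samples)
    dtrace -x stackframes=100 \
        -n 'profile-99 /arg0/ { @[stack()] = count(); } tick-30s { exit(0); }' \
        -o out.kern_stacks
    stackcollapse.pl out.kern_stacks > out.kern_folded
    flamegraph.pl out.kern_folded > kernel_flamegraph.svg

User-level profiling is the same idea with ustack() and a predicate on the target process.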
Off-CPU analysis. On-CPU analysis is great, but what about when we block and go off CPU? We can do that with an off-CPU flame graph: we instrument when we block, and what the state of the thread is. I did one for FreeBSD. This is an off-CPU time flame graph. Let me just browse it. Open.

This time the x-axis is not a sampled population: instead of taking periodic samples, I'm instrumenting, I'm tracing, so the wider a path is, the more time we spent off CPU in that path. The y-axis is still the stack depth. I've got the user stack trace and then the kernel stack trace, and the user stack frames are all hex because the symbols were stripped; I don't have the symbols, but we can fix that. You can see the kernel stack traces. This is bsdtar, tarring up some files. Most of the time we're here in dofileread, which is probably not surprising. Sometimes we get into breadn_flags via bread. Sometimes we get into cluster_read, which I believe is doing read-ahead. For some reason our bsdtar gets into this twice, so over here is the same stuff: we've got cluster_read and breadn. But there are other bits in here as well. Here's a readdir: I can see we spent 55 milliseconds off CPU in the readdir. And there's also this: seek. For some reason tar was doing seeks, so we spent 30 milliseconds in seek. But I can quickly identify where the bulk of the off-CPU time was: for this profile, about 1,000 milliseconds was doing file reads. If this was off CPU for another reason, say I was blocked on locks, that would show up in the stack trace. So off-CPU flame graphs are a very powerful tool for tackling all the times that you block and go off CPU.

Here's how I did it with DTrace. I'm using sched off-cpu, I record a timestamp, and when we go back on CPU I do the delta and record the stack traces. I'll share these slides online. And then I feed it into flamegraph.pl, colored blue. Here I've just filtered for the tar command, actually bsdtar. One thing you might consider doing, and it's interesting to get an insight into this: if you run this on a production system, sometimes you'll get some crazy code paths where you go off CPU, and you look at them and think, that doesn't make any sense, but they're really thin and narrow. We seem to be randomly jumping off CPU with no explanation for it. There's no explanation in the user code, but if you look at the kernel stack you'll see you're getting preempted: you've got CPU saturation and these are involuntary context switches. A way to filter them out is to look at the thread state: I can check that td_state is less than or equal to one, which should exclude the involuntary context switches, and you get a somewhat more sensible off-CPU flame graph. So there I've decorated some of the paths for that one: file read, read-ahead, that was that one.
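Here's a minimal sketch of that off-CPU tracing, my reconstruction rather than the script from the slides; it assumes the sched provider, and the real version also filters on thread state and feeds the output through the flame graph scripts:

    #!/usr/sbin/dtrace -s
    /* sketch: off-CPU time by user and kernel stack, for one command */
    sched:::off-cpu /execname == "bsdtar"/ { self->ts = timestamp; }
    sched:::on-cpu  /self->ts/ {
            @offcpu[ustack(), stack()] = sum(timestamp - self->ts);
            self->ts = 0;
    }
    END { normalize(@offcpu, 1000000); /* nanoseconds to milliseconds */ }

Fold the aggregation output the same way as the on-CPU stacks and render it with flamegraph.pl, colored blue for off-CPU time.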
Now, this one, I'm just running tar and piping it to /dev/null. And this one, I'm running tar and piping it into gzip, and it looks completely different, because now we're spending most of our time in pipe write: tar is waiting for gzip to finish its work and write it out. If you're a performance analyst, that's kind of interesting: okay, I'm spending a lot of my time blocked in pipe write. But if you tried to make the system faster, there's no explanation here. I'm waiting on a pipe, but I don't know what happened on the other side. What was it doing that took so long, especially since that dominates? A solution to this is wake-up time profiling, where I instrument the wake-up events. The way I've drawn it here, the flame graph is really boring: it's just one stack. gzip has woken up bsdtar, and it's because gzip was reading from the pipe, which is what we expect. There's my wake-up profiling DTrace script.

Now we can go one step further, except to do that in the kernel I've had to use eBPF, so this example is from Linux: in the kernel I can merge the waker stack with the blocked stack. You can see tar, this is tar blocking, and then here is the waker stack that woke us up. It would be good to get this working on BSD as well, because it's a more powerful tool for doing off-CPU analysis. I've done lots and lots of off-CPU flame graphs, and I think you'll find probably 70 to 80% of the stack traces are not that interesting on their own: you're blocked on a mutex, you're blocked on a condition variable, and you need to know who woke you up, and their stack trace, to make sense of it. So being able to merge the waker stack with the blocked stack is pretty important.

Just to mention it, as it's interesting: BPF. BPF is the Berkeley Packet Filter, and I believe it came from BSD. BPF previously was probably not that interesting a technology; I mean, it's kind of interesting, and it helps network filtering performance. But what made it interesting was that recently some engineers realized it was an in-kernel virtual machine, a sandboxed virtual machine, and that if it could be extended, it could be used for all sorts of things, like software-defined networking, or instrumentation like DTrace. That's what eBPF is. eBPF programs look like this: there are more actions we can do, more registers. eBPF programs have maps, which are hashes, associative arrays; they can handle stack traces; and they can perform actions. I'm starting to do a lot of work with it on Linux. The front end is really difficult. In fact, I've written a lot of programs in a lot of programming languages, and raw eBPF assembly is the first language that has defeated me: I've not been able to write a program that compiles, because it's its own special assembly that was made up for eBPF, and I can't really go to Stack Overflow and ask who else has run into this, because there's nothing. So it's hard. Fortunately we are putting front ends on it, so the front ends should get better. At least this program fits on a slide; this is BCC, which George mentioned earlier. That program fits on a slide, but if you browse them they're quite long; they're not as high-level as DTrace yet. I have been asked a few times whether eBPF can solve some of these interesting things, like the merging of stacks, and whether eBPF could be ported to BSD. And why not? Someone has to do the work, though; you have to put your hand up.

Latency correlations. Another methodology: say you're doing a drill-down analysis from top to bottom, and you have some latency outliers and want to understand where they originate. For latency itself, it's really useful to do a histogram, so you can identify whether you have multimodal latency or outliers.
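For that histogram step, a DTrace one-liner is enough to see whether latency is multimodal or has outliers; a sketch, using read(2) syscall latency as the example:

    # power-of-two histogram of read(2) latency in nanoseconds, per program
    dtrace -n 'syscall::read:entry { self->ts = timestamp; }
        syscall::read:return /self->ts/ { @[execname] = quantize(timestamp - self->ts); self->ts = 0; }'

The same idea, bucketed per time interval, is what feeds a latency heat map.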
Even better is to use latency heat maps, which I came up with a long time ago, originally instrumented using DTrace so that we could get the rich data out. This is an example where you can see a multimodal distribution: the y-axis is the latency, the x-axis is the passage of time, and the color depth is how many I/O fell into that latency and time range. The heat map also shows how that distribution changes over time, which can be very interesting to a performance analyst.

But for latency correlations I would go further. This is something I also originally did at Sun, where we could measure latency at different levels, like NFS, disk, and application, and then look at the heat maps at each level. If we see a problem instrumented at one level, does it appear at the next level down, and the next level down? If there's a cloud of bad latency, a bad storm cloud that is hurting the application: does it come from the file system? Does it come from the disks? If you see it in the file system but not in the disks, then you might look more closely at the file system. Is there some problem in the file system, like lock contention, or memory allocation it's waiting on, or something like that? Because I don't see that latency cloud coming from the disks. So latency correlation is a method where you look at latency at different levels of the stack and compare them visually to see where the latency originates.

Another methodology, and maybe I should have covered this first because it's a very simple one: checklists. Checklists work. I don't like them a lot, because they can miss things; they're only as good as the checklist. But they are useful in that they're something you can share with a lot of people. Here's a BSD performance analysis checklist in 60 seconds: the first ten commands I would run on a BSD system when I want to understand performance. We use lots of dashboards at Netflix for the cloud, and in a way dashboards are checklists as well; that's just an example dashboard we use for performance engineering. So checklists aren't just a list of commands: you can also make a dashboard that serves as a checklist.

Another favorite methodology of mine is static performance tuning, first described by Richard Elling a long time ago. The idea of static performance tuning is that you look at the static state of the system, without load. Imagine someone says this system is performing really badly, and then they turn off the application, and you, the performance engineer, log in. You run top or uptime or whatever, and the load is gone; the system is now idle. Can you do anything at all on an idle system? Well, you can do lots. You can check for lots of problems, even though the workload is not present and it's just a static, idle system. You can check how full the disks are, because that can cause performance issues when you're getting close to full in FFS or UFS or ZFS and they're trying to find places to put blocks. What services are turned on? What applications, and how are they configured? What do memory allocations look like? How big are the caches? What does my route table look like, is that crazy? Have I swapped to disk, is there a whole heap that's been swapped out? What are my ZFS settings? What are my GEOM settings? And so on and so on. Also dmesg, and everything you see in kldstat and kenv. All the static configuration of the system can be thoroughly checked.

Another performance analysis methodology is the tools-based method. This is where you just try all the tools. It's not the best, because you may miss things, but it is very simple to do. Martin gave a talk yesterday about doing black-box performance analysis, where he said he's looking at all the metrics:
I'm just taking the metrics that the system has and I'm working my way through them to solve the problem. That's similar to a tools-based method: this is what the system makes it easy to dig out, so let's just quickly go through it and see if the problem sticks out. And so there's my diagram of FreeBSD tools. I also draw separate diagrams for the DTrace tools, because they supplement the existing tools, and I'm starting to add more DTrace tools for FreeBSD, so I need to publish them. Actually, the thread-state tool I just did for this talk needs to go on this diagram too, in the scheduler area, so it's already out of date.

Some other methodologies I want to mention. The scientific method: who was taught the scientific method at school? Okay, so maybe a quarter of people. It depends on the country; for some people it's been drummed into them and they know it off by heart, and some haven't seen it. It's a useful way of posing a hypothesis and then going through it and analyzing it. There's the five whys methodology, where you ask why, and then based on the answer you ask why again, five times, to see if you can get to the bottom of something. Process of elimination, commonly used: finding tests that eliminate pieces from the operating table, so what you're analyzing becomes smaller and smaller; you can also build a tree for that and do differential diagnosis. Intel has an interesting methodology: has anyone used the top-down methodology on PMCs? Nobody? It's interesting. PMCs are their own universe of stuff that is becoming more important; that's how you learn to understand cycles and stall cycles and what the caches are doing. Intel came up with this interesting methodology and coded it into VTune as well, but there are also open-source tools that do it, where they break cycles up into parts, and so on and so on, to understand what the CPUs are doing. And then Method R is one that's been around for a while, for looking at wait states in Oracle.

I want to leave you with some things you can do, and the first is to know what's possible on modern systems. Dynamic tracing means we can efficiently instrument any software, and it allows us to play these methodology games and come up with new methodologies that, in the past, on older systems, would have been an academic exercise. It's like: great, you want to measure thread states; we can't, because the kernel doesn't expose that. But now I can have that idea and sit down and write the DTrace program for it, which I did earlier. So it's a great time to be investigating performance analysis methodologies, because things are open source, and we have dynamic tracing tools, and we can try them out and see what works. CPU performance counter facilities are becoming more and more important too: performance monitoring counters. Who's used PMCs? Who are the PMC nerds? All right, lots of people. So, pmcstat on BSD; more and more, performance issues are moving down to low-level CPU problems, and you can get some things out of MSRs as well.
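If you want a starting point for PMCs on FreeBSD, here's a short sketch of mine; the available events and aliases vary by CPU, so check what pmccontrol lists on yours:

    pmccontrol -L                        # list the PMC events this CPU supports
    pmcstat -S instructions -T           # system-wide sampling, top-style view
    pmcstat -S instructions -O /tmp/samples.out sleep 10   # log samples while a command runs
    pmcstat -R /tmp/samples.out -G /tmp/callchains.txt     # post-process the log into callchains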
Visualizations are important as well: flame graphs, latency heat maps. These should be commonplace. And ask questions first: decide what you want the system to answer, and then find a way to answer it, instead of just using what the system gives you.

And also: maybe you're going to write the DTrace tools and use the PMCs yourself, but a lot of people pay vendors to write monitoring software, so ask the vendors to do it. There should be a PMC dashboard in whatever nifty monitoring software you're paying for, so you press a button and it does Intel's top-down analysis of PMCs and gives you the answer, or it does thread state analysis and gives you the answer. We don't all have to do the heavy lifting for some of these detailed methodologies. Or maybe someone at your company does it: maybe it's someone who looks after performance or the operating system, and they can write the tools and dashboards and share them, or maybe it's a third-party company selling that as a dashboard.

Be aware that dynamic tracing is very efficient, and it's a different mindset. I still run into people who think along the lines of tcpdump: well, let's dump all the packets and then post-process, and we can identify everything, like retransmits. With dynamic tracing, I can just trace whatever the retransmit function is, and I don't have to touch every packet; whatever the activity is, there's probably a kernel function that deals with it. So you can end up with much more efficient instrumentation. Touching every packet is a last resort. Can you imagine doing that when we're over 33% of US internet traffic at night? And then you have vendors coming to Netflix saying, hey, buy our product, it just instruments every packet. It's like: do you not know Netflix? We are a third of the internet traffic. We can't do that; you can't add overhead to every packet. Yeah, I've had this conversation a lot. Eventually they figure out that we're not buying their products because we can't turn them on. But you can solve a lot of these things without touching every packet.

And of course, with dynamic tracing you can really go to town. Here's an old diagram of mine from Solaris with a lot of the DTrace tools I wrote. These have fortunately survived: they're being ported and they work on BSD, they haven't completely died. Under the OpenDTrace project there are the DTraceToolkit tools, and maybe that interests you, to help out and see which ones still need to be ported. I included this slide as an example of how much instrumentation is available to us nowadays.

And performance monitoring counters. So, PMCs: here are the PMC groups on BSD for Intel Sandy Bridge. You can see it's a functional diagram, so I already like it. I drew this one: I drew the functional diagram, then I looked at pmcstat and all the groups, then I decorated the diagram, and now I have something useful for attacking PMCs on BSD.

And visualizations are the last takeaway I want you to have. We need to do better with visualizations in monitoring products. Latency heat maps should be a common staple; line graphs are very useful as well. So it's the crystal ball age of performance observability: whatever questions you want can be answered, and methodologies, like some I've summarized here, are a great way to pose those questions. I'll post these slides online; I've got all the references and resources there. And thank you very much. Do we have any questions? Do you have a few minutes for questions? Sure. Awesome. Thanks for the talk.
So I just wanted to ask: you mentioned that eBPF could be ported to BSD. Do you see that as a replacement for DTrace, or as a complementary feature, functionality that could provide more coverage? So the question, he had the mic anyway, is about putting eBPF into BSD. Well, put it this way: BSD already has BPF. You already have it, so it's not like we're introducing something totally new; it's a matter of enhancing what's already there. It would be something that I could see either DTrace or another front end using to do the advanced instrumentation, like merging stacks in kernel context. eBPF itself is just the engine that runs a program when an event fires. You still need the instrumentation part, which does the dynamic or static tracing and then calls eBPF, and you still need a front-end part, a high-level language for you to write things in. So there are lots of different things we can do, but to start with, if BSD were enhanced with the eBPF instructions, then maybe one of the smallest steps would be for DTrace to be able to use it as needed. I've already talked to some people who are interested in doing this development, including for enhancing BSD firewalling, because with BPF you can have a basically unlimited map in kernel memory that the kernel program can access and user level can access, so user level can populate and update things, and you can have a potentially much faster firewall for complicated rule sets. Cloudflare and Facebook are already using eBPF for things like DDoS attack mitigation. So let's say eBPF shows up in BSD as an enhancement to the existing BPF, and it gets used for DDoS mitigation and security, and no one uses it for observability. That could happen. I would expect at some point someone's going to port the extensions over to BSD, and we'll see what happens from there; at the least, they'll be using it for software-defined networking. I'm sure George Neville-Neil can talk more about future plans, and maybe George and I can talk about it and figure something out.

Okay, no more questions? So... I think we had a question down the front. Where? Oh, sorry. That thread state analysis tool you wrote: when can I use it? I will publish it. It's new, and I may have messed something up, but I will stick a link in the slide deck and then put the slide deck on SlideShare; if you search SlideShare for Gregg, you'll find the links. I actually have a GitHub repository of FreeBSD DTrace tools, so I should put it in there. And tell me if I've messed something up. I think it makes sense. I think it makes sense. Thank you very much. Okay.