and we're here for broken Linux performance tools. Previously at SCALE I covered working Linux performance tools, and I also drew this diagram to show where the tools provide observability, which has been quite popular. This is a complementary talk: it's about broken Linux performance tools, particularly observability and benchmarking. My objective is to bust assumptions about tools and metrics.

The prior talk about working performance tools is the sort of talk that's fun to put together, as I get to talk about things that are exciting and things that work. But when you start to do a lot of performance analysis, you realize that's not what the landscape looks like. Some of the tools are exciting and work well; many of the tools don't work well, and understanding what's good and what's bad is important for developing performance expertise. And that's what this talk will help you accomplish. I will embrace the bad and talk about it, and we can see the sort of solutions that help us navigate the minefield that is performance. So you'll learn how to verify and find missing metrics and to avoid the common mistakes of benchmarking. I'm going to discuss current software, and as I talk about various bugs you might think, well, I can fix that. Yes, please do. So if in a couple of years I give this talk it might be a shorter talk, and maybe we'll get it down to a 30-minute talk; that would be great.

So I work at Netflix, and we've just launched worldwide, so I have this great map of where Netflix is available. Sorry, it's almost everywhere; we're doing great. And we have Linux on the cloud on AWS, tens of thousands of instances, and we have FreeBSD running our CDN. It's also an awesome place to work, and you might have seen the earlier talk, from sysadmin to SRE, because we're hiring SREs, and there are many of us here today you can talk to about that. Please do.

So the first section I'd like to talk about is observability. And I'll start with load averages: something straightforward that we all should be familiar with. Load averages still get used at Netflix; it's one of the signals we use for auto scaling groups, so whether to scale up a cluster based on load. The problem with load averages: well, there are two problems with load averages. One of them is the word load and the other is the word average. Load on many other operating systems means CPU demand, but on Linux load means CPU plus uninterruptible disk I/O, which can be a bit confusing. So you have that NFS server with a load average of a thousand because you have all these threads doing uninterruptible I/O. The word average is a bit tricky to understand as well: it's actually an exponentially damped moving sum. When you see the three numbers, the one, five and fifteen minute load averages, it's not really the average over one minute. To explain that a bit better I took a workload which was a single hot thread, and I plotted the one, five and fifteen minute load averages over time. You can see the one minute load average eventually settles to 1.0. That's what it should be: the load average is reflective of how many threads are in the runnable state. But that's the one minute load average. At the one minute mark of having begun this load, the one minute load average is only 0.62. That's because it's an exponentially damped moving sum, and it's reflective of prior history. So what do one, five and fifteen minutes actually mean? Well, they are the constants used in the damping equation.
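To make that concrete, here is a minimal sketch of the idea in awk (the kernel does this with fixed-point arithmetic and its own constants, so treat this as an approximation): it folds in a sample every 5 seconds with a weight derived from the one-minute constant, and after 60 seconds of one hot thread it only reaches about 0.63, much like the 0.62 I measured.

    awk 'BEGIN {
      load = 0; n = 1                      # one runnable thread, starting from idle
      e = exp(-5/60.0)                     # per-5-second damping for the 1-minute average
      for (t = 5; t <= 60; t += 5)
        load = load * e + n * (1 - e)      # exponentially damped moving sum
      printf "1-minute load after 60 seconds: %.2f\n", load
    }'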
Or another way to put it: don't spend longer than one minute trying to understand this. It's just like a heartbeat signal to see if a server is healthy or a little bit busier. You said that it was an infinite impulse response filter? That's right. He said, thank you, it's an infinite impulse response filter, and it's feeding back the original signal.

That's why it's actually useful to plot things. Because when you plot things you recognize patterns that aren't so obvious from numbers. So quite often, if I'm trying to understand a workload or a metric, which is related to this talk, you just plot it, and then based on the profile you identify: I've seen that before, and I understand that now. There's a whole science in performance engineering about modeling workloads and coming up with equations for them. I often find if you just plot them you'll recognize the signals before you get that far.

Top. Percent CPU. The next metric I'd like to talk about. Well, this seems fairly straightforward: if I run top on a Linux system and it's telling me Java is eating 935% CPU in total (this is from one of our production instances), then that's fine. I know this is consuming CPU, it's Java, and because this is an instance with many CPUs it can get that high. So this shouldn't be misleading or broken. But it can be, and I've heard of this being posed as an interview question. Short-lived processes can be missing from top. So if you've ever done a software build and you run top, those short gcc processes and make and Bourne shell just disappear; you don't see them in top's output. You can use things like atop, which uses process accounting, and perf to get around that. One thing I particularly don't like is that short-lived processes vanish off screen updates. And so you see something and you go, wow, that's the thing that's causing the problem, and then the screen's gone. So I like to run pidstat instead. It doesn't clear the screen; it keeps printing updates each interval and I can scroll back. But that's not so bad. Okay, so we're missing %CPU in those situations.

Misinterpreting %CPU. I want to dwell on %CPU because it's one of the most common metrics we use to understand server behavior. %CPU can mean different things on different systems. I had an example of 935 %CPU; that's obviously not out of 100, so it's summing across multiple CPUs. It can sum the CPUs, it can be a percentage of total CPU capacity, or it can be damped historically, using an exponentially damped function. It depends on the operating system, so check: look at the man page. On Linux it is summing them up, so that we can see the total consumed.

Another problem. When I was creating these slides I was looking for screenshots to put in there, and this one is particularly interesting. The top of top has the system summary, and I can see %CPU broken down into user, system, nice, idle and so on. And then we've got the per-process percentages. What often happens is you add up the per-process percentages and they don't quite add up to the summary line, because you missed some short-lived processes. Okay, I know about that. In this case the per-process percentages added up to be more than the system summary. So we see more down here than we do up there. Hmm, that doesn't sound right. Hands up if you think the system-wide summary is correct. Okay, so I've got like four people. Hands up if you think the per-process summary is correct.
Okay, so I've got like three people. Other people are unsure. We'll see what's happening next. This sounds like a pretty good interview question. Actually, it's just trivia; it's not a horrible interview question: go read the source code and tell me which one is right. I've got the quote there: a man with one watch knows the time; with two, he's never sure. Unless you're Mark Shuttleworth, who has a GPS clock in his house. So from Documentation/cpu-load.txt in the Linux source, about /proc/stat, which is what's giving us that summary: sometimes it cannot be trusted at all. And that's in the kernel source. It's like, well, that's interesting. Good thing we don't use it for anything. Actually, we use this for everything. All the performance monitoring products read /proc/stat and get the information from there. Fortunately, it's not that broken, but it can sometimes be a little bit misleading, or just flat out wrong. Both of those values can't be right.

It does get worse. So that's looking at what top is giving us as %CPU, and then looking at what the kernel is exposing as %CPU. But looking at %CPU as a metric itself: what is it anyway? A nice way to describe it is that there is good %CPU and bad %CPU. Good %CPU is where we're retiring instructions, making forward progress on our program, unless they're spin lock instructions, in which case that's bad %CPU because you're just wasting instructions. But generally, if I'm retiring instructions I'm making forward progress. Bad %CPU is where the CPU is busy, but the CPU cycles are not actually doing work: they are stalled, waiting on memory I/O, waiting on some other resource, or thermal throttling or some other event. We can understand the difference between good %CPU and bad %CPU in lots of different ways, and performance monitoring counters inside the processor will tell us if we're stalled on cache misses, main memory I/O, the LLC, or other resources. But a much simpler metric, similar to load averages in terms of simplicity, is instructions per cycle. It's like miles per gallon, but for CPUs: strictly, cycles per instruction would be the gallons-per-mile way around, and IPC is the miles-per-gallon way around. A high IPC is where, for each CPU cycle, we are able to retire many instructions, and CPUs can do this because they execute instructions in parallel and out of order, turning them into micro-operations. To give you some numbers: a high IPC might be 2.0, a low IPC might be 0.5, and a low IPC means that, on average, per cycle, we only make forward progress on half an instruction. So IPC, like a miles-per-gallon metric, helps us understand whether we've got the good %CPU or bad %CPU. And that's really important, because if I'm trying to make deductions from some %CPU, like "we should buy faster CPUs because we know we're CPU bound", well, it may make no difference: if you're memory I/O bound, faster CPUs just mean you stall faster, but you're not making forward progress any faster. So %CPU alone is ambiguous. And it's actually a widespread problem, not just in Linux but in the top tools of all operating systems: they don't split %CPU into retiring versus stalled, which is what they should do.

Although it does get worse. To describe this with a story: things get even stranger, and this surprised me. I was doing performance analysis of RxNetty versus Tomcat, and one of the metrics I like to use to quantify performance is CPU cycles per operation, to understand these frameworks.
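Roughly how you might measure that: a hedged sketch, not my exact setup, assuming the service is a single java process and that the load generator reports the request rate.

    # total cycles for the java process over 60 seconds
    perf stat -e cycles -p $(pgrep -d, -x java) -- sleep 60
    # cycles per operation ~ total cycles / (requests per second from the
    # load generator x 60)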
And I found that the CPU cycles per operation for Tomcat was fairly high as I ramped up the load. When I began, the CPUs were only 10%, 20% busy; as I ramped up to 80% CPU, 90% CPU, just by throwing more load at it, the CPU cycles per operation went down. So the processor became more efficient at executing operations the more load you threw at it. Can anyone tell me why? It commonly happens. What's that? SpeedStep. SpeedStep is one answer. What else is there? Things get faster as you throw more load at it. Filling up threads. Caches. CPU hardware caches: the more load you're throwing at it, the more you're lighting up the caches, you may just cache better. Yes, JIT could be doing it: you're fundamentally executing different code. At 20% CPU utilization versus 80%, or much higher load, the instructions that get executed are different because the JIT chose to do something different, and so it's apples and oranges; you can't compare the two. The CPUs could clock faster with Turbo Boost. There are actually lots of reasons for this.

And as I went through them: I checked the hardware caches, and it wasn't hardware caches. It was basically the same; it explained a tiny bit. And the number was: it was 1.8 times more efficient at high load than at low load. So I was trying to explain a 1.8x. It wasn't hardware caches. It wasn't Turbo Boost. I did a flame graph just to see if the functions or the overall code changed from low to high utilization; that was the same. Yes? Was it real or perceived? Well, that's a good question, because I'm trusting various metrics, so how do I even know that I'm measuring %CPU correctly? I did know the actual delivered throughput of the system, but when you're trying to understand it in terms of CPU metrics like %CPU, I would use other ways to double-check that it was accurate, which is an important lesson of this talk. You have one more idea? Right. Yeah, as you drive load up, the scheduler and the network stack can behave differently, so it can't just be different code; a flame graph would identify that. Wow, people really think about this, it's great. Branch prediction: branch prediction in the pipeline, because you're loading it up, can be different, which should ultimately change IPC because you're taking the wrong branch, depending on how that's calculated. So that's another possibility.

Those are all possibilities. I went through a whole heap of them and it wasn't any of those. And I'll have to move on: it was SpeedStep, which was the first answer, so thanks. SpeedStep really broke my mind in this case, in that at low load, at 20% CPU, the kernel (SpeedStep is driven by the kernel) decided to run the CPUs at 1,600 MHz. And when I ramped up to high load, the kernel said, you know what, I should just run the CPUs faster; the CPUs are now going at 3,000 MHz. And I'm trying to compare cycles per operation based on my %CPU utilization, and I'm comparing apples and oranges. It's not the same thing. This was a hardware box I set up, and I'd forgotten to set the Linux governor to performance, which pins you at 3,000 or whatever. So what does this mean? It means if I tell you %CPU, as a lot of people have helped point out, %CPU itself can be ambiguous, because you need to know a lot of things. What is the IPC, instructions per cycle? What's the actual clock rate you were running at?
Because it could be SpeedStep or Turbo Boost doing things differently, and you need to know that in order to be able to comprehend the number. How did I figure this out? It was a process of elimination. IPC should have covered many of the bases, it should have covered cache misses and branch prediction if it's measured right, and IPC was fairly static across the range. And so that got me thinking about the actual cycles executed. So I started using those counters and pretty quickly figured out that we were running at completely different clock rates. I also used CPU flame graphs to see that the code was the same, and that we weren't doing, say, polling versus interrupts and things like that. So %CPU itself can be a complicated metric, and it can be even more complicated because it doesn't account for whether cycles are stalled or retiring when you get down to the micro-operation level. At the micro-operation level we have lots of different functional units in a processor, and they can execute things in parallel, and it's a simplification to say the CPU is on or off from one moment to the next when internally some functional units are on and some are off, and it may have more internal headroom for more processing. So it actually turns out to be a really complicated metric.

Another metric we like to use in Linux that turns out to be fairly complicated is iowait. Iowait suggests the system is disk bound, but it is often misleading. The problem is: if I have higher iowait, let's say I'm comparing options, I'm trying a new configuration for my database and iowait goes up, I may say that's bad because I've got more blocking and I shouldn't make that configuration change. But if I have lower iowait, that might actually be bad too. To really try to understand iowait I've drawn a Venn diagram: iowait is an idle state. If I am idle but I have pending disk I/O, then we call it iowait. But I could have pending disk I/O that's just covered up by CPU. So if you have an iowait problem on your system, just run SETI@home and burn up all those CPUs, and you've solved the iowait problem, because you've overlapped it with CPU cycles. That's what actually makes interpreting this complicated. But it's fine, so long as we do eventually understand it.

Free memory has been another confusing metric on Linux, and there's even a website, linuxatemyram.com, which has a picture of a penguin with DRAM hanging out of its mouth. A lot of these things are counter-intuitive, and that's what makes things hard: we need to learn them. When the free column goes down in vmstat, or in a monitoring tool you're using, it may not be bad. It depends how it's calculated. Most operating systems use the principle that if there is free memory available, use it for something useful: you can use it for the page cache, the file system cache. And that's why we use free -m, and it gives us a clue there. So initially I might have thought I had 2.6 gigabytes free; I've actually got 3.3 if I include the file system cache. Okay, so that sounds good. Now run ZFS. ZFS hasn't hooked into this stuff yet; free should be updated to handle ZFS. So now we get back to the problem where free memory goes down to basically zero on a system with ZFS. And the answer is: oh, don't worry about that, it's part of the ZFS ARC. You need to use arcstat to figure that one out. So again, free memory can actually be complicated; you need to figure out how it's calculated.
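For example, roughly (the exact output format varies between versions of free, and arcstat may be installed as arcstat.py on older ZFS-on-Linux systems):

    free -m          # the "-/+ buffers/cache" line adds the page cache back as available
    arcstat 1        # on ZFS, the ARC size has to be checked separately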
Yeah, it shows up as used instead of cached for ZFS. And in some ways you could call it a software bug, in that free is trying to give us the notion of what's in the file system cache and it hasn't been updated to handle ZFS. So again, it's important to understand the source of the metrics and how they're calculated.

vmstat, the classic Unix tool, is also difficult to comprehend on Linux. The first line has some summary-since-boot values, but not all of them. In fact, I can't even remember which are which; I generally have to run a test workload to figure it out. On other Unixes the first line is the summary since boot, so there is a difference on Linux. It's also used as a system-wide summary, which is pretty good, except we're missing networking. But that's okay: we can run netstat -s, and then drown underneath all of the metrics it gives us. It's been getting better; Linux has been adding more and more metrics, like SYN retransmits, which I really like. But there are a lot of metrics to wade through; I think it can be over 200. One of the problems with this is that it still doesn't include everything. I don't have really decent metrics to know, say, TCP queue utilization, how much is queued from one moment to the next, and there's other stuff that I want that's actually not part of netstat -s, which you may assume is there when you see so many statistics. There are also minor things like typos and inconsistencies, and there's often no documentation outside of the kernel source code, so it does require some expertise to comprehend.

Disk metrics themselves: well, disk metrics are worse, because all disk metrics are misleading, and you'll understand this if you've worked in the storage industry. Disk percent utilization, or percent busy? The problem is we often have logical devices that may be backed by multiple disks, so a percent busy calculation just tells us that something was active during that window of time. We don't know if half of the spindles or flash drives were busy, or all of them. You don't really know the headroom, which is kind of the whole point of measuring percent busy or percent utilization. Disk IOPS can be misleading too. If we're looking at that with, say, iostat: is high bad, or is low bad? It really depends on what you're trying to do. And disk latency: as people get better at understanding disk performance, they will start using benchmarking tools and looking at latency histograms of disks. But even then, it may not be what you think it is. There have been decades of work to make disk I/O asynchronous to the application: write-back buffers, read caches, all sorts of different levels throughout the stack. In the real world, so much engineering work has gone into keeping your application from ever going near the disks, so if you're dwelling too much on disk latency you have to ask why it matters that much. File system latency matters more, because the file system is what your application actually talks to, so I find it much more useful to measure it there. The file system is much more useful: that's what your application talks to. It has a big file system cache, but we don't have any file system cache hit or miss metrics in Linux. This is just completely missing. We do need a hit/miss ratio. It's something I did hack up with ftrace, and another version has been hacked up with eBPF, so that at least we can get the hit/miss ratio out of the file system cache.
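That ftrace hack is the cachestat tool in my perf-tools collection on GitHub; it's a proof-of-concept that instruments specific kernel functions, so treat it as a rough sketch rather than a supported metric:

    git clone https://github.com/brendangregg/perf-tools
    sudo ./perf-tools/bin/cachestat 1     # per-second page cache hits, misses and ratio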
So far I've covered many metrics that are misleading, some that are wrong, and some that are missing. Just to pause and say what you can do about this: you need to verify and understand the metrics, if they're important. Realistically, you don't have time to verify and understand the 50 or 60 metrics your monitoring tool may be giving you, and that's okay, so long as you're aware of which metrics you have verified and which you haven't. So when you're using them to solve issues: these three are known to be good, because I spent half a day reading the Linux kernel source and testing them and they work; all the other metrics, I don't know that they're good, but I can use them as clues. That's a mindset that's really helpful: knowing what is proven to have worked, and knowing what you've yet to have the time to prove and verify. You can cross-check with other tools; dynamic tracing is great for that. Test with known workloads: a benchmark tool can actually be useful for testing observability metrics. I know this is my supposed throughput; what does the observability tool tell me? Read the source, obviously, and use known-to-be-good metrics.

Finding a missing metric is even harder. Methodologies like the USE method pose questions for metrics to answer, and that's a great way to discover that you're missing metrics. Also, draw a functional diagram of the environment, or Linux, or the application, to try to figure out which parts you don't have observability for. Sometimes it gets a bit depressing and you think, can't we just burn it all down and start from scratch? But there's been so much work in systems metrics and understanding and documenting them that it's hard to do. Just as wild speculation: there is one environment where they already have burnt all the metrics down and they kind of do need to invent them from scratch, and that's some of the unikernels I've looked at. You can't log in and run vmstat and iostat and ps and top and get confused by all those metrics we've had for decades, and some of the engineers are already thinking about the problem: if we have to reinvent it all, what should we do? I'm not saying that will be the solution, but it's a surprising opportunity; maybe someone will figure out what metrics we should have, given a clean slate.

Profilers. Linux perf is a great profiler, and it's another observability tool; I'm using it here to understand how the CPUs are being used, by code path. It has this nice hierarchical tree view. Unfortunately, it's very verbose: 13,000 lines of output, and that's kind of a problem with today's profilers, the output can be onerous to read through. That's the full output in one image, and this is why I came up with flame graphs, which use a hierarchical visualization, an icicle plot that is upside down, and it shows the same data we saw in that impossible slide of text, and I can navigate it much more easily. Great, sounds like profilers are a solved problem: we can just do flame graphs and everything works. Unfortunately, there are lots of issues with profilers. Yes, you had a question? If you just Google flame graphs, or DuckDuckGo flame graphs, you'll find all the steps are online.
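Roughly, the steps are: sample stacks with perf, then fold and render them with the FlameGraph scripts from my GitHub repo. The sampling rate and duration here are just example values:

    git clone https://github.com/brendangregg/FlameGraph
    perf record -F 99 -a -g -- sleep 30        # sample all CPUs at 99 Hertz for 30 seconds
    perf script | ./FlameGraph/stackcollapse-perf.pl | \
        ./FlameGraph/flamegraph.pl > flamegraph.svg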
Visibility. When we tried to do this (Netflix runs a lot of Java), we found you can use Java profilers, and they typically will show you the Java code (colored green), but they can't see outside of the JVM, so I can't see the kernel or libraries. I have some visibility of GC, and there are various other problems with those profilers as well.

Inaccurate or incomplete profiles. System profilers, like Linux perf_events, are great because I can see kernel activity, but I was missing stacks and symbols, so I couldn't see the Java methods. And this is one of the examples I wanted to include in a broken-performance-tools talk, because that's the real world: you often find the profilers don't work, and it requires some engineering time to fix. We did this work at Netflix last year, and now JDK 8 update 60 has -XX:+PreserveFramePointer, so we can do proper system profiling. But don't assume profilers will work out of the box; it may be a few weeks of work to get them to see your code. The particular problem here was compiler optimizations: the kernel, and profilers generally, like to walk stack traces using the frame pointer register, and compilers have reused that as a general-purpose register for years. That's why GCC has -fno-omit-frame-pointer, which you should always use because it helps debuggers and profilers, and it's why Netflix helped get PreserveFramePointer put into Java for the same purpose.

Missing symbols, just to mention it: if you're getting into profilers, you'll often find you can profile stack traces but you don't have symbols; you just have these hexadecimal addresses. It's just another problem to work through. Netflix does have some solutions: perf will look for supplemental symbol files, and there are ways to create them for Java and Node.js. If you try to do instruction profiling, you'll find that instruction profiling has actually been kind of broken for many years. Here I wrote an assembly program that just does NOPs in a loop, and then I profiled it, and I found that somehow the CPU was jumping from one NOP to the next (those are the percentages of samples), and you could never see the instructions in between. There are various reasons why instruction profiling has been broken for many years: skid, out-of-order execution, and sampling the resumption instruction. This is why, more recently, Intel has come up with things like PEBS, which Linux supports, so that you can get precise event-based sampling. So if you're going to get into profilers, some happier things you can do to address these problems: do get stack trace profiling working, get stack traces and symbols working. It's worth it, because you can then do things like create flame graphs.

Also, for observability tools, it's important to understand overhead. tcpdump seems to be the genesis for many tools, and people build GUIs on top of tcpdump. I've solved lots of issues with tcpdump, and it's great, but you do need to understand the overhead, and I'm always wary of anything that does per-event dumping. Try tcpdump on a 10 or 100 gigabit Ethernet system and you'll see that dumping all of the packets, even just the headers, starts to incur high overhead. The screenshot I've got there was dropping packets; it couldn't keep up. So there are various overhead costs in doing this. What I try to do instead is use, say, dynamic tracing, and go into the kernel and do frequency counts of TCP events. If you have no other tool and you use tcpdump, and you pay a price but you ultimately solve the problem, that's okay. The most important thing is to understand that there is a price to pay. strace is a much bigger price to pay. An example here, a worst-case example: dd copying one-byte blocks from /dev/zero to /dev/null runs 400 times slower if I run it through strace.
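The sort of worst-case test that was, roughly (the slowdown you see will vary by system):

    dd if=/dev/zero of=/dev/null bs=1 count=500k                    # baseline
    strace -e accept dd if=/dev/zero of=/dev/null bs=1 count=500k   # same workload under strace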
What's interesting here is I'm stracing the accept syscall, which dd is not doing at all. It's not even calling accept, but it's still that much slower, and that's because of the way strace does instrumentation: it has to instrument all the syscalls before it can apply the filter. And the way strace currently works, with ptrace, is like setting breakpoints on your application: there's a breakpoint when you enter a syscall and when you exit, with context switches for every system call. So we should use more modern tracers instead, like perf_events. They're much better, but there are some hidden dangers with things like perf_events, SystemTap and strace as well. To give an example: with perf_events I can record the scheduler switch event, pretty useful, and let's also capture call stacks. And it wrote a 100-megabyte perf.data file. That's a lot of data, and this is only for one second; imagine if I traced for 100 seconds, I'd have a 10-gigabyte perf.data file. My problem is I've traced something that's very frequent: the scheduler switch event. When you use tracers, and the same is true for all tracers, including say eBPF and DTrace, you have to be aware of the overhead of what you're doing. There are various costs. If I'm dumping every event, to a tcpdump file or a perf.data file, that's high cost; if I'm doing in-kernel aggregations, it's much better. And then there's the frequency of events: you need to develop some idea of how frequent the thing you're trying to trace is. Scheduler events are really frequent, so I'm always careful if I go near the scheduler. But there are lower-frequency events: process creation and destruction, for example, is generally lower frequency.

Of course, some tools make it pretty clear that they're going to slow the application down. Valgrind says in its documentation: I will make your application run 20 to 30 times slower. So at least in that case it warns the end user. Java profilers can be even worse, depending on how they work, if you run these on Linux. Sometimes they have two modes: sampling stack traces at, say, 100 Hertz, or tracing methods. And by now you should be able to identify that tracing methods could be the really expensive one, because of the frequency of events: I could be doing millions of method calls a second, and that could slow the target. What I find weird is that the documentation for these can describe method timing as highly accurate, even though your app runs a thousand times slower. How can it be highly accurate if my app is running a thousand times slower? If everything is going a thousand times slower, race conditions could be different, networking events are all different. I did see a good talk about this, Profilers are Lying Hobbitses, which went into more detail. So for profiler overhead, you do need to understand how the profiler works: how it's instrumenting what it's telling us, and what the frequency of events is. Generally, if I'm doing in-kernel summaries and it's fewer than 10,000 events a second, it should be negligible; if it's more than 100,000 events per second, you start to be able to measure it.
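One hedged way to gauge an event's frequency before tracing it is to count it in-kernel first, without writing a perf.data file:

    perf stat -e sched:sched_switch -a -- sleep 1          # scheduler switches: usually very frequent
    perf stat -e sched:sched_process_exec -a -- sleep 1    # execs: usually far less frequent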
Monitoring. Monitoring I can mention quickly, because a lot of the risks for Linux monitoring are the same ones we saw for tools and metrics. The tools can be quite misleading in the metrics they're showing you, and they can be missing metrics, like a file system cache hit/miss ratio and some of the network statistics. The problem I see many monitoring tools have is that they assume the system metrics are perfect, and the attitude is just: let's plot them, let's graph them. And of course whenever I talk to vendors, that's not the problem. The problem is the metrics are broken and misleading and we're missing all these metrics. I don't need someone to graph this stuff, I need someone to fix this stuff. So that's kind of the problem we have. Another issue with monitoring is if it's built on some idea of event tracing, where we'll trace every event and then post-process, which can incur a massive amount of overhead. And of course doing this cloud-wide can make things all the worse: if your monitoring product is supposed to work on the Netflix Linux cloud of tens of thousands of instances, how much data do you need to push around to the monitoring servers for it to work?

Statistics themselves can be quite misleading. Averages can be misleading because they hide latency outliers, and per-minute averages can hide multi-second issues. I had a customer with an issue in a prior job where I knew they had a CPU issue, but they were convinced they didn't, because their monitoring system said they peaked at 80% CPU. That 80% was the one-minute average; when we looked at per-second values, the CPUs were flatlining at 100% in bursts and then dropping. So always understand how the statistics are calculated for your metrics. Percentiles can be misleading as well: if I hit my 99.9th percentile latency, intuitively that sounds pretty rare, but if I'm processing millions of events a second, like we are at Netflix, that happens all the time, so we need to look more closely at those. What we like to do is look at the distribution, and there are lots of ways to do that. In this example I've plotted disk I/O latency on the x-axis and the frequency of disk I/O events on the y-axis, and it's multimodal: there's a low-latency mode at about half a millisecond, then a higher mode at 1.2 milliseconds, and the average is in between them. It's an interesting real-world example: the average is supposed to be the index of central tendency, and in this case it's not, it falls between the modes, so it is quite misleading. If you're looking at averages, you also want to look at the distribution, so that you understand whether you have a multimodal distribution or a long tail and the average is not what you think.
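One way to see the full distribution rather than just the average, as a sketch assuming the bcc collection is installed (the tool may be named biolatency-bpfcc on Ubuntu packages):

    sudo biolatency 10 1      # histogram of disk I/O latency over a 10-second interval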
Speaking of misleading things: visualizations can be misleading as well, and if you've seen my talks before you know I'm not a big fan of traffic lights, people putting red and green colors on things. On Linux I've got dstat and htop, and I like those tools; they're innovative in all sorts of different ways. Personally I'm not a big fan of the colors. If you like the colors and you find them useful, that's good, but for me, when I see things like: my system time is 80%, that's green, that's good, and 20% user, that's red, that's bad, I'd color them the other way around. 80% system time kind of sounds like the Linux kernel is doing NUMA rebalancing, which we've seen, just burning CPU in the kernel. And htop has its own color highlighting of the same workload, which can be more misleading than useful. So I find traffic lights are good for objective metrics, like actual failures; they can be misleading for subjective metrics, where it's latency or IOPS and it's not clear, based on your business requirements, what is good or bad. I'm not a big fan of tachometers either; there's at least one Linux monitoring product that likes to use them, especially with arbitrary color highlighting, or pie charts for real-time metrics.

So what you can do for that section, for monitoring: it's the same as for the observability tools. You need to verify metrics and understand overhead. For statistics, ask how is this calculated, is it an average over an interval, what's the calculation, and look at the full distribution; just looking at it you might see that you've got multiple modes, or the shape of it. And for visualizations, you do want to use histograms and heat maps and flame graphs, as they add value.

The last section I've got is benchmarking, and benchmarking is extremely error-prone, to the point where I would say that almost 100% of benchmarks are wrong. There's a nice quote I've taken from a paper called A Nine Year Study of File System and Storage Benchmarking (that sounds like hell, who'd do that for nine years), which said most popular benchmarks are flawed. You can see how restrained they were when they wrote that. And, this is what catches people out, all alternatives can be flawed as well. So people tell me: Brendan, I need to benchmark this, I'm going to use Bonnie++, or whatever it is, and I say no, no, don't use that: it's misleading, there are bugs, there's overhead, you'll shoot yourself in the foot. And they say, okay, what's the alternative, what should I use? And sometimes it's like: there's nothing. There's absolutely no benchmark for this area that you will not hurt yourself with; you're going to have to create it yourself. Some people really struggle to get their head around this: but this one is really popular, there are like 20 benchmark alternatives on the internet, can't one of them be right? It's like, no, actually none of them have to be right; they can all be wrong. You need to understand this.

Some common mistakes. Testing the wrong target, very common. You download some benchmark, seems like a good idea, but you're testing something that doesn't matter for your actual workload. That's where I think I'm evaluating a storage product and I think I'm testing the disks, but I'm hitting the file system cache. And so I'm like, wow, my disks are doing like 10 gigabytes a second from one spindle. No, that's your file system cache; you're testing the wrong target. Choosing the wrong target: at that point you say, okay, I will do direct I/O, or disable the file system cache, and I will test the disks.
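For instance, a hedged sketch with fio (the filename and sizes here are made up for illustration), showing the two different targets you can end up testing:

    fio --name=cached --filename=/data/fiotest --rw=randread --bs=4k --size=1g \
        --runtime=30 --time_based                 # may largely hit the page cache
    fio --name=direct --filename=/data/fiotest --rw=randread --bs=4k --size=1g \
        --runtime=30 --time_based --direct=1      # bypasses the cache, hits the disks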
But why? That's not actually real world, as I said earlier, because in the real world you will have the file system cache. You may as well test through the file system cache; just know what you're testing.

Invalid results: sometimes benchmarks have bugs, just like observability tools and metrics. Ignoring errors is a great one, and I've done this myself, where I get a great result out of a web server and I think, wow, look at this, I did a million requests per second. Wait a minute, these are HTTP 500s; they're all errors. We discovered that the error path is faster than the success path. So if you were wicked you'd use that in your benchmarking results: look how I did a million requests a second. Yes, and they all failed. It's because in the function you do the error checking at the top and then return the error, and then below that you do the actual work, so it's no surprise the error path can be quicker. Ignoring variance or perturbations is another common mistake: the real-world workload isn't steady, and so we call it sunny-day performance testing versus rainy-day performance testing. Sunny-day testing is maximum throughput with a perfect workload, but in the real world it's more like a rainy day, with variations and perturbations, and you just don't have any insight into that because you never tested it. And of course misleading results: if you're not paying close attention, benchmark A claims to measure one thing, actually measures something else, and you conclude you've measured a third thing.

Just to go through the types of benchmarks quickly. Micro-benchmarks: these are the ones that test a specific function in isolation, such as file system maximum cached read operations per second, or network maximum throughput. These are pretty useful because they're easy to debug; if you're trying to understand why a micro-benchmark has regressed, there's not that much on the operating table to get to the bottom of, like getpid() in a tight loop, or the speed of /dev/zero. And that's also the problem: it's easy to test things that aren't very relevant, or to miss workloads that are relevant. So micro-benchmarks are a useful tool, but you need to match them to a real-world intended workload. One way people do this is macro benchmarks: that's where I do my full client simulation of logging in, doing stuff, and logging out. Common problems are misplaced trust, where you believe your macro benchmark must be realistic, but it's not; it's just like everything else, it can have lots of problems you need to debug. And they're also complex to debug, now that you have everything on the operating table and you have to figure it all out. So if you've found a regression with a macro benchmark, I would try to reproduce it using a micro-benchmark, just to make it quicker to analyze. Kitchen-sink benchmarks are also popular: let's run everything and then come up with a value, like an average across them all. Lots of problems. The myth is that more benchmarks means greater accuracy; really, more benchmarks is just more opportunities for errors.

To mention a few in particular: Bonnie++ is a popular hard disk benchmark, or so its website says: it does a simple test of hard drive and file system performance. I had to do an analysis of this once, and I began with the first metric, which was the per-character sequential output. I found that what it was actually testing was one-byte writes to libc via putc(); libc would then buffer them.
By the time libc talked to the file system, it was doing 4-kilobyte writes; the file system and volume manager did their own buffering and placement and grouping; and by the time it talked to the disks, it was doing 128-kilobyte asynchronous writes. So I had a customer who was worried about the disk result, the per-character sequential output, thinking it's a disk result, and it's got so little to do with the actual disks: by the time you're talking to the disks you're doing something totally different. And bizarrely I found things like: I can actually tune this, I can change the buffering inside libc with setbuffer(). Anyone ever tuned setbuffer() in libc? Someone has, excellent, so there's a learning experience. And of course there could be other things: I could have ionice turned on, and so your benchmark is accidentally testing a Linux I/O throttle. So it's really error-prone; you really have to get to the bottom of it. Bonnie++ did update their code, so this particular test now does direct I/O and will actually do one-byte disk I/O, but that's just an example of the sort of many issues you can run into.

Another common one is Apache Bench, which is single-thread limited. So many times I see a result where you're actually limited by Apache Bench being single-threaded, which is why people use wrk. There are some problems with ab's code, but there is another problem, and it's not so much ab's fault: whether you use keep-alive or not. I've had the situation where, if I don't use keep-alive, it becomes an unrealistic TCP stack benchmark: it's creating and destroying sessions for every request, and your kernel system time goes through the roof. It's like, okay, did you mean to test the TCP/IP stack? I think you were trying to test a real-world workload. And if you use keep-alive, then it can go at light speed, because it keeps everything alive, and now you've become an unrealistic server throughput test. So again, it's misleading to begin with, because you think it's going to give you a nice simple result, but you really have to dig in and think about it to get value.

The other one I'll mention is UnixBench. UnixBench still exists; it's the original benchmark from 1984, published in BYTE magazine. It runs lots and lots of micro-benchmarks that make up the BYTE index, things like pipe throughput and pipe-based context switching. Many problems; many, many problems. I've been meaning to write a blog series about all the UnixBench problems, in the hope that I can stop people from using it. Unfortunately I only got as far as the Makefile. It tells you to run this shell script called Run, and it builds using the Makefile. So I ran Run, and this is what the Makefile has: this is all hashed out, for Solaris 2, use these options. I'm running this on Linux, I'm running this on Ubuntu, and it's carrying Solaris 2 options around in its Makefile. In the Computer History Museum last night there was a copy of Solaris 8 sitting there; this is Solaris 2, and that's what UnixBench is still referring to in its Makefile. And the problem is the temptation to pick the wrong thing: let's uncomment this. If you start looking at it you might think, should it be -O2 or -O3, maybe I'll change that. Nooo, don't change it, what are you doing? You're changing the benchmark result, and you're baking your choices into the numbers; you forget you did that, and you'll be telling your friends, look, my UnixBench was faster than yours, it must be the server, when actually you went and hacked at it: why am I using Solaris 2 options, I should go and uncomment this and hack away.
So that's a really big problem. The UnixBench documentation says: hey, the results depend on not only your hardware but also your system libraries and even your compiler. And it says if you want to publish any results, you should include your compiler versions with your results. The interesting thing is the documentation told you all along that this is going to include the compiler and how you built it, yet it's rare that you see someone publish the compiler settings with their UnixBench results, and people compile it up on different systems, so it's really problematic to try to compare those numbers. I might write that series on UnixBench one day and actually go through the micro-benchmarks and how they can end up misleading, but I got a little bit depressed just at the Makefile. It was innovative and useful at the time, but its time has passed.

So, what you can do about benchmarking: match the benchmark to your workload, and use this methodology called active benchmarking, where you configure the benchmark to run in steady state, for long enough (hours, or 24x7), and you do performance analysis while the benchmark is running. You answer the question: why is the benchmark result X and not 10 times X? Because if you can answer that question, you've found the limiter, and then you often discover the limiter is something silly, because you configured it wrong or you're testing the wrong thing, and you just go through the normal process of root cause analysis. This is actually a great way to learn performance, because if you do root cause analysis on a production system, you don't really know what the limiter should be.

In summary, how to get out of trouble: observe everything; trust nothing. Even though everyone is using that metric, it doesn't mean it's right. Metrics can often be misleading, and metrics can often be missing as well, so you want to pose questions first and then find the metrics to satisfy them. You might have to create new metrics; that's why I like my functional diagrams. Profile everything, like Java mixed-mode flame graphs; this will solve a lot of issues, and it's really useful for benchmarking as well, to show what code is actually executing and what's on the table. Visualize everything: I want to see histograms of latency, I want to see heat maps of latency, which is a histogram over time. So here's my bimodal disk I/O, and I can see there's a wider mode every 5 seconds or so; great, that's what I want to see, more information. And finally, benchmark nothing: just stop doing benchmarking and then the problem will be solved. Or, if you must do benchmarking, please do active benchmarking, where we analyze what's going on and get to the root cause.

I've got links and resources in my slide deck, which I'll post on SlideShare, especially things that aren't broken. As I said, this is complementary to my SCALE talks where I talked about performance tools and things that work, so if you leave this a little bit depressed, there are suggestions of things to do, and you can also check out my earlier talks about things that work. And that's my talk, thank you very much, and we do have time for questions. I'll say before anyone runs away: there are some Netflix people here as well, if you'd like to talk to us about working at Netflix. It's an awesome place to work in 2016, so please hit up one of us.

Questions? Yes: what's my favorite tool for NFS benchmarking? I used to do a lot of NFS benchmarking, and at one point when I was doing it I wrote a Perl program as an NFS benchmark, because I was so sick and tired of the various popular tools being broken.
It got to the point where they said my Perl program was the gold standard of benchmarking. It's like, wow, you have to be kidding me. But the point was it was a simple program that used the interface; I actually published it, and you can probably still find it online. It was a very simple program, I just did syscalls directly. I should have written it in C, a nice simple C program that you can debug and understand what it does, because as soon as you start to use more complicated things, things go off the rails. If I had to do file system benchmarking now, I do like fio. fio is nice: it gives you some distribution information in the output. It does assume a type of distribution, but it's not too bad. So if you're going to pick up a commonly used one these days, fio is good. It's a minefield; lots of bad ones.

Yes, benchmarking is a pretty depressing area. The question was: weren't there times when compilers detected that they were being benchmarked and changed the compiled output? And yes, there are stories like that. You have to debug everything and know what's going on; it's pretty crazy.

Did you see the question? Yes, so the question was: what's the relationship between processor queue lengths and CPU usage? Run queue lengths on Linux are a nice measurement of saturation: how many threads are queued up waiting for their turn. Percent CPU utilization is how busy a CPU was during a given interval. Now, your mental model may be: I will go from zero to 100% utilization, and then if I keep throwing threads at the CPUs I will then get queueing. That sometimes happens, but sometimes it looks a bit different. Sometimes you'll get to, say, 50% utilization and then you start to see queueing, and you think: how can I have queueing for CPUs when I have all this headroom? And so again, you debug it, and you find out that you're running Erlang and it's bound to particular CPUs and the threads can't use the headroom because they won't unbind themselves. Or you find out it's because of statistics and the sample interval: you may be sampling over one second, during which for 500 milliseconds the CPUs are flat out and you have lots of queueing, and for the remaining 500 milliseconds the CPUs are idle, and you end up with percent CPU utilization at 50% but with queueing, and it's like: how can this be? So always get to the bottom of things; it's really useful, and you'll learn what's going on.

Yes. Okay, the question is: have I seen it where Linux perf_events doesn't detect kernel symbols? No, but we're on Ubuntu and things work pretty well. I do have some of my own compiled kernels where, if I mess up the compile, I can break perf and the way it fetches symbols. I don't think that's a perf problem, I could be wrong, but I'd just get to the bottom of it if perf can't recognize kernel symbols. Yeah, if you're debugging perf and symbols: run strace -f on perf and look at the open() calls, find out all the files it's opening and see if it's going to the wrong directories. You just have to go through some debugging. Yeah, use strace to debug perf, I know, one tracer to debug another, and find out why its symbols aren't working.

Next question. So the first question is how do I measure IPC, and the Linux perf command does that. You do need PMU access; I've got a PC under my desk for that stuff. If you run the perf stat command it will give you the IPC summary; run perf stat -a sleep 10 to do it system-wide for 10 seconds.
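That invocation, roughly; the line to look for in the output is "insn per cycle":

    perf stat -a -- sleep 10      # system-wide counts for 10 seconds, including instructions,
                                  # cycles, and the computed instructions per cycle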
If you go to my home page I've got a lot of perf examples (I've got a separate perf_events page) with examples for measuring IPC and cache misses and all that stuff. It's a really cool command on Linux; it's a great go-to tool for the performance monitoring counters.

The next question was: have I used sysdig? Yes, I contributed some chisels to sysdig; I did the one for sub-second offset heat maps. sysdig is pretty cool; it's nice to see innovation in the tracing space. What I'd note about sysdig is that it is an event-based tracer, so it passes all the events down to user level and then processes them, and I would rather see that summarization happen in the kernel. However, they have done a pretty good job of making the overhead as low as possible. So my first instinct is, I can't see myself using sysdig, the overhead will be too high, and then I test it and it's like, oh, they've actually done a pretty good job of lowering it. It's pretty interesting; they are doing innovative stuff in sysdig. I'd love to see sysdig pick up eBPF, the in-kernel engine, which would allow them to do some in-kernel summaries.

The question: do I see differences between problems found on real hardware and virtualization? Yeah, when you're virtualized there's just a whole heap of different issues. Getting timestamps can be different: we've had issues where we've had to tune our clock and use a different clocksource in Linux, and I've never had to do that on hardware, but when it's virtualized, it's a hypercall. So yeah, there are times when you need to know. And of course you're using paravirtualized drivers, with this sort of coalescing of I/O together. It is a different beast. But if you're really good at one, let's say you come from a hardware background and you go to the cloud, you will pick it up, because you've worked on similar issues before. But definitely it's a different beast to work with.

Yes, so the question was: what tools do I use to analyze memory? If it's just high-level memory statistics, then you do get a lot from the /proc counters, /proc/meminfo, and you've got ps; it should be fine. You do need to understand the difference between shared memory and dirty memory, whether the application has dirtied it, so that it's actually page-faulted and allocated physical main memory. And so I use pmap -x a lot: pmap -x and the process ID. It's really fast on Linux, not so fast on some other operating systems, and it will dump the memory address space and you can see the different types of memory. So I use that to get to the bottom of a lot of process memory issues. If it's kernel memory, then things like /proc/slabinfo and other stuff out of /proc.
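For instance (taking a Java process here just as a hypothetical example):

    pmap -x $(pgrep -x java | head -1)    # per-mapping Kbytes, RSS and Dirty columns,
                                          # with totals on the last line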
So, comparing Linux to FreeBSD in terms of, say, observability tools: it's different. At some point I'd like to do a differences talk, because it'd be pretty interesting. On FreeBSD there's a lot of common stuff: you've got your top and your ps and your vmstat and iostat and whatnot. One thing I like about FreeBSD is that pmcstat has better default groupings of the performance monitoring counters. You've got DTrace on FreeBSD, which is really easy to use to walk through the kernel (I've written lots and lots of DTrace scripts), so that's currently much more developed on FreeBSD than the equivalent right now on Linux. Right now on Linux I'm using things like ftrace and perf_events and eBPF to do kernel-based tracing, things that are built into the kernel. So it's different, but FreeBSD is pretty good, and right now, until things like eBPF and ftrace surpass it, it generally depends on the performance issue I'm working on, but I can often have a better time on FreeBSD than I do on Linux, just because there's a lot of stuff there. It's a nice thing to check out if you haven't already, and of course Netflix has lots of FreeBSD on our OCAs, which are our CDN. I'll take one more question and then we'll break. I saw one more question. Okay, thank you. Right.