Good day. My name is Brendan, and CPU utilization is wrong. This is my five-minute public service announcement.

Now, you may say that utilization as a metric has always been wrong, but for CPU utilization it is particularly wrong, and it is getting worse as CPUs get faster. Yes, I'm talking about CPU utilization. This is a metric we use everywhere: it's on all our dashboards, it's in our tools, it's used by Netflix in auto-scaling rules, and it's a metric that's very old; it's been around since time-sharing systems.

How it works is the kernel measures CPU utilization as the time from when a thread begins running on-CPU to when it stops. However, the kernel assumes that the CPU is utilized that entire time. What can actually happen is that the instructions stop and stall, waiting on an external resource, and the kernel still chalks this up as CPU utilization.

To give you a visual idea of how this works, imagine I told you your CPUs were 90% busy. You might think of it like this. However, what's really happening is that the CPUs are 20% busy, and they're spending 70% of their time stalled, not making forward progress on instructions, as they're waiting on external resources. This ratio between 20 and 70 that I've drawn is what we see in the Netflix cloud. So this is real, and it's getting worse. It's getting worse as CPUs get faster.

To explain this in more detail, I'll show an example, a case study. This is MySQL, running the same workload on two servers. The second server is 27 percent slower than the first, and if you look at the metrics, it's CPU bound. Wow, I love issues that are CPU bound, because it means I get to use CPU flame graphs. For the second server I would expect to see some tower in the code that's 27 percent of the overall time. However, when I did flame graphs for both of these servers, the code was basically the same; the widths were just a little bit different.
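The 90%-reported versus 20%-busy/70%-stalled split above is just arithmetic once you have a stall estimate. This is an illustrative sketch only: the stall fraction here (0.78) is an assumed value of the kind you would estimate from PMC stall-cycle counters, not a number the kernel reports.

```shell
# Illustrative arithmetic: split a kernel-reported utilization figure
# into instruction-busy vs. stalled time. stall_fraction is an assumed
# value; in practice you'd estimate it from PMC stall-cycle counters.
reported_util=90        # percent, as reported by the kernel
stall_fraction=0.78     # assumed fraction of those cycles stalled

awk -v u="$reported_util" -v s="$stall_fraction" 'BEGIN {
    printf "busy: %.0f%%  stalled: %.0f%%\n", u * (1 - s), u * s
}'
```

With these assumed inputs it reproduces the talk's picture: 90% reported utilization resolves into roughly 20% busy and 70% stalled.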
So there's about a 10 percent difference, but there's no extra code here to explain it. Now, I do see this from time to time, and the reason is that there is something happening at a lower level than the code. This is where I need to look inside the CPU, at performance monitoring counters and model-specific registers, to understand it.

My first stop at understanding low-level CPU issues is to look at the real clock rate that's running. Now, I've written a bunch of open source tools to help do this. This one is showboost, and it uses MSRs to show that the two servers were actually running at 3300 megahertz, so that doesn't explain the difference. One server can run at a different speed to another because it's in a colder part of the data center and it's allowed to turbo boost more quickly, so I always run showboost just to check. There are other tools that can look at this as well; you don't have to use my open source tools, you can use other ones, but that's my first stop.

Next is instructions per cycle. Here's another open source tool. Instructions per cycle is like miles per gallon: it shows how many instructions completed for the CPU cycles that were consumed, and the higher the better. Server B had a lower instructions per cycle, 20% lower, which is a very big clue that this really is a low-level CPU issue. So now I can find out why IPC is low. IPC is low because of stall cycles, and I've written many other tools. This one is tlbstat, and it shows there are 16% more cycles in TLB misses. The TLB is the translation lookaside buffer; it's what the memory management unit uses as a cache for virtual-to-physical address translation. Whenever instructions do loads and stores, if the TLB cache can't return an entry, the MMU must talk to main memory and do page table walks, which are very slow. Now, TLB issues.
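IPC is simply instructions retired divided by cycles consumed; on Linux, `perf stat` will print it for you as "insn per cycle". A minimal sketch of the calculation, with assumed counter values (the real invocation is left in a comment since it needs PMC access):

```shell
# IPC = instructions retired / cycles consumed. On Linux you can read both
# counters system-wide with, e.g.:  perf stat -a -- sleep 10
# The counter values below are assumed, for illustration only.
instructions=48000000000
cycles=60000000000

awk -v i="$instructions" -v c="$cycles" 'BEGIN {
    ipc = i / c
    printf "IPC: %.2f\n", ipc
    # Rough heuristic, assuming a ~4-wide x86 core: IPC well below 1
    # suggests the CPU is stall-heavy rather than instruction-bound.
    msg = (ipc < 1.0) ? "likely stall-heavy" : "likely instruction-bound"
    print msg
}'
```

The threshold of 1.0 is a rule of thumb, not a hard boundary; what matters for the case study is the relative comparison, where server B's IPC was 20% lower than server A's.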
I haven't seen TLB issues in a long, long time. It seems very strange that I'd pick a TLB issue as my case study for CPU utilization being wrong. Now, there's a reason I picked this, and I'll just give you a few moments to think about it: do you expect to see such a TLB issue in the real world? My next slide will show what was causing it.

Yes, you will see this in the real world. This was caused by the KPTI patches that were merged in Linux 4.15 for the Meltdown vulnerability, and these have been backported to all of the other kernels. The patches cause TLB flushes, those flushes cause stall cycles in the CPU, and that stalled time is reported as CPU utilization, which is misleading. I have a blog post about it, and that is my talk.
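If you want to know whether your own kernel is running with the KPTI mitigation described above, a sysfs file reports it. A small sketch, assuming a Linux 4.15+ kernel or one with the backported patches (older kernels won't have the file, which the fallback branch handles):

```shell
# Check whether the KPTI/Meltdown mitigation is active on this kernel.
# /sys/devices/system/cpu/vulnerabilities/ exists on Linux 4.15+ and on
# kernels with the backported patches; older kernels may lack it.
f=/sys/devices/system/cpu/vulnerabilities/meltdown
if [ -r "$f" ]; then
    status=$(cat "$f")       # e.g. "Mitigation: PTI"
else
    status="unknown (sysfs file not present)"
fi
echo "meltdown: $status"
```

On a patched kernel this typically prints "Mitigation: PTI", which is exactly the condition that produced the extra TLB flushes and stall cycles in the case study.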