All right. So, what a conference, to have Leonard as kind of an opening act to my talk. This is going to be a fairly general talk about system performance. By systems I mean Linux systems, and to be more specific, Linux servers.

Before I start, a little bit about myself. I like computers a lot, I like them fast, and I have been lucky enough to always find employers who provide me with those. I used to work for AMD in Dresden, which is about two hours south of where we are right now. I was there when we had the first one-gigahertz processor, the first one-gigahertz x86 processor I should say, and I was there when AMD introduced the x86 64-bit extensions. That was quite an interesting time. For the last five and a half years I have been with Amazon, and what we do in Dresden is basically the kernels and hypervisors that Amazon runs on its servers. That work comes from a group with multiple offices, but a big one is in Dresden. I promised to put this up: we are hiring. If you are interested in this, have a look or talk to one of the several people here.

All right, but on to the talk. I agree the title is a little silly, because what does it really mean to be fast? That guy was fast: about two weeks ago in Berlin there was a marathon, and the winner ran it in a little over two hours, two hours and one minute or so, a new world record. The interesting thing there is that you have one number and you can compare it, and that is really it. You say go, you measure wall-clock time, and when the guy comes back in after 42 kilometers you have a number, and you can compare those numbers. Computer systems, unfortunately, are way more complex than that, and there are way more things you can measure.

So let's have a look at how the desktop presents you an idea of how fast it is. GNOME has this system monitor.
That's what they call it, and on top you see CPU usage over time, then memory usage and network usage over time. No big surprise, but I would probably also add the hard drive; that is also an interesting number to watch. I guess most of you SSH into a server and run something like top or htop, which gives you a little more information. One of the numbers you probably cannot really read here is load.

I remember back in the days at AMD, I started in the IT infrastructure team there, and we had a weekly ops meeting for the database servers and all those things. One of the numbers we always looked at was load, and every other week or so a big discussion started about what this figure actually means and whether it is useful to look at or not. We repeated this over and over again, so this slide is basically for my former co-workers. From the documentation in the file where it is implemented: load is the number of running processes plus the number of uninterruptible processes in the run queue of the Linux kernel. Since this number goes up and down like crazy, you want an average over one, five, and fifteen minutes, and that is an exponentially decayed average.

It is quite interesting to check out the history of where load actually came from. Here is a quote from the person who added the uninterruptible processes, which at the time mainly meant waiting for disk. You can read all of that in a blog post by Brendan Gregg, who is a hero when it comes to performance measurement; I will quote him a couple of times in this talk. The thing to remember is that load is a very good indicator for humans to glance at a system and see what it is doing, but if you start to think for longer than ten seconds about what this number really means and whether it is actually telling
you anything, then look for something else.

Before we dive deeper into all of that, I would like to come to a common understanding of some statistics and chart types. The first thing is the mean value versus the median. Mean basically means you have a bunch of numbers, you take the sum and divide by the count. For the median you line them all up, order them, and pick the one in the middle. Interestingly, the median is the one you should choose for performance analysis in many cases, probably most cases, because it is much more robust; you will see that in a second.

Another thing you should be familiar with, because it is extremely helpful, is percentiles, along with the mean and standard deviation. This is the probability density function of the normal distribution. Percentile basically means you again order all your measurements in a long row, by size, and then the 50th percentile is right in the middle. For this normal distribution the mean and the median are at exactly the same point, right in the middle, and at two sigma you have roughly the 97.7th percentile, which means that about 97 percent of your measurements are below this number.

Why is that important? Just imagine you measure, over time, some web API that you query, and what you are interested in is how long it takes to get the response: the latency of that thing. The green line is the median, and you see that this number is pretty robust.
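To make the robustness point concrete, here is a tiny sketch with made-up latency samples, one of them an outlier. The `percentile` helper uses the simple nearest-rank method:

```python
import math
import statistics

# Nine ordinary response times (ms) and one outlier.
samples = [10, 11, 9, 10, 12, 10, 11, 9, 10, 500]

mean = statistics.mean(samples)      # 59.2 -- dragged way up by the outlier
median = statistics.median(samples)  # 10.0 -- barely notices it

def percentile(data, p):
    """Nearest-rank percentile: sort, take the value at rank ceil(p * n)."""
    data = sorted(data)
    return data[math.ceil(p * len(data)) - 1]

p95 = percentile(samples, 0.95)      # 500 -- the tail is exactly the outlier
print(mean, median, p95)
```

That is the whole story in three numbers: the mean is distorted by one bad request, the median hides it, and the high percentile is the number your unluckiest users actually experience.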
So if you want to eliminate outliers from your metrics, that is really the thing to use. The light blue one is the mean value, and you see it is moving a little. And if you serve this service to real users, you are probably interested in the percentile, I think that is the 95th percentile here, and you really see how it goes up. That still means that five percent of your users have a worse experience at this point, which is not good.

All right, another thing to always keep in mind when you do performance measurements is caches. Here is someone who measured latency, I believe for his hard drive, and what do you see in this histogram? It goes up, it goes down, and it goes up again. By the title you will probably have guessed: here you are basically measuring the cache. It might be that here you are really just measuring access to memory, while over there is where the system really has to scrape things off the hard drive. And of course the first comment on this blog post was to use O_DIRECT, direct I/O, to bypass the cache.

For latency, and I was expecting that this is hard to read here, there are some numbers to keep in mind; if you want to check this out, there are many more examples in this gist. If one CPU cycle, which takes about 0.3 nanoseconds, were one second, then a level-three cache access would be something like 43 seconds, which is quite some time compared to a level-one cache access, which would be three seconds. And if you go further and have to talk to real memory, that would already be six minutes.
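The scaling behind these humanized numbers is plain arithmetic; a sketch, using the usual ballpark figures from the well-known "latency numbers every programmer should know" table:

```python
# If one CPU cycle (~0.3 ns) were stretched to one second, how long would
# other operations feel? Ballpark latencies, in nanoseconds:
CYCLE_NS = 0.3

events_ns = {
    "L1 cache access": 0.9,        # -> ~3 "seconds"
    "L3 cache access": 12.9,       # -> ~43 "seconds"
    "main memory access": 120.0,   # -> ~6-7 "minutes"
}

for name, ns in events_ns.items():
    print(f"{name}: {ns / CYCLE_NS:.0f} s on the human scale")

# A 1 s TCP retransmit on the same scale comes out at roughly a century:
years = (1e9 / CYCLE_NS) / (3600 * 24 * 365)
print(f"TCP retransmit: ~{years:.0f} years")
```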
You can read all of that here, but one more number: a TCP retransmit, which takes one to three seconds, would in this perspective already be something like 100 years. And a physical system reboot, let's say that takes five minutes; that is why I put this up here. If you took that much time back into world history, that was when the Sahara still was an ocean. So that is something I think you should remember from time to time: caches really are a very good idea, and if you have too much network I/O, that already adds a lot of latency.

I took this table out of this book. To be honest with you, you should be reading this book instead of listening to me; Brendan Gregg is really one of the people to listen to, and I highly recommend this book. I would know a good place to order it, but I leave that to you.

We have seen a couple of chart types already: line charts, histograms, bar charts. This one has nothing to do with computers, but I wanted to introduce the concept of scatter plots. This is basically how much tip people give in a restaurant depending on the price of the bill and on gender. What do you see? If you go out for lunch, it is pretty much the same. In this chart you have multiple dimensions: you have gender, you have the tip of course, and the total bill, and you see these clusters of information. Interestingly, when it comes to dinner, male customers tend to give way more tip than female ones; I don't know why.

Another chart type that is very informative for humans.
Those are heat maps. What you see down here is a heat map for network latency, where at some point a bandwidth limitation kicks in, and then you see how the latency is distributed across your packets. That is not something a single number, or even a distribution, could really tell you.

Another thing Brendan pushes quite a lot is flame graphs. They are also extremely informative for understanding what goes on in your system.

Right, now I would like to come back to the categories we had before: CPU, memory, disk I/O, network I/O. Let's start with CPU performance. Basically the question is: how long does a computational task take? I asked one of our principal engineers how he would put this in three bullet points for a slide, and his answer was: it all depends. And it's true, it really depends on the architecture. My first computer had a Z80 CPU in it; that was something you could easily understand. Modern CPUs are quite complex systems. It also depends on caches, as we saw before, on memory latency, on IPC, that is, instructions per cycle, on clock speed, and on whether you are after integer or floating-point performance.

In the good old days, clock speed was the thing. I already mentioned we had the one gigahertz, then it was two and three, and then people found out that we actually cannot scale this up to infinity, so they started to add more cores and so on. But of course, if you have an application that is bound to one core, then the other 71 you saw before are idle. So the answer really is: it is a mix of different micro-benchmarks you would need to run to answer this question, and it really comes down to your workload, to your application, to say whether you have the right CPU performance or not. There are standardized ways like SPEC CPU, but that costs money to use.
A free benchmarking and stressing tool I find extremely valuable is stress-ng. By now it has something like 70 CPU-specific stressors, as they call them; basically micro-benchmarks. The tool lets you say: I want everything you consider a CPU benchmark, a CPU stressor, and a CPU-cache stressor, run each for a minute, and just dump the metrics. Then you get a long table like this. They actually call the results bogo-ops, which is not a bad name, because you never really know what that number means at the end of the day, but you can compare those numbers between different CPUs and different architectures. The art, in the end, is to pick the right ones out of those and combine them in the right way, and that is basically what the SPEC consortium does with SPEC CPU. Though you have to be careful to use the same config if you want to compare two SPEC CPU runs with each other.

I mentioned stress-ng: overall they now have about 200 of those stressors. They hit various subsystems, like block I/O or the network. You can do things like thrashing caches, basically invalidating them. They also target kernel interfaces, and it is a very good tool to check whether your system is robust.

Right, let's get to block I/O. The situation here is: we want to read and write files, and they are split into blocks of different sizes. One kind of application reads a long sequential run of blocks of a big file; you basically want to get it off your hard drive. Another type of application requires quick random access, and then seek time is a big factor. As you will see for the other subsystems too, the tradeoff is always high throughput versus low latency.
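For block devices, throughput and latency are also tied to how many requests can be in flight at once, via Little's law (concurrency = throughput × latency); a minimal sketch of the relation, with made-up numbers:

```python
def max_iops(queue_depth, avg_latency_s):
    """Upper bound on sustained IOPS for a device that completes each
    request in avg_latency_s and can keep queue_depth requests in flight
    (Little's law: requests in flight = IOPS * latency)."""
    return queue_depth / avg_latency_s

# Four requests in flight, each taking ~100 microseconds:
print(max_iops(4, 100e-6))   # ~40000 IOPS
```

The same relation read the other way explains the warning below: a huge IOPS figure achieved at a huge queue depth says nothing about the latency an individual request sees.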
A few more parameters one has to understand when we talk about block I/O. IOPS: input/output operations per second, which people also call I/O capacity. Sometimes manufacturers come up with a big IOPS number to impress you; it does not always reflect what your block devices are actually capable of. Another parameter to look at is queue depth, which is basically how many I/O operations can be in flight, or queued, at the same time. And latency means how long an I/O operation takes. When you have a big queue depth you can get tons of IOPS, but that does not mean your latency is where it should be for your application; let's assume your application needs very low latency. And there is a little formula for how to calculate your IOPS from those.

This is a nice visualization of the Linux storage system that I found. Up here are the file systems and caches, that is the part you can bypass, then you have the whole block layer, and down here your devices. When you benchmark block I/O, that is something to have in the back of your head.

There are many, many tools to measure all those parameters. The one I find extremely useful, the one I use the most, is fio, which stands for flexible I/O tester. It is actually from the kernel maintainer who is in charge of the block I/O layer in the Linux kernel. To get started with it, I recommend looking at the examples that come with it; there are even examples of how you can measure S3, so it is capable of many things, as you will see over the next couple of slides. It can even do HTTP requests.

So this is a default command. name is basically the file on the disk we use for writing.
We say we want a random read/write benchmark. We want direct I/O, so we do not measure memory performance, caches that is. We use the I/O engine libaio. The block size is one megabyte. The mix for random read/write is 75 percent reads and 25 percent writes, which is usually a pretty good mix. The size of the file we create for writing is one gigabyte. We want group reporting, even though the output does not all fit on the slide. An iodepth of four means we have four I/O operations in flight, though that is not necessarily what arrives at the controller. And we want two jobs in parallel.

Then you get tons of output. There are blog articles that discuss it, and I have to warn you that with newer versions of fio the format changes from time to time, so if you want to parse it, don't: fio can output all of this in, for instance, JSON and other formats that you can easily parse later on. What you get here is basically the read IOPS, 166 in this case, I/O operations per second, which at this block size gives us a bandwidth of about 166 MiB per second. Another thing that may confuse you a little: it gives you slat, clat, and lat; those are the latencies for submission and completion, and the total. Something else you should be aware of, and I know it is probably hard to read: always look at the units here, because sometimes you wonder why the numbers do not add up, and then you see it is all about the units. You also get percentiles in your output, and there is tons more information; it even reports page faults. Whenever you see major page faults, something else has probably crept in.

Network. We basically have the same problem, but this time we call it transmit and receive: receive is reading and transmit is writing, to the network. We have the same considerations.
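Going back to fio's output for a moment: the JSON report (`--output-format=json`) really is far easier to post-process than the text output. A small sketch of pulling out the headline numbers; the field names here follow recent fio versions and may differ in older ones:

```python
import json

def summarize_fio(path):
    """Pull read IOPS, bandwidth, and p95 completion latency out of a
    fio JSON report (generated with: fio ... --output-format=json)."""
    with open(path) as f:
        report = json.load(f)
    job = report["jobs"][0]            # with group_reporting, one merged job
    read = job["read"]
    return {
        "read_iops": read["iops"],
        "read_bw_kib_s": read["bw"],   # fio reports bandwidth in KiB/s
        # completion-latency percentiles are keyed by strings like "95.000000"
        "read_p95_clat_us": read["clat_ns"]["percentile"]["95.000000"] / 1000,
    }
```

With group reporting enabled, the two parallel jobs from the command above show up as one merged entry, which is why indexing `jobs[0]` is enough here.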
It might be that we want to stream this talk to the internet; then it is all about throughput. It might be that we want to do a video call over the internet; then latency, basically round-trip time, is a big issue. And we have the problem that networks are usually unreliable: if you transport something over the internet, there is no guarantee that your packets arrive.

I showed you this kind of picture before. It is actually done with fio; fio can also run client/server, and then it is the same thing, we measure I/O. If you take the latency for each packet that gets transferred and use something like matplotlib, you can get these very nice heat-map charts, where you can try to understand what is going on for many, many hours. If you want to know what is going on here: again a bandwidth limitation kicks in, and then at some point these two bands spread out; so far I have not found out what that means.

Right, I said one of the challenges is that you have packet loss, packet corruption, retransmits, and all those things. There is a nice module in the Linux kernel called netem. You would put this between the client and server where you measure your network performance, and with three lines you can introduce those errors. Basically, you load it into the kernel, you add it to your root qdisc, and then you say things like: I want a delay of 50 milliseconds, plus or minus 10 milliseconds, normally distributed; I want 0.1 percent corruption; and I want 0.1 percent packet loss. And that is what you get: the latency and so on is much more spread out.

For memory it is again the same thing, read and write: you are interested in throughput, you are interested in latency.
We have to make sure not to measure cache performance. Other things to keep in mind are page faults and how you actually access your memory. And if you have NUMA nodes, keep that in mind too, to really measure the thing you want to measure. One option, again, is stress-ng; this time we use the classes memory and vm. Again, it takes time to find the right stressors for your workload. There are other benchmarks, like this one called X-Mem, mainly developed by Microsoft but open source, which basically has all the features you might be interested in and supports many different platforms. So in case you are interested in ARM, X-Mem serves the purpose.

Right, some other performance measurement tools I find useful, which came to mind when I made the slides. First of all, again, I would like to point to Brendan's blog. There are tons of tools; I do not really know where to start. perf is a very important one, then strace, iotop, blktrace, iostat. There is a great talk by Brendan himself; you will find the link at brendangregg.com/linuxperf.html. It is totally worth spending that hour: he goes over many of these tools and explains in which cases they are useful.

Another one I found extremely helpful is sysjitter. Maybe I should first explain what sysjitter does.
It basically runs a thread on each CPU of your system and measures the time for which each thread gets interrupted. This gives you a very good measure of how much jitter is introduced in your system. Of course, you should run this on an idle system, because otherwise it does not mean much, but it is a very good tool to see whether something strange is going on. In this case it ran over four CPUs; you get percentiles and everything, and here you see the total interruption time in percent. So this system, first of all, looks pretty even across CPUs, and it also looks pretty much okay.

Then you have those tools that more or less try to do everything; those are more like frameworks. One of them is sysbench. If you do not want to get into the details of finding the best CPU micro-benchmarks and so on, then sysbench or something similar is a good choice. You get one number for CPU, one for memory, and so on, and they also have a MySQL database benchmark. You can add your own, too. I am not sure whether they provide charting, but it gets the job done pretty quickly.

Another one is HPCC, the HPC Challenge suite; I mention it especially for the STREAM memory benchmarks, which are pretty useful. I like their chart where they plot the numbers of all the benchmarks they support: either you get a circle, which is good, or you get these odd distributions where one benchmark is good and another one is not.

PerfKit Benchmarker is something from Google, where they basically wanted a framework with which they can benchmark different cloud providers. Even if you are not interested in cloud computing and have your own servers, they have a long list of benchmarks that make sense, and the other good thing is that they already come preconfigured,
so you know which parameters are a good pick, and of course you can run those things individually.

Another one, probably the largest, is the Phoronix Test Suite. They really have a large collection of benchmarks and tests in there; it is a framework. You will see on their web page that they sometimes try to attract readers in kind of a clickbait way. And a big warning: make sure you do not upload your test results to the Phoronix servers. It is said that some companies did this accidentally with products they had not even shipped yet; that is probably not what you want.

Right, I would like to close my talk with a case study I found a couple of days ago. The headline is: running a database server on EC2? Your clock could be slowing it down. They obviously had a performance problem too, they did a long analysis, and they figured out that they were using the xen clocksource, which requires a system call to read the time, and that totally slowed them down. So: use tsc, where you do not have to execute a system call to get the time. The thing I want to say with that is: it might be that you chase down performance bottlenecks and in the end it is something completely different.

So, last slide: understand your systems and your bottlenecks, monitor continuously, and also benchmark from time to time. All right, thank you.