All right, and next up we've got Peter Chubb, who's doing a talk on "Is Linux Getting Slower?". So, Peter, hi.

This talk started when a colleague of mine, Godfrey, who's really interested in statistics, wanted to work out whether his benchmarks, which were really noisy, actually showed a slowdown, a speedup, or stayed the same. So we worked out some statistical tests, and we were going to present that in the main conference. And then he realised his green card was about to expire, so he had to dash off to the US and couldn't come to this conference. So that you don't miss out on all the benchmarking, I'm giving this talk.

It was motivated by a statement that Linus made about a year and a half ago, at LinuxCon in 2009. The point is, it's a bit sad that Linux isn't the streamlined kernel that he originally envisaged. I remember when it was faster than anything else I had, and I could outperform our big Solaris SPARC box with a 486 running at 50 megahertz. That's no longer true.

So how do we measure performance? This is a standard system performance graph. Its shape was analysed by a fellow called Neil Gunther, and if you look on Wikipedia under the Universal Scalability Law you'll find out all about it. Basically, it's got three bits. In this part here, throughput is roughly proportional to the number of jobs you give it: you give it a job, it finishes, and then it goes idle for a bit; you give it another job, it finishes; and so on. So the rate at which jobs complete, the vertical axis, is approximately proportional to the rate at which you give jobs to it.

Up here you start to get a bit of a slowdown. What's happening there is that some critical resource is beginning to be contended. So instead of a job arriving, running, finishing and getting out of there, a job arrives and is queued behind some other job that's on the processor, and it's got to wait for the queueing time as well as the running time. So you end up starting to go along this way here.

Eventually the curve starts coming down again. What's happening there is that your jobs are running at whatever rate the queues will allow, but when a new job arrives it steals some time from a resource that's needed to make forward progress. So the act of putting in more jobs actually steals time from the jobs making forward progress, and you get lower performance.

Okay. So what I wanted to do was run the AIM7 benchmark on as many different kernel versions as I could afford the time for, over a reasonably long period of time. The first thing I had to do was find a machine that would run all the kernels from, say, 2.6.15 up to the present. Second, I had to find a userspace that would run on all those kernels.

The AIM7 benchmark is configurable; it aims to emulate a whole heap of people hammering on a machine, doing different things. The idea is that it emulates end users, and the way I run it is to increase the number of emulated users until the per-user performance starts going down massively. There are two workloads I used: one was a database-like workload, which has a reasonable amount of I/O in it, and the other was a high-system-load benchmark, which doesn't do any disk I/O but does a lot of stuff that spends at least 50% of its time in system time, kernel time. And I also ran LMbench, which is a micro-benchmark suite that tests lots of different things individually.
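An editorial aside, not part of the talk: the curve shape Peter describes is usually written down as Gunther's Universal Scalability Law, which models relative throughput at offered load $N$ with a contention term $\sigma$ (the queueing region) and a coherency term $\kappa$ (the region where adding load makes throughput fall):

```latex
% Gunther's Universal Scalability Law: relative capacity at load N.
% sigma models contention (jobs queueing for a shared resource);
% kappa models coherency cost (new jobs stealing time from running ones).
C(N) = \frac{N}{1 + \sigma (N - 1) + \kappa N (N - 1)}
```

With $\sigma = \kappa = 0$ this is the linear region; $\sigma > 0$ gives the flattening; and $\kappa > 0$ makes the curve peak, at roughly $N = \sqrt{(1-\sigma)/\kappa}$, and then come down, matching the three regions described above.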
Yeah, this was the machine I used. It's a Pentium 4 running at two and a half gigahertz, and it's got a relatively slow disk. The userspace was Debian oldstable. That was really nice because it would run with all those kernels. It doesn't use udev, and that meant I could turn off inotify and get a 10% performance boost. We used the CFQ I/O scheduler. I rebuilt the file system between tests so that disk fragmentation wouldn't be an issue, and as far as possible I used the same kernel configuration for all the kernels we tested.

In the database workload, about 20% of the work hits the disk, and that's important, as we'll see. The kinds of things it does: it copies files, calls sync, creates files, reads files, rereads files, writes to a file opened O_SYNC, takes page faults, does some CPU-intensive stuff, all of those things.

For 2.6.15, this is the way the curve looked. Along the bottom you've got the number of simulated users; up the side you've got the jobs completed per minute. It follows the curve pretty well, with a very steep beginning part, some jaggies there which are reproducible (I don't exactly understand why), and then it starts going down. So that was 2.6.15, from about 2006, four years ago.

2.6.20, about a year later. This time we get a much bigger peak, but you've got to put a lot more work on the machine to get to that peak. So ordinary users, who don't stress their machine at 100% of capacity all the time, are getting less performance. Again you've got jaggies in almost exactly the same places, but lower; I still don't know why. But you've also got this flat bit, which ends at 42. I can't explain that one either, and I'll come to it in a minute. So with 2.6.20 we're getting slightly better performance overall, but for normal use cases we're getting lower performance.

Let's go to 2.6.25. Now we're peaking a lot lower, and for ordinary users it's even worse. Can you see the pattern beginning to emerge? Let's go to 2.6.30, a year later again. This one's about the same as 2.6.20 for ordinary use, but the peak is still lower. 2.6.35, a few months ago, and there you go. So it looks like, from this workload at least, the database workload, Linux is getting slower.

I should note that if you just take two point releases, 2.6.35 and 2.6.36, and run them, the lines overlap to the extent that you can't actually tell there's any slowdown between them. Same if you go from 2.6.15 to 2.6.16: the lines overlap, the standard deviation bars overlap, and you can't really tell without doing some fairly sophisticated statistical tests.

[In answer to a question about the allocator:] Yep, it's SLAB all the way through, because I don't think SLUB exists in 2.6.15. Like I said, I've tried as far as possible to keep the configuration exactly the same. So in these tests I'm not using libata either; I'm still using the old IDE driver, because libata would be yet another layer.

So what's going on? Later kernels are giving me worse performance. I hate this. There was a massive jump between 2.6.20 and 2.6.25. What happened? I don't know. I go through the logs and there are thousands of changes there. Something like that, yeah.
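Another editorial aside: one way to make the "peak moved, and moved down" comparison between kernels concrete is to fit Gunther's model to each kernel's jobs-per-minute curve and compare the fitted parameters. This sketch is not from the talk; the data points are placeholders, and it assumes NumPy and SciPy are available.

```python
# Sketch: fit the Universal Scalability Law to an AIM7-style curve.
# The data points are made-up placeholders, not Peter's measurements.
import numpy as np
from scipy.optimize import curve_fit

def usl(n, lam, sigma, kappa):
    # lam: jobs/min for one user; sigma: contention; kappa: coherency cost.
    return lam * n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

users = np.array([1, 2, 4, 8, 16, 32, 64, 128])                   # emulated users
jobs_per_min = np.array([90, 175, 330, 560, 800, 900, 850, 700])  # placeholders

(lam, sigma, kappa), _ = curve_fit(usl, users, jobs_per_min,
                                   p0=[100.0, 0.01, 1e-4])

# The fitted peak is the load at which throughput tops out; comparing it
# (and lam, the low-load slope) across kernel versions quantifies both the
# "peak is lower" and the "ordinary users get less" observations.
n_peak = np.sqrt((1 - sigma) / kappa)
print(f"lambda={lam:.1f} jobs/min/user, sigma={sigma:.4f}, "
      f"kappa={kappa:.6f}, peak at about {n_peak:.0f} users")
```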
The kinks are all at eight-job boundaries. If we go back to that one: that one's at 42, that one's at 24. So is the number eight embedded in the I/O scheduler somewhere? Oh, I'm using CFQ, and CFQ batches things in eight-request lots. So if the jobs happen to be queueing eight at a time, maybe that's it. So I tried again with deadline, which doesn't do that, and got exactly the same behaviour with exactly the same kinks. I'm hoping you people can tell me what's going on. Something else is happening. What is it?

All right. There are two things going on here: kernel system time, and I/O stuff. So let's try running the high-system-time benchmark. This has no file system activity, but lots of operations, all of which spend at least half their time in the kernel. And we get this curve; that's just the two endpoints, 2.6.15 and 2.6.35. So it's not something in the I/O, the block I/O system, that's going on.

So I ran OProfile. After Linus said it might be cache footprint, I tried counting cache misses, and I see approximately the same number of cache misses for each kernel, so it's not that. This is just cycles. And what we see is that the total kernel time for 2.6.15 is about 1%, and the total kernel time for 2.6.35 is about 8%. We're spending a long time in memset and in sync_inodes_sb, which is the thing that actually syncs the inodes out to the disk: writeback.

So I tried LMbench. LMbench gives you lots of micro-benchmarks, and most things are getting better. I mean, context switching is a bit slower, file creation is a bit slower, file deletion is a bit slower. Page faults are faster, but the standard deviation of 18 means I don't know for sure; I'd need to do a Student's t-test on it to find out whether there's really a difference or not, and I didn't do that.

So everything's getting slower. The problem is somewhere in the VM, the block layer, the VFS, or ext3, and I really suspect it's the VM writeback stuff, but I don't know that for sure. What I'm really alarmed about is that I haven't seen anything written about this by anybody else. Is nobody else running AIM7 with the database workload and comparing kernel versions? And I sort of ran out of time for any further investigation.

So, my caveats. This was just on one system. It's an older system, and it's UP. Other workloads will give different results. And nowadays the kernel runs on far more different machines, and it scales an awful lot better than it did at 2.6.15. At 2.6.15, SGI were just beginning their work to try to scale to 1,024 processors; we go well beyond that now. And we're running with a somewhat different bug set from what was there then.

So where to now? Well, this is a plea to anybody who puts patches into the kernel: I never, never, never want to see the words "it's down in the noise" for a difference unless you've done a Student's t-test or similar. Just check that there really is no statistical difference. And we really need better ways to evaluate performance-versus-feature trade-offs, so that when we put new features in we know what the performance cost is. That's going to mean lots more benchmarking. The problem with benchmarking, though, is that it takes an awful long time: each of those curves took a day and a bit to produce.

So, has anybody got any comments about this? Sorry, can you speak up? Perhaps we do need an automatic system that runs things. There is a group doing that; I can't remember what they're called, but they do try running benchmarks and picking up regressions fairly soon after a new stable kernel is released. I don't think they run this benchmark, though, otherwise we'd be able to see this. And they don't really care about uniprocessors either.
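On the Student's t-test the talk keeps asking for: here is a minimal sketch, not from the talk, assuming SciPy and using made-up run data, of the check that should back up any "it's down in the noise" claim. Welch's variant is used because two kernels needn't have equal variance.

```python
# Sketch: Welch's two-sample t-test for "is kernel B really slower than A?".
# The run data below are illustrative placeholders, not measurements.
from scipy import stats

# Jobs per minute at a fixed load, one value per benchmark run.
runs_2_6_15 = [905, 912, 898, 910, 903, 908]
runs_2_6_35 = [861, 870, 855, 866, 859, 864]

t, p = stats.ttest_ind(runs_2_6_15, runs_2_6_35, equal_var=False)
if p < 0.05:
    print(f"statistically significant difference (t={t:.2f}, p={p:.4f})")
else:
    print(f"'down in the noise' is actually defensible here (p={p:.4f})")
```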
No, that's true. Yeah. And the multiprocessor performance on 2.6.15 sucked anyway, so I wasn't going to try that.

The comment is that because I didn't reconfigure things to use the latest versions of things, like SLUB instead of SLAB, or the newer on-disk format for ext3 compared with the 2.6.15 one, we're unfairly disadvantaging the newer kernels. And I think that's a fair point. It's just that, from the point of view of being able to reproduce the results, it's important to keep things as close as possible to the same. If I'd made massive changes, I'd also have had to change userspace to take advantage of those things, you know, newer glibcs to take advantage of newer interfaces, and as soon as you do that... which they have. Yeah, so SLUB is all internal, and the newer on-disk format for ext3 is all internal, but there are some others that aren't.

So yeah, that was exactly what our talk was going to be about. Godfrey has developed such a tool, and there's a script available, but you need to run the things more like 30 times than three times to get significance, because of the law of large numbers. Well, as far as that bit's concerned, that bit's easy. We produced that and put it up on our wiki about five years ago. There's an LMbench summary Perl script which we modified to make this work: you run your LMbench as many times as you want, you know, for x in `seq 1 30`, then run make see piped through the Perl script, and it will give you the standard deviation and the number of runs for each of the means it produces.

The comment is that some of the distro vendors, such as SUSE, have standardised on an early kernel version, such as 2.6.16, and back-ported and cherry-picked patches from later kernels while making sure their base still works, so that may give some way of finding out where the problem came in.

David's comment is that we're probably hitting the writeback problem that everyone knows about but nobody wants to talk about very much. The problem with that theory is that I'd expect the writeback problem to be introduced at some point and then all the later kernels to sit in more or less the same place. But here we're seeing a continual degradation, so there's something else going on as well. Yeah, that's right. Yes, that's right. I suspect the writeback problem is there, but after that there's still a problem, something else going on. And that's a general rule with performance issues: it's like peeling an onion. You solve one and you get to the next layer of the onion.

Yeah, at the back there? Yes, I'll agree with that: there could be three problems. And it could be that the i-cache footprint is going up, though I couldn't see any evidence for that in the OProfile results. The differences are smallish, but they are reproducible; they're real. The problem is that I haven't put those results up, because I didn't have time to do enough runs to make them statistically significant, but 2.6.36 was even worse again than 2.6.35; it's going backwards. Well, I'm hoping so, I'm hoping so. I hope you found that interesting.
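A final editorial aside, on the "run it 30 times, not three" advice: below is a rough Python reconstruction of the statistics the modified LMbench summary script is described as reporting. It is not the actual Perl script from the wiki; the metric values are placeholders, and the run-count estimate is a simple standard-error argument.

```python
# Sketch of the statistics described above: mean, standard deviation, and a
# crude estimate of how many runs are needed before the standard error of the
# mean falls below a chosen fraction of the mean. Illustrative reconstruction,
# not the actual modified lmbench Perl script.
import math
import statistics

def summarise(samples, target_rel_err=0.01):
    mean = statistics.mean(samples)
    sd = statistics.stdev(samples)  # sample standard deviation
    # The standard error of the mean shrinks as sd/sqrt(n) (law of large
    # numbers), so solve sd/sqrt(n) <= target_rel_err * mean for n.
    n_needed = math.ceil((sd / (target_rel_err * mean)) ** 2)
    return mean, sd, max(n_needed, len(samples))

# Example: page-fault latencies in microseconds from repeated runs (made up).
latencies = [3.1, 3.4, 2.9, 3.6, 3.0, 3.3]
mean, sd, n = summarise(latencies)
print(f"mean={mean:.2f}us sd={sd:.2f}; run about {n} times for 1% precision")
```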