So, yeah, today the talk is actually not so much about file systems, or the ext4 file system specifically, as it is about making programs — specifically the operating system — scale to a large number of CPUs. And in some ways this is an old story. It's something we've certainly done before, but we did it for a while and then we stopped, and it's been long enough that I thought it might be good to do a quick refresher course. So we're going to set the Wayback Machine for the year 2001.

2001 was a good year. It was the time of Linux 2.4, IBM had just announced that it was going to invest a billion dollars in Linux, and Linux had SMP — but depending on who you talked to and what the workload was, a lot of people basically said, yeah, it didn't really scale all that well: four CPUs, maybe, barely. So what does that mean, why does it matter, why should you care? Let's talk a little bit about SMP scalability, which is what you do when one CPU just isn't enough. This was actually a lot more important back in the early 2000s, because CPUs were simply not as powerful as they are today — that was nine years ago, so that's about six doublings of Moore's law. At the time, in the late 90s and in 2000, 2001, people would actually pay top dollar for machines that had 4, 8, 16, 32 Pentiums put together in a box — Sequent sold an awful lot of those — and those machines were really, really expensive.

Why is that the case? The answer is that coordinating a large number of sockets, into which you're going to be plugging actual CPU chips, is really expensive. What if you have two CPUs that want to talk to the same memory location? Either they're both writing the same location, or one is reading while the other one is writing. CPUs generally have caches, so there will be some cache on the CPU socket itself, and if another CPU has written to that location, how do you make sure cache coherency is maintained? Typically you either have one CPU telling all the other CPUs, "I just modified this memory location; if you have it in your cache, get rid of it," or you have CPUs and caching systems that do what's called cache snooping, which is to say they listen on the bus, and when they see a write transaction go by that modifies a memory location they have in their cache, they invalidate it. That's an awful lot of extra machinery you need to put in there, and in fact this interconnect — how you make the CPUs talk to each other and to memory — is critically important, because if the interconnect isn't fast, your whole system won't be very fast. That's true for SMP machines, but it's also true for those of you who might have tried to build clusters: the interconnect fabric between individual machines in a cluster is very often the thing that prevents a cluster from being as effective as you think it should be.

Now, there were various people who would try various things — I called it cheating on the slide, which isn't entirely fair, because if the machine is too expensive for what you want to do, you will simply not do it, or find some other way of doing it. One thing you can do to make things cheaper is to accept that some memory locations will be much easier to access from some CPUs than from others.
It may only take one quantum of time to access a memory location if it's connected to your CPU, but if it's attached to another CPU that's far off, maybe it takes ten times that quantum of time to actually get at it. If that's true, then your operating system had better know which memory locations are close to which CPUs so it can put things in the right place, and that's actually pretty complicated — it's really hard to get right. Sequent did a lot of work in that area, and later on SGI, with their Altix machines, were even more extreme: they had machines with 1,024 CPUs, and if you needed to access something and you were on the wrong CPU, you would wait a very, very long time. So that's NUMA. The bottom line with all of this is that a machine with four CPU sockets is going to cost a lot more than simply having four single-CPU machines.

And there's a funny thing, which is that customers get really, really irritated when they spend a huge amount of money on a machine that says it has 16 CPUs right there on the box, and they don't actually get all the CPU power they paid for. Very often what happens is that some clueless salesperson sells a 32-socket machine that is heavily NUMA to some big, important customer without asking what the customer actually plans to use the box for; then they buy it, it doesn't work as expected, and the customer doesn't blame himself, the customer doesn't blame his application programmers — the customer blames the salesperson who sold him what he now considers a POS.

So what do you do? How do we actually measure scalability? The first thing is to run one or more benchmarks on a single-CPU system, and there are a large number of benchmarks — these were some of the ones used in the early 2000s: 2001, 2002, 2003. This is actually a really important point: different benchmarks measure different things. Some benchmarks are really more about the Java virtual machine and how the JVM talks to the kernel than anything else — VolanoMark, for example. Others are really stressing the network; some are stressing the file system. So scalability as a single number means very little: what works really well for one particular program, one particular benchmark, is going to really, really suck on some other one. But the basic procedure is: take some benchmark, run it on a reference system with one CPU — or maybe you boot a single-CPU operating system image — then run the exact same benchmark on a system with, say, for the sake of argument, eight CPUs, and divide the score you get on eight CPUs by the score you get on one CPU (I actually got this backwards on the slides, so reverse that). What you get is a scalability number for a particular kernel version on a particular benchmark, and you might say, well, this particular version scales to 12 out of 16 CPUs on the foo-blah benchmark. All of that context is important: if you don't mention the benchmark, the number really doesn't mean much. So scalability is hard to pin down — it depends on the workload, on the benchmark — but on some benchmarks, simply getting 12 out of 16 CPUs might be considered really good, excellent even.
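To spell out the arithmetic being described here (the notation is mine, not from the slides), the scaling factor on N CPUs and the corresponding efficiency are

S(N) = \frac{\mathrm{Score}(N)}{\mathrm{Score}(1)}, \qquad E(N) = \frac{S(N)}{N},

so a kernel that "scales to 12 out of 16 CPUs" has S(16) = 12 and an efficiency of 12/16 = 75%.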
But think about what that means: it means we're only using 75% of the theoretical power of the machine, for a machine that costs way more than 16 times a single-CPU server. A lot of people weren't really expecting that. And if you go back to the early 2000s, on a lot of these benchmarks Linux was barely scaling to four CPUs — by which I mean it was well under, in some cases, two or three times the single-CPU numbers — and there were some benchmarks where it was actually slower on a four-CPU system than on a single-CPU system, which is what you call negative scalability, which is really unfortunate.

So in 2001 there was an effort called the Linux Scalability Effort. It was spearheaded by folks at IBM's Linux Technology Center, but there were a lot of other companies involved — SGI, Intel, VA Linux, the University of Michigan CITI; I think HP may have participated a little bit — and they did weekly conference calls. The minutes of those are still up on lse.sourceforge.net if you want to go look; they're all archived there. And this is the key methodology for how you improve the scalability of a system — and again, it's been a while since we've done this in a concentrated way, so I think it's good to remind people that you can use a regimented approach to make progress very quickly. What they did nine years ago was have regular benchmarks run by a performance team: they would run the same benchmarks every week, maybe every month, and you would get progress you could put up on a chart and see how well you were doing — because every change you make changes the bottlenecks and changes what you need to do next. If you don't have that feedback, you don't know what you need to improve next; I'll give you a demonstration of that in a little bit. So you take a look at the profiles you get — lock profiles, CPU utilization — you find and fix the bottlenecks you have, and then you simply do that again and again and again until you've fixed all the really embarrassing ones.

Between 2001 and, I don't know, mid-2003, this was being done very regularly, and then people declared victory and everyone went home. That's not entirely true — SGI kept on doing their Altix systems, but those were really specialized; you couldn't take some arbitrary program, put it on a 1,024-CPU or 256-CPU Altix machine, and expect it to perform 200 times or a thousand times better. It was very specialized. But for the rest of the world, people just sort of said, yeah, it's good enough. And what did that mean? It was definitely really, really good; it certainly wasn't perfect. We were getting pretty good on eight CPUs — you could actually use most of the CPU power; on 16 CPUs it wasn't as great, but it was still decent; and it was acceptable on 32-CPU machines. Again, all of this has to be qualified by the benchmark: there are some benchmarks that do really well. How many of you remember Anton Blanchard demonstrating a kernel compile in 30 seconds on a cold system? Some of you remember that, right? That was a case of a benchmark that's really easy to do well on, because you can distribute the work across a large number of CPUs.
Other benchmarks are not going to be that good. So it might be interesting — some of you may be asking — why did people stop? We certainly could have done better. I think the answer is that Linux did really, really well in a certain segment of the market, and not necessarily as well on some of the other platforms. On x86 it did really well — the LAMP stack, web servers, scale-out. But it turns out, funny thing, at least back in the early 2000s, the people who were going to spend hundreds of thousands of dollars on high-end SPARC or POWER servers tended to prefer the legacy operating systems, at least in the enterprise market. There are a lot of reasons why: some of it was just that they had a whole bunch of system administrators who were used to how Solaris or AIX worked; in some cases it was because certain enterprise features, like multipathing, weren't really ready for prime time back then — and if you're going to have a really high-end system, you're probably going to have a really high-end I/O infrastructure. For a variety of reasons, they weren't buying the machines with really large numbers of CPUs and putting Linux on them, and at that time very few x86 servers had more than 8 to 16 CPUs, because if you had more, you generally tended to use other operating systems. So there was less need to scale to that.

As I've already said, Linux really is the king of scale-out computing, where you have a large number of servers all working together, and for a lot of problems that makes a lot of sense. If you're running some sort of web service, it may be that the best thing to do is to have a whole bunch of web servers doing the front-end work, talking to individual web browsers, and then each of those web servers talks to a single gigundo database. That one giant database maybe runs Linux, but it could just as easily run some other big Unix system; it will be really, really big, and all the small front-end systems will be Linux — the database might be Linux, might be something else.

So I am now going to dismiss four or five years of history with a wave of a single slide and say, yeah, that was sort of the end of the scalability story. That's highly simplified, but it's also the case that that's a very, very long time in this industry — I don't remember who said it, but it's true that in the computer world two years is infinity, so four years is a really long time — and a lot of people basically said, oh yeah, Linux scales to large numbers of CPUs, we're good, we don't need to worry about it. During that time we also saw Linux getting used a whole lot more in the embedded and mobile markets, and those were traditionally smaller CPUs — you generally didn't have SMP in the embedded and mobile market (that's starting to change, by the way). And then an interesting thing started happening about two or three years ago, which is that CPU frequency stopped doubling every 12 to 18 months. We hit a wall on frequency scaling, so now what's happening is that people are putting more cores on a socket — you've got two, four, eight cores on a socket — and these machines are becoming mainstream. I actually have, as my desktop at work, a dual-socket, hex-core system — that's 12 cores total — and it really wasn't
all that expensive; these are systems you can buy today without spending tens of thousands of dollars. So here we are in 2011, and scalability has started to matter again. A server system with four sockets really isn't that rare — it'll be a bit more expensive than a single server blade, say, but they're certainly out there — and if you've got eight cores per socket, with more coming, that's 32 CPUs as a common configuration for Linux. So it's time for kernel programmers to rediscover the lessons of scalability tuning, and it's also time for application programmers to start thinking really hard about multi-threaded programming, for those of you who haven't started already — because last year Intel released demonstration systems, I don't think they're in common use quite yet, with 32 cores on a system: Knights Ferry. Sound good to you? Sounds great to me. Can we actually use it? That's the interesting question.

So now let's actually talk about file systems. ext3 had, for a long time, what I called "good enough" scalability, because the dirty little secret is that most workloads — certainly most x86 workloads — don't really stress the file system; historically you ended up hitting other bottlenecks first. Now a lot of those bottlenecks are getting knocked out of the way, so it's starting to become a lot more obvious, and of course the hardware is changing. The other thing that happened, particularly in the enterprise market, was that the few machines that tended to be really big iron were database systems, and enterprise databases generally use direct I/O to files that are already preallocated — this is something Oracle does, something DB2 does — and it turns out ext3 is really, really good for that case. It sucks in almost every other case, but in that one case it was actually pretty good.

Now, it was certainly true that ext3 didn't do well in head-to-head benchmark competitions against other file systems — and that's an understatement; there are some really good file systems out there that do a much better job if you have a very file-system-intensive workload. But the funny thing was that a lot of system administrators really didn't care, because for their workload it wasn't the bottleneck: ext3 worked, and in some ways the fact that ext3 was a simpler file system meant that it was easier for them to fix things when they went wrong, and it had fairly good tools for automatic repair and for poking at it with debugfs. Funny thing — those things are also important, not just performance. But ext4 has come along, and it's long past time for us to worry about scalability in ext4.

So this story begins in April 2010. The IBM real-time team was trying to improve file system performance with CONFIG_PREEMPT_RT enabled, and they noticed a little, minor problem when they ran the dbench benchmark, which was that they were spending an awful lot of time on spin locks — 90% of their time, in fact — and roughly two-thirds of that time was in jbd2_journal_start and about a third in jbd2_journal_stop, and those routines get used an awful lot. Fortunately it was pretty easy to figure out what was going on: not only did I know the code, but the stuff is relatively well documented in the header files. Safety tip: if you're going to write multi-threaded programs and you're going to be using lots of locks, document two things really well. For each field in a data structure that might have concurrent access, document which lock is supposed to protect that field; and also document the locking order — what order do you have to take the locks in if you're taking more than one? There are a lot of locks in the kernel where this is not obvious, and you basically have to stare at the code and hope you get it right when you modify said code; but if it's well documented, it's actually pretty easy. So you look at the header file and you see, okay, there's a j_state_lock that protects fields in the journal superblock, and a t_handle_lock that protects fields in the transaction handle structure.
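To make that concrete, here's a rough sketch of what that kind of header-file documentation looks like. This is my own illustrative C, loosely modeled on the jbd2 journal and transaction structures — the field names and layout here are simplified placeholders, not the actual kernel header:

```c
#include <linux/spinlock.h>

/* Illustrative sketch only -- loosely modeled on the jbd2 structures. */
struct journal_sketch {
	/*
	 * j_state_lock: protects the journal "superblock" state below.
	 * Lock ordering: j_state_lock must be taken BEFORE t_handle_lock.
	 */
	spinlock_t	j_state_lock;
	int		j_free_credits;		/* [j_state_lock] blocks left in the journal */
	int		j_barrier_count;	/* [j_state_lock] */
};

struct transaction_sketch {
	/*
	 * t_handle_lock: protects per-transaction handle accounting.
	 * Never take j_state_lock while already holding t_handle_lock.
	 */
	spinlock_t	t_handle_lock;
	int		t_outstanding_credits;	/* [t_handle_lock] */
	int		t_updates;		/* [t_handle_lock] handles in flight */
};
```

The point is simply that every field carries a note saying which lock covers it, and the comment on the lock itself spells out the ordering.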
And as it turns out, jbd2_journal_start and jbd2_journal_stop were taking both of those locks for every single transaction. Now, what are these things actually used for? Well, file system transactions — transactions in general — are expensive, so what many journaling file systems do is group multiple file system operations into a single giant transaction, because if you did a transaction for every single tiny operation — every stat, every chmod, every unlink — it would get really, really expensive. So you group them together, and every five seconds you do a commit; you might do it sooner if a program explicitly requests it via an fsync, or if the transaction starts getting full, or the journal starts getting full. Every one of these file system operations, whether it's a chmod or an unlink, is bracketed by a journal start and a journal stop call, and in the journal start call you pass in a worst-case estimate of how many blocks you're going to modify, so the code can check whether a new transaction needs to be started. In order to check whether there's enough space for this new operation in the current transaction, you have to look at how many blocks have already been reserved in that transaction as well as how many free credits are still available in the journal — which is why we have to take the locks.
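As a rough illustration of that bracketing, here's a simplified sketch in the style of the jbd2 API as it looked in that era. The operation and the credit count are made up for illustration, and error handling and the real ext4 helpers are omitted:

```c
#include <linux/jbd2.h>
#include <linux/err.h>

/* Sketch: how a file system operation brackets its metadata updates. */
static int example_metadata_op(journal_t *journal, struct buffer_head *bh)
{
	handle_t *handle;
	int err;

	/* Worst-case estimate: we might dirty up to 2 metadata blocks. */
	handle = jbd2_journal_start(journal, 2);
	if (IS_ERR(handle))
		return PTR_ERR(handle);

	/* ... modify inode / bitmap blocks under this handle ... */
	err = jbd2_journal_get_write_access(handle, bh);
	if (!err)
		err = jbd2_journal_dirty_metadata(handle, bh);

	/* Gives back the credits we didn't use; may kick off a commit. */
	jbd2_journal_stop(handle);
	return err;
}
```

Every little operation runs through a start/stop pair like this, which is why those two routines show up so heavily in the profile.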
So the first thing that I noticed, very quickly, was that jbd2_journal_stop was taking the j_state_lock spin lock, but I couldn't find any field it was using that was actually protected by that lock — and this was helped a lot by the fact that everything was well documented about which lock protects what. So it's like, well, if we don't actually need the lock, we should just remove it. And simply removing that lock — I couldn't find the exact numbers, but the real-time team people were really happy, because it made a huge difference for them, just removing a lock that wasn't necessary.

Then Eric Whitney from HP said, well, I have this really nice big 48-core AMD system with hardware RAID, let me do some benchmarks for you — he just offered it. So these are the results of that first patch: the dark green is ext4, the light green is with the patch applied, and the blue bar is XFS, just so we can see that XFS does a lot better than ext3 or ext4. You can see that just removing a lock we didn't need — this was literally something like a four-line change, because I was removing it in a couple of places and there were some exit paths, but effectively it was just removing an unneeded lock that happened to be one we took a lot — caused an immediate throughput improvement once we started putting 48 threads on 48 cores. It really helped a lot, and you can see it also reduced CPU utilization — not by a huge amount, but it definitely helped. That was with large file creates; with random writes we saw a similar improvement — it turns out XFS isn't stomping on ext4 quite so badly there, but still, an improvement.

So the next question was: can we do better? One of the things Eric also did for us was use a much more powerful profiling tool than a simple CPU profile — he used lockstat. This slide shows how you turn lockstat on and off; it's pretty simple, so I'm not going to dwell on it. And this is what lockstat output actually looks like, which is really hard to read — it's an eye chart — so for the rest of the presentation I'll be showing you the numbers in a somewhat easier-to-understand form, but understand that I've reformatted it; the previous slide is what lockstat output really looks like. For each lock, ordered by which lock is hurting the most, we can see: contention bounces, the number of times you had lock contention across two different CPUs — so the lock was really expensive because it had to bounce back and forth; the number of lock contentions in general; wait-time max, which is the maximum time we spent waiting for it, in nanoseconds (you can see that number is just way bigger than every other number up there); wait-time total, the total amount of time we waited for it over that benchmark run; acquisitions, how many times we actually grabbed the lock; and hold time, how long we actually held it. You can see that even though the lock was only held for a certain amount of time, when you take into account how much time was spent waiting to take it, that can be a hell of a lot more than the time the lock was actually held. So that's a quick look at how you might interpret lockstat numbers. Then there are details for each of the locks, showing how many times a particular lock contention was caused by a particular function: the function start_this_handle was responsible for the bulk of the cases where we tried to grab the lock and there was contention, followed by jbd2_log_start_commit; and the bottom set of numbers shows which functions were holding the lock at the time of contention. You can see that the two lists generally have a lot in common — start_this_handle is a huge part of the guilty party, as is log_start_commit. For t_handle_lock you see a similar set of details: journal start and stop are a big part of that, because while I removed the j_state_lock from journal_stop, the t_handle_lock was still there, so it's still really high up the list.

So what do we do? The first thing we did after he gave me that report: there were a huge number of places where we were taking that lock just to increment a single statistic or a single credit variable, and in fact there was no need for an atomic modification across multiple fields of the data structure — which means we could just use an atomic_t instead of taking a lock and then modifying the field. (One open question is that right now we always do the statistics gathering, even though it's pretty rare that anyone queries the statistics, so maybe we should take that out; and then there's the accounting information.) The other trick we used was that in most places we only needed to protect the data for reading, and we could allow multiple CPUs to read at the same time — it was rare that we actually needed to modify it — so we could use a read-write lock. Those two changes together meant that we could now start and stop handles in parallel; except for what you do on a commit, we're fine.
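Here's a stripped-down sketch of those two tricks. This is illustrative kernel-style C, not the actual jbd2 patch — the structure and field names are made up:

```c
#include <linux/atomic.h>
#include <linux/spinlock.h>

/* Before: every update took the spin lock, even for a single counter. */
struct txn_before {
	spinlock_t	lock;
	unsigned long	n_handles;	/* [lock] statistics */
	int		credits;	/* [lock] */
};

/*
 * After: single-word counters become atomics, and read-mostly state moves
 * under a read/write lock so many readers can proceed in parallel.
 */
struct txn_after {
	atomic_t	n_handles;	/* no lock needed for increments */
	atomic_t	credits;
	rwlock_t	state_lock;	/* write-held only on commit/state change */
	int		t_state;	/* [state_lock] */
};

static void handle_start_sketch(struct txn_after *t)
{
	/* Was: spin_lock(); n_handles++; spin_unlock(); */
	atomic_inc(&t->n_handles);

	/* Many CPUs can examine the transaction state at the same time. */
	read_lock(&t->state_lock);
	/* ... check t_state, remaining journal space, etc. ... */
	read_unlock(&t->state_lock);
}
```

The writer side (the commit path) would take write_lock(&state_lock), which is the only place exclusion is still needed.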
And you can see what a huge difference that actually made. Here again the dark green is ext4, the light green is with the patches, and the cross-hatched bars are the no-journal mode — which Eric actually ran by accident, but I said, well, that's an interesting result, so let's keep it. And again you can see it makes a huge difference in the amount of CPU that's used, and this is for random writes. The lockstat numbers now show something really cool, which is that the top lock causing problems is no longer the journal lock — it's the block I/O queue lock, and for the block I/O queue lock it's because we're submitting lots and lots of requests. The j_state_lock is still in there — we still need it — but it's no longer the primary problem. That's really good: it means it's no longer JBD2 that's the bottleneck, it's the block I/O layer. And it turns out that what's left on the JBD2 side is going to be really hard to fix, because what's left is something that's kind of unavoidable: when the transaction starts getting full, we finally have to say, okay, it's time to close this transaction and start a new one, and in order to do that we have to wait for all the other pending micro-operations — the handles — to complete. That's sort of unavoidable; there's a certain amount of waiting for the other handles that you just have to do. Fixing it might be possible, but it's going to be quite hard.

The real problem now is how ext4 is submitting its blocks to the block I/O layer, and to cut to the chase, this is the problem: our buffered writes were basically being sent down 4k at a time to the bio layer. Reads were actually done right — reads use mpage_readpages — but because some of the time when we're writing the blocks we also have to allocate them, we couldn't use the mpage functions, and we did our own stupid thing, which was to simply submit each 4k block one at a time. Now, this didn't matter as far as the disk was concerned, because the block queue layer would merge those 4k writes back together, but it wastes a whole bunch of CPU and a lot of locking overhead, as you've seen. There are other things about it that are kind of nasty: it makes block traces really large, and it makes I/O statistics confusing, because depending on which device you look at, the numbers can be either before or after everything has been merged back together. And it's just kind of a waste: we spend all this time assembling things together so we can emit a single contiguous write, and then what's the first thing we do? Like a Ginsu knife, we cut it up into little tiny writes and shove them down one at a time. So that was kind of annoying.

I'm running out of time and I want to leave time for questions, so I'm moving a little quickly here. What we did was: I wrote a drop-in replacement for block_write_full_page that accumulates writes to pages in a data structure, and then, when we're done, calls an I/O submission routine that sends it all down to the block I/O layer as a single write. It required a massive overhaul and cleanup of the write submission path — or at least half of it — but that's okay, it desperately needed the cleanup anyway.
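To show the shape of that change, here's a heavily simplified sketch, assuming the older submit_bio(rw, bio) calling convention and bio fields of that era; the real ext4 writeback code deals with block allocation, error paths, and partial pages that this ignores:

```c
#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/mm.h>

/*
 * Instead of one submit_bio() per 4k page, accumulate contiguous pages
 * into a single bio and submit it once.
 */
static void writeback_pages_batched(struct block_device *bdev,
				    sector_t first_sector,
				    struct page **pages, int nr_pages,
				    bio_end_io_t *end_io)
{
	struct bio *bio = bio_alloc(GFP_NOFS, nr_pages);
	int i;

	bio->bi_bdev = bdev;
	bio->bi_sector = first_sector;
	bio->bi_end_io = end_io;

	for (i = 0; i < nr_pages; i++) {
		/* If this bio is full, send it and start a new one. */
		if (bio_add_page(bio, pages[i], PAGE_SIZE, 0) < PAGE_SIZE) {
			submit_bio(WRITE, bio);
			bio = bio_alloc(GFP_NOFS, nr_pages - i);
			bio->bi_bdev = bdev;
			bio->bi_sector = first_sector + (i << (PAGE_SHIFT - 9));
			bio->bi_end_io = end_io;
			bio_add_page(bio, pages[i], PAGE_SIZE, 0); /* fresh bio, will fit */
		}
	}
	submit_bio(WRITE, bio);	/* one write for the whole contiguous run */
}
```

The win isn't at the disk — the elevator was already merging the 4k pieces — it's all the per-request queue locking and bookkeeping that no longer happens once per page.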
And here in the results you can see that on large file creates — this time we ran it both with the journal and without — once the patch was applied (that's the cross-hatched bars) it makes a huge difference as far as throughput is concerned. The CPU utilization numbers are a little hinky, because we're using more CPU, but that's because we're actually pushing more writes through. If you look at random writes, one of the interesting things is that for ext4 the patches don't make as much of a difference — ext3 is down here — but on some of these workloads, on this particular hardware configuration, for the first time we were actually beating XFS, which was amazing to me; it was not something I anticipated. I hear XFS has made some more improvements since then, so I don't think we're beating them now, but that's where we're at.

Unfortunately the work is not quite done, because someone found a bug: with dm-crypt, which is a very slow device, plus PostgreSQL, we got data corruption. So there's a race we still need to fix — we're working on it, and I'm pretty sure we'll be able to fix it. For now that last enhancement has been disabled by default, because it can cause data corruption — it's pretty rare; I was running it for a while and didn't notice — but you can turn it on again with a mount option so we can test it and try to fix it. Now all we have to do is find the bug.

So, to sum up real quickly: I hope this talk is a good reminder that we really do need to pay attention to SMP scalability again, and that does mean thinking really hard about multi-threading, which is a lot harder to debug — lots of potential for race conditions, as that last bug showed — and performance tuning is kind of tricky. Some of the techniques we used were atomic variables, read-write locking, and batching up work. One interesting thing to note is that a lot of what I've talked about here applies to user space as well: you can use atomic types if you're willing to sacrifice a little bit of portability, because you end up having to import headers that do assembly language; pthread mutexes are actually pretty fast, because a lot of optimization work has gone into them; and one obvious thing — don't use spin locks, because they don't work well in user-space code.
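For the user-space case, here's a minimal sketch of what that looks like. It uses the GCC __sync builtins that were the common route in that era (nowadays C11's <stdatomic.h> is the portable option); the counter and state names are just for illustration:

```c
#include <pthread.h>
#include <stdio.h>

static unsigned long n_ops;				/* statistic: a lock-free bump is fine */
static pthread_mutex_t state_mu = PTHREAD_MUTEX_INITIALIZER;
static int state;					/* real shared state, protected by state_mu */

static void *worker(void *arg)
{
	for (int i = 0; i < 1000000; i++) {
		/* Cheap counter update: no lock, just an atomic add. */
		__sync_fetch_and_add(&n_ops, 1);

		/* Multi-field shared state still goes under a mutex --
		 * futex-backed pthread mutexes are cheap when uncontended. */
		pthread_mutex_lock(&state_mu);
		state++;
		pthread_mutex_unlock(&state_mu);
	}
	return NULL;
}

int main(void)
{
	pthread_t t[4];
	for (int i = 0; i < 4; i++)
		pthread_create(&t[i], NULL, worker, NULL);
	for (int i = 0; i < 4; i++)
		pthread_join(t[i], NULL);
	printf("ops = %lu, state = %d\n", n_ops, state);
	return 0;
}
```

Build with `gcc -O2 -pthread`; the same split — atomics for counters, a mutex for anything multi-field — is the user-space analogue of the kernel changes above.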
Two final things I want to point out — I actually found out about them while researching this talk. Valgrind has a new tool in its latest version called DRD that will automatically detect data races in user-space code, so if you're writing multi-threaded code, definitely take a look at that. And Lennart has a tool called mutrace, which does something a lot like lockstat but for mutexes in user-space code, so if you're writing user-space code and you want to do performance tuning, mutrace might be a good tool to look at. So with that, thank you, and hopefully we'll have time for a few questions.

Audience: I have a question, Ted. Did using atomic variables for all of your stats have a significant impact on single-threaded performance?

Ted: If you look at the numbers, not really. Like I said, I'm seriously thinking about turning them off by default, since most of the time people don't look at the stats anyway. I'll try to measure it, but I doubt it.

Audience: Just on that — we've cheated at some points: if you don't need them exact, just don't use atomic variables, and you get something that's approximate. But the two things we find blocking, at least on ext3, are concurrent direct I/O writes to the same file hitting contention, and everything blocking whenever anything runs fsync — which matters when you have transactional database systems, and we do.

Ted: Yeah, we're still working on those.

Audience: Since in networking we always have the same stats problem: have you looked into per-CPU stats, especially if it's per file system?

Ted: I did look at per-CPU stats — we are actually using per-CPU stats for some things. The problem is that per-CPU stats can be quite heavyweight: they can end up taking a huge amount of space, so we think hard before deciding whether we want the stats at all. The people who did the IPv6 SNMP MIB decided it was wonderful to have three copies of every stat per CPU — memory's cheap, right?

Audience: Per-CPU also hurts when you've got 4,000 of them.
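For reference, this is roughly what the kernel's per-CPU counter helper looks like in use — a sketch only, and the init signature has varied between kernel versions, so treat the exact arguments as approximate:

```c
#include <linux/percpu_counter.h>

struct fs_stats_sketch {
	struct percpu_counter writes;	/* one slot per CPU plus a shared batch value */
};

static int stats_init(struct fs_stats_sketch *s)
{
	/* Allocates a per-CPU array -- this is the space cost being discussed. */
	return percpu_counter_init(&s->writes, 0);
}

static void stats_account_write(struct fs_stats_sketch *s)
{
	percpu_counter_add(&s->writes, 1);	/* cheap, no cross-CPU cache-line bouncing */
}

static s64 stats_read_writes(struct fs_stats_sketch *s)
{
	/* The expensive part: an exact read means touching every CPU's slot. */
	return percpu_counter_sum(&s->writes);
}
```

The trade-off is exactly the one raised in the question: updates are nearly free, but the memory footprint and the cost of an exact read both grow with the number of CPUs.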
Audience: The other, obnoxious, point: improving ext4 is wonderful, but if I'm building a system, why shouldn't I just go with XFS, if they're also improving performance and starting out from a better position?

Ted: There are definitely cases where XFS is still going to be much better than ext4 — for RAID systems in particular they have a bunch of optimizations that aren't in ext4. What I'm finding is that there are certain applications where ext4 is still better, either because the tools are still a bit more advanced, or — inside Google we're using it because we can turn off the journal: we keep multiple copies of our data files around for redundancy in case a disk or a server dies, and it turns out the journal is just overhead we don't need, given everything else we do to ensure consistency at the cluster file system level. So we're using ext4 with the journal disabled, and that's something you can't really do with XFS — or, say, Btrfs, which is always going to be doing copy-on-write allocation whether or not you need it. So it's going to vary. A lot of people will also decide that they don't actually need the file system bandwidth, and they're familiar with ext4 and understand it, so they use it for that reason. There are lots of different reasons people choose file systems, and I don't want to say that any one of them is right or wrong. If you have a file system workload that uses RAID and really needs the I/O bandwidth, XFS might really be the right solution for you.

Audience: Have you found that Nick Piggin's VFS scalability changes helped much on these benchmarks?

Ted: For the benchmarks we were doing here — and thanks for mentioning Nick, that was one of the things I meant to do and forgot — there's lots of other scalability work that has started up in the last year, one piece of which is Nick Piggin's scalability work at the VFS layer. The benchmarks we were doing here used FFSB, the Flexible File System Benchmark, and we weren't really stressing metadata-intensive operations. A lot of this work got kicked off by the real-time team, because they noticed dbench was hurting really, really badly; once we fixed that one lock, all the remaining problems they were seeing were problems that needed to be addressed with Nick's patches. That probably would have been the end of the story, except that Eric Whitney from HP popped out of the woodwork and started feeding me benchmark numbers, and it was when I saw those numbers that I saw what work we still needed to do — and could actually do it. One of the reasons I wanted to give this talk was to point out why having a good benchmarking team — or in this case, one good benchmarking person — is in some ways just as important as the developer, because without that, the developer is basically blind. He knows the tools, he could turn numbers around really quickly, and once he gave me the data, the amount of time I spent writing these patches was pretty small compared to the amount of time Eric spent producing the benchmark numbers.

Any more questions? I'll take that as a no — let's put our hands together.