OK, excellent, so I'd like to make a start. My name is Kris Kennaway. I'm a FreeBSD developer; I've been involved in FreeBSD for about 10 years, in a number of capacities. Most recently, for the past few years, I've been working a lot on system performance, so I've done a lot of work on benchmarking and profiling, and I think we've made a lot of progress in that area in recent years, especially with FreeBSD 7. So today I'm going to be talking about some of the lessons. OK, how's that? Better? We're still getting echo. All right.

So I'm going to talk about some of the ways in which you, as a FreeBSD user, can go about analyzing the workloads on your system with a view to improving performance. I'm targeting my talk at power users. Hopefully a lot of the methodology will be applicable not just to users of FreeBSD systems but more generally, but as a FreeBSD developer, this is what I'm focusing on. Some of the more advanced techniques I'll talk about towards the end are more focused on kernel performance, so if you're not afraid to go and look at some kernel code, they will definitely come in handy in your own environments. But even if you're not comfortable doing that, the kinds of information you'll get out of running these commands will be useful to pass on to kernel developers if you need help with your workload.

OK, so there are really four parts. Firstly, I'll talk about why it's important to understand what your system is actually doing: you're not going to get anywhere trying to improve performance if you don't have a detailed understanding of what your system is actually doing. As part of that, there are a number of tools available on FreeBSD systems to investigate those aspects. I'll talk towards the end about some tuning advice that applies in some situations. And finally, if I have time, I'll say some more general things about benchmarking, which turns out to be quite a bit more difficult than people often realize.

The important thing to realize when we're talking about performance is that it isn't a meaningful concept unless we qualify it. It only makes sense to talk about the performance of a particular workload, with respect to a set of metrics. So before you can go about improving performance, you first have to know what you mean by performance, and the first step is to characterize exactly what your system is doing on a particular workload and which aspects of its operation you actually care about. Depending on the answers to those questions, the way you go about improving things will vary. For example, if you have a web server, you may care about bulk throughput (how many queries per second can the server handle?), or you may care about latency (how quickly is each individual query handled?). These are different things that often require different approaches.

So, some of the ways in which workloads can interact with systems, and there are lots of them: the CPU use patterns of the workload can vary; they may involve disk IO; they may talk to the network; they may be talking to other hardware devices. Applications can be misconfigured, and this is actually quite a common source of performance problems.
The application is just not configured properly, and it's easy to misidentify this class of problem as an operating system problem or a hardware problem when it's actually just a configuration issue. Ultimately you're going to run into limitations of the hardware, assuming you can push things hard enough, so it's important to have an understanding of what your hardware is actually capable of. Workloads will usually interact with the kernel in some way, either through system calls or through other operations. Multi-threaded workloads are common these days, and there are a lot of badly written multi-threaded applications with high lock contention inside the application, so there can be application design problems that may or may not be possible to work around without code changes. And finally, a problem that's often overlooked: if your system is part of a pipeline, part of a set of systems that are all handing work to each other, any given system may in fact not be getting enough work to do. It's not unheard of for people to come to us and say, help, what's going on here, and the answer is really that the system isn't busy enough. You're not giving it enough work, and that might be because of a bottleneck elsewhere in the pipeline where work isn't being fed in properly, or it could be another configuration issue. So don't discount that kind of problem. Typically, at least one of these issues turns out to be the limiting factor.

The way I like to approach studying this kind of issue is to start at a very high level and then move down to detailed investigation as I get pointed in the right direction. And a very good tool for that is top. I should apologize: some of these tools are very standard, and I'm sure a lot of you know in great detail how they work, but I hope I'll at least present some information you haven't been aware of. Top is probably the tool we're all most familiar with, but it's a great way of getting an overview of what's going on in the system. It shows you, at a very high level, what the kernel is doing. It tells you, for example, if your system is paging, and this is going to be the kiss of death: if your workload is sized such that it cannot fit in main memory, then either transiently or in a steady state it's going to be writing to and from swap, and any time that happens, any time you get a disk involved in the critical path of your workload, things are going to slow right down. So this is again something that's easy to overlook unless you happen to think about it.

Top shows you if the system is spending a lot of time in the kernel or processing interrupts, and then it breaks things down by thread, so you can see which processes and which threads are using CPU. It also shows you, for processes that are running inside the kernel, what they're doing, or where they're blocked: if they're blocked waiting for a resource, you can get an overview of what's going on there. Unfortunately, this involves rather cryptic abbreviations, and at the moment there's no good reference that I know of for FreeBSD that breaks down what the common wait channels, as they're called, actually mean. But typical ones you might see are things like biord, biowr and wdrain, which tell you that the process is blocked waiting for some kind of disk IO, either a read or a write.
sbwait shows up quite commonly and is usually not a performance issue: it just says that a socket is waiting for input, waiting for IO, which is the typical state of a network server when it's not busy. There are wait channels, ucond and umtx for example, that tell you a threaded application is waiting on a lock or a condition variable. There are a lot of these, and unfortunately the only way to really find out what they mean is to go and grep the kernel tree, but as you get experienced with looking at them, you start to recognize which ones stand out as being important. So top is the usual first step for seeing what's going on; you'll typically spot a problem and then dig further.

A related facility that exists in FreeBSD is the ability to ask any foreground process what it's doing. This is something that's missing from other operating systems; Linux doesn't have it as far as I know, and I would love it if it did. By default, the Ctrl-T key sends SIGINFO to the foreground process, and the TTY system has a default handler for SIGINFO, so if you've run a process and you want to know what's going on, maybe it's not giving the expected output or it's taking too long, you can just press Ctrl-T and it will tell you, in this case, that the load average is 0.04, the foreground command is foo, this is the PID, and, typically the most interesting field, the wait channel. In this case it's telling me that the application was waiting on an NFS request, so it was actually doing NFS IO; maybe I had expected it to be running against local disk, but it was actually doing IO to NFS, and that's why it was taking a long time. It also shows you the current CPU use in userland and in the kernel, the CPU usage of the process, and its resident memory size. This is all information you can get from other sources, but having it available instantly from Ctrl-T is invaluable for figuring out what's going on, especially for commands run from the shell.

OK, so this is what top looks like on FreeBSD. It's got the standard kernel summary at the top, which shows things like the load average, how many processes are running, how many are blocked and so on, and what percentage of time is being spent in the system. In the example I've shown here, it should stand out immediately that the machine is spending 63% of its time in the kernel, which is typically unusual, and if we look down a bit further we see that a lot of the MySQL threads are blocked on this wait channel, bufobj, which, if you dig around and find out what it means, turns out to be waiting on buffer IO. At this stage we don't really know what's going on, but we'll come back to this example later and I'll show you another way to find out. Useful options for top are capital H, which breaks each process down by thread, so in this case there's a single process and it's showing five threads, and the S option, which shows kernel threads, the system processes.
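(The invocations that go with this are short. These are typical examples rather than anything from the slides, and the exact Ctrl-T output format varies a little between releases:

    # top -SH       # -H breaks processes down by thread, -S shows kernel threads

    $ sleep 60
    ^T              # Ctrl-T sends SIGINFO to the foreground process
    load: 0.04  cmd: sleep 1234 [nanslp] 1.23r 0.00u 0.00s 0% 1024k

The field in square brackets is the wait channel; nanslp here just means the process is sitting in nanosleep.)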
So when we have applications that are interacting with the disk, such as MySQL in the previous example, typically they're going to be limited by one of two things: either the bandwidth of the storage system, or the latency, which is the response time of an individual operation as opposed to the bulk throughput of all operations combined. Depending on the IO pattern the application generates, it's going to put a different kind of stress on the disk. If you're doing a lot of random reads or writes that require the head to seek back and forth, then there's going to be a lot of seek time in which the disk isn't really doing anything, and that limits the throughput you can get. Whereas if your workload is structured so that it performs IO to sequential blocks, then you're more likely to be limited by the transfer rate of the disk or of the controller.

There are some useful tools for studying IO operations. I've mentioned two of them here: iostat and systat can do this, and they have a lot of other metrics as well. One very useful command that FreeBSD has is gstat, which is part of the GEOM storage layer, and it shows you, for every GEOM storage provider, a breakdown of the operations that are currently pending. In this configuration it's sampling once a second, and it shows us each of the storage providers: ad6 is a SATA disk in this case, it has various partitions, and there's also a CD-ROM device that isn't doing anything. We can see here that the ad6 device is doing about 1200 operations per second, and there's a queue of almost 1200 operations backed up waiting to proceed. Only one of those operations was a read; the rest are all writes. The interesting statistic for determining whether a disk is overloaded is not the last column, as you might expect, but the milliseconds-per-read and milliseconds-per-write columns, which tell you how long, on average, each operation took to complete. That's what actually tells you if the disk is overloaded: whether operations are taking much longer than the steady-state latency. For example, the read here took only 11 milliseconds, but the writes were taking as long as 300 milliseconds on average, which indicates that the write bandwidth of the disk is being overloaded, and that points to where an issue might lie.

The percent-busy column, by contrast, only tells you what percentage of the time at least one operation was pending. In this case we have as many as 1200 operations queued up; these may be overlapped by the disk hardware, and at any given time there may be several operations in the process of completing, or as few as one. So percent busy only tells you what fraction of the time the disk was doing something; it doesn't tell you how hard it's working. This is a very common misconception when people look at gstat output: they think "my disk is 100% busy" when actually it may be able to do a lot more work by queuing up operations. The latency is the key thing to look at.

OK, so how do we find out which processes are actually doing the IO to the disks? It turns out top can do this as well; I'll come back to that in a second.
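(If you want to reproduce this kind of display yourself, the invocation is along these lines; the one-second interval and the device filter are only illustrative:

    # gstat -I 1s -f '^ad6'

Without the -f filter you get every GEOM provider, including partitions and idle providers, which can be a lot of lines on a busy machine.)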
Top has a switch, -m io, which displays IO usage instead of CPU usage, and you can sort in various ways, but sorting by total is usually the most interesting. It shows us here, in the same example as before, that the MySQL threads were each doing about 250 reads and writes per second, and if we compare this to the gstat output, we see there actually weren't many reads hitting the hardware: those reads were being satisfied from cache, but the writes had to hit the disk. It also shows you what percentage of the total IO each thread is responsible for. There are some other interesting statistics here: the first two columns show voluntary and involuntary context switches, and I'll come back to what those mean a bit later. One minor caveat: unfortunately, at the moment ZFS doesn't support these IO statistics. I'm not sure why; it's an outstanding bug, but you won't see anything here for processes doing IO to ZFS.

OK, so suppose we've identified a disk problem, or we think we're seeing high disk latency. What can we do to fix it? Well, a disk is typically a shared resource accessed by many processes, and if that sharing is what's causing your problem, the obvious answer is to make it not shared. You can reduce disk contention by moving IO jobs around: if you have two processes that are each doing IO to the same disk, either move them onto independent disks, or, if you can't restructure the application to use separate files and paths, you can look at striping multiple disks together with something like gstripe, so that you present one logical file system to the applications but it's actually backed by multiple physical devices, each of which can handle IO independently if things work out nicely.

Some caveats to be aware of when striping across multiple disks: you want to make sure that the file system boundary is actually stripe-aligned, and that the stripe size agrees with the block size of the underlying disks. The second is typically not an issue, but the first can be important: if, for example, you're using 64K stripes, you want to make sure that the start of the file system is also stripe-aligned on the disk, otherwise IO that is aligned from the file system's point of view will be split across stripe boundaries, and writing a single file system block can require writing two blocks, one to each underlying disk. That can cause performance problems when you're striping. There's a sketch of a gstripe setup below.

Finally, once you've determined that the disk hardware is the issue, there's always the option of adding faster or better hardware. I'm emphasizing that adding hardware should be a final step in the process, not an early step, because there are a lot of cases where adding hardware either won't solve the problem or can even make it worse, so you really need to understand what's going on before you get to that point. Something that's possible in some cases, but not all, is to restructure the workload so that you separate critical data, which needs to be persistent across crashes or restarts, from scratch data, which can either be reconstructed cheaply or thrown away. Temporary files, for example: usually you don't care if the application crashes and you have to restart, you can just forget about them, or carry on from the primary source.
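(Going back to the striping suggestion for a moment, a gstripe setup looks roughly like this; the device names, stripe size and mount point are purely illustrative:

    # kldload geom_stripe
    # gstripe label -s 65536 st0 /dev/ad4 /dev/ad6    # 64K stripe across two disks
    # newfs -U /dev/stripe/st0
    # mount /dev/stripe/st0 /data

The alignment caveat above applies if you then partition the striped device rather than using it whole: keep the partition offsets at multiples of the stripe size.)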
Once you can separate out scratch data like that, IO to the scratch data can be made unreliable, in the sense that you can use a faster setup which may not keep the data after a crash. For example, if you mount the file system asynchronously, then on an unclean shutdown, a sudden power loss or a kernel crash, the file system may get corrupted, but if you can just newfs it and carry on where you left off, that doesn't matter. You can often go one step further and store temporary or scratch data in memory: if you can eliminate the disk entirely and do IO to a memory file system, you'll typically get a large performance increase from that. On FreeBSD you would typically use a swap-backed memory disk, something like the command on the slide, and then mount it asynchronously. The "swap-backed" part is a little misleading: it will only use swap if there is insufficient main memory to satisfy the request. So if there is memory pressure, and the working set of the memory disk plus application memory exceeds physical RAM, then data will be pushed to swap, but that only happens when memory is low; it's not going to write to swap on every IO request. It only happens if needed, and typically you could then add more memory, or resize your application, to prevent it.

OK, so moving on to the next topic, which is network activity. netstat is one of the built-in tools for looking at what the network is doing: netstat -w will show you a per-second breakdown of inbound and outbound traffic on a given interface. So if your application is talking to the network and things are going slowly, you can check whether the traffic matches expectations; maybe there's not enough traffic coming in, and this can point you at an underutilized server, for example. netstat can also show you protocol errors: there are quite detailed statistics for things like UDP checksum failures, TCP retransmissions, a whole variety of corrupted-packet counters and so on. It can also tell you about interface errors, which are perhaps less common these days, but depending on the hardware, and if you have a bad switch involved, it can misnegotiate the duplex or line-rate settings, and then you typically get very bad performance on that link. And then there are various tools for studying in detail what's going on on the network: tcpdump is the classic one; tools like iftop are quite useful for showing, live, what traffic the local machine is generating and to where; Wireshark is a very detailed tool for packet and protocol decoding, that sort of thing.

So if you suspect a network problem, what can you do about it? Well, first, check that everything seems to be configured properly; packet loss is going to kill any sort of network throughput. For some applications, the size of the socket buffer may be important. There's a sysctl, kern.ipc.maxsockbuf, which sets the maximum socket buffer size, and some applications we've found, particularly old ones, set the socket buffer explicitly to some very small value, say 32K, which may have made sense 20 years ago but is no longer an appropriate default. So check the code, see if it's setting the socket buffer size explicitly; it can be set anywhere up to the maximum enforced by the kernel. For UDP applications, you may need to increase the amount of buffer space available for receiving UDP packets; the sketch below shows the relevant knobs.
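(To make those knobs concrete, both are plain sysctls. The values here are arbitrary examples, and an application can also request a larger buffer for itself with setsockopt and SO_RCVBUF or SO_SNDBUF, up to the kern.ipc.maxsockbuf limit:

    # sysctl kern.ipc.maxsockbuf=16777216      # ceiling on any socket buffer size
    # sysctl net.inet.udp.recvspace=1048576    # default receive buffer space for UDP sockets

)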
UDP will drop packets, or rather the kernel will drop packets, if the buffer is full. So if your application isn't able to drain incoming UDP packets quickly enough, you can get packet loss because the kernel buffer fills up, and depending on the application, you may need to increase the size of that kernel buffer to keep this from happening. The FreeBSD TCP stack is pretty much self-tuning, so there aren't magic sysctls to set; well, there are a lot of sysctls, but they typically don't need to be touched. One possible exception, an issue that has been implicated in performance problems on occasion, and I don't know if this is still the case, it may have been fixed: the TCP inflight sysctl, net.inet.tcp.inflight.enable, is on by default. It tries to do some bandwidth estimation for traffic on a LAN, and it's been rumored to cause problems, so you could try turning it off and see if it helps; it may not, it's likely it won't. But if you do find it's a problem, I'd actually like to hear about it, because there's been a question mark over this for some time. And don't rule out hardware problems, especially if you're using a fairly low-end NIC; these do fail. That's good advice to keep in mind generally: especially when we buy a very expensive piece of hardware, we like to tell ourselves it's never going to fail, and unfortunately that isn't true, so always keep in mind that this can happen.

OK, so the third topic I had on my list was device IO. This shows up in top as a large amount of time charged to interrupt processing in the header, and the vmstat -i command will break this down by device, so it will tell you exactly which interrupts are firing to generate the load. In this example it shows the various IRQs and the rate at which interrupts fired over the past second. We see we've got 1,000 interrupts per second firing on IRQ 19, and the plus after the device name indicates that it's a shared interrupt: the same interrupt line is being shared by multiple devices. This can be implicated in performance problems, particularly if you have one or more devices sharing an interrupt that still require the Giant lock. These are in some sense legacy drivers, but if you have two devices that both require Giant sharing an interrupt, then whenever the interrupt fires, both drivers need to wake up, grab the Giant lock, fight over it of course, and check whether the interrupt was directed at them. So in that sharing situation, Giant-locked drivers can cause a performance problem. If you find such a Giant contention issue from a shared interrupt, you may be able to get away with removing a device from your kernel if you're not using it. For example, if USB were implicated in a Giant contention problem and I'm not using USB, I can just remove it from my kernel and work around the problem. Sometimes you can resolve it by physically moving a device to a different PCI slot, for example, but that may not always be possible.

OK, so coming back to context switches, which were shown in the top IO display. The two types listed there are voluntary and involuntary. Voluntary context switches occur when a process blocks waiting for a resource: it makes the decision to try to acquire a resource, and that may block.
That's what's called a voluntary context switch. Involuntary context switches are when the kernel decides it's time for the process to stop running: it's had its chance at the CPU, and now it's time to run something else. Context switches can be indications of performance problems. They can be a symptom of resource contention in the kernel; for example, if processes are contending on a mutex, that shows up as a high context switch rate. They can also indicate an application design problem: for example, if you have a multi-threaded workload and you configure the application to use too many worker threads relative to the amount of work done per thread, each thread will run, do a tiny bit of work, and go back to sleep, or maybe block on a lock, and so you spend a lot of time switching between threads and not enough time actually doing work. So this can indicate issues either in the application or in the kernel.

Typically, applications interact with the kernel by making system calls, and this is another way in which things can go wrong. vmstat is a tool that can show you the rate of syscalls; again, this is a high-level overview of what's going on system-wide, and the relevant column is the "sy" column. The first line is an overall average since the system booted, and subsequent lines are instantaneous values over the previous second. This shows us that this workload was performing 700,000 syscalls per second, and even on a large SMP system that is a lot of syscalls, so if you saw this, it should raise a red flag and point to a problem. If you're doing a lot of system calls, each of them is work done inside the kernel, so it's charged as kernel CPU use, and we also see it appear in the system CPU percentage column here; it would show up the same way in top. So we're spending, in this case, about 60 to 64% of the time in the kernel processing syscalls, and that's unusual, so it points to a problem.

So how can you dig further? There are various tools, for example ktrace and truss (strace is another one), that let you attach to a process and print out the syscalls the process makes. It's quite a raw feed of data, so it can be a lot of output that you have to post-process or grep, but it tells you exactly what the process is doing every time it enters the kernel, and with this kind of magnitude of problem it will usually stand out very clearly: if you're doing 700,000 syscalls a second, typically it's a small number of distinct syscalls happening very frequently.

Another interesting and newer way of studying how processes interact with the kernel is the audit subsystem, which I think has been present since the FreeBSD 5 days, but it's relatively new and probably not very well known. It's intended primarily as an audit trail facility, so that you can, for example, get secure audit logs of what processes do and what system activities occur, such as logins and so on, but it can be configured to do very fine-grained logging of process activity, including logging each syscall with its arguments.
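(Going back to ktrace and truss for a moment, attaching them to an already-running process looks roughly like this; the PID is made up:

    # truss -p 1234        # decode and print syscalls as they happen
    # ktrace -p 1234       # start tracing the process to ktrace.out
    # ktrace -C            # stop tracing
    # kdump | less         # decode ktrace.out afterwards

truss prints as it goes, so it's handy for a quick look; ktrace logs to a file, so it's better for collecting a large trace to post-process.)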
So audit can be great: if you turn it on, you get a feed of what's happening for each process, and you can filter it in various ways. With high-volume data flows like this, you want to try to log to a memory disk if you can, because you don't want the logging IO to slow down the processes you're observing. So here, if we run ktrace on the MySQL process that was generating the IO, we see that over and over again it's doing a pread syscall and reading a few bytes back from the kernel. What this pointed to, in the end, was that the application was misconfigured: the caching parameters weren't set up properly, so every time it wanted to read from the database it had to read from the file system instead of caching the data in userland. That kind of thing is going to kill performance: every database read has to cross into the kernel, and while it is then satisfied from cache, it's cached in the kernel, which is too far away from the application to be high performance.

OK. So I've mentioned some aspects of kernel activity that can be implicated in performance issues. Something else that can show up in your workloads, though hopefully only rarely, is high lock contention on kernel mutexes. This can indicate a kernel scalability problem; it can also indicate an application problem again. Perhaps the application is designed such that it has high lock contention on its pthread mutexes; these will show up in the kernel because of the way the FreeBSD userland locking implementation works, since in some cases it enters the kernel to block. So it can, again, indicate either a kernel or a userland problem.

An interesting tool for studying what processes are doing in the kernel is procstat, which appeared in FreeBSD 7. It has an option to obtain a stack trace of processes that are blocked in the kernel. So if you see from other signals, like top or some of the other tools we've looked at, that a process is spending a lot of time in the kernel, procstat will show you exactly what the kernel stack trace is, and you can hand that to a developer and it gives a lot of information about what's going on. It also shows various other useful information about the process.

For looking at kernel lock operations, a very useful tool is lock profiling. This is an option you can compile into your kernel, and when you turn it on, it records every lock acquisition, whether it's a mutex, a read-write lock, or one of the various other locking primitives we have in FreeBSD. There is a performance overhead while it's actively profiling, which depends on the hardware timecounter, because it needs to read the time every time a lock is acquired and released: that's two timecounter reads per acquisition. So generally you want to use the fastest available timecounter, which is usually the TSC on modern hardware; on very modern hardware the TSC is actually usable on SMP systems, on older hardware it wasn't. If you can get away with it, switch to it with the sysctl here. By the way, my slides will be available; I'll be providing them to the FOSDEM people afterwards, so they should be on the FOSDEM website. OK, so how do you use this? You enable a sysctl, run your workload, and then turn it off, and there's another sysctl, which I'll show on the next slide, that dumps the output.
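(For the record, the knobs look roughly like this on a kernel built with the LOCK_PROFILING option; the exact sysctl names have varied a little between releases, so treat this as a sketch:

    # sysctl kern.timecounter.hardware=TSC    # use the cheapest timecounter, if it's reliable on your hardware
    # sysctl debug.lock.prof.enable=1         # start collecting
    ... run the workload ...
    # sysctl debug.lock.prof.enable=0         # stop collecting
    # sysctl debug.lock.prof.stats            # dump the per-lock statistics

)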
The data that's recorded includes the file and line in the source code where the lock operation occurred, the mutex acquire or sx lock acquire and so on, and then it aggregates useful statistics, like the maximum time the lock was held. That can show you, for instance, that there's a code path doing a very slow operation while holding a lock, and while that slow operation is in progress, nothing else can acquire the lock. It shows you the total time, across all acquisitions, spent blocked waiting for something else to release the lock, the average times, how many times the lock was contended, and so on.

So this is what typical output looks like. This was a somewhat contrived example, I had to fiddle the numbers actually, but the typical thing you see, sorting by total wait time, is that in a high-contention situation there's one particular lock we spend a lot of time waiting for, and that can point to a bottleneck. In this case there was a bottleneck involving the name cache mutex: it was acquired every time something tried to stat a file or resolve a name cache entry. We were able to convert it to a read-write lock, so most operations can now take a reader lock instead of an exclusive lock, and in 8.0 we fixed this and some workloads are now seeing 20% performance increases from it.

There's a tool I won't have time to go into; it would be an entire talk by itself, and in fact it was an entire talk earlier today, which I guess most of you missed out on, unfortunately. DTrace is part of FreeBSD as of FreeBSD 7.1. It's a system that Sun introduced in Solaris, and it's now part of OS X and FreeBSD, and possibly other operating systems in the future. It's a very powerful way of writing small scripts that are executed on probe events. There are a whole bunch of probe events defined, either in the kernel or in userland; these can be things like function entry and exit, or you can define your own trace points for higher-level operations, like beginning an IO, ending an IO, that sort of thing. You can attach a script to any of these probe events, and it can aggregate statistics like how many times it was called, what the average value of some argument was, or what the latency between the beginning and the end of an operation was. So it's a really powerful way of drilling down and finding out exactly what's going on anywhere in the system. At the moment it only supports profiling the kernel in FreeBSD 7.1; hopefully in the near future we'll also finish userland tracing. But this is a great thing that you should check out. There's a "DTrace Review" video on YouTube, and Sun has some great documentation as well, so have a look at it, it's really cool.

So, modern CPUs have a lot of performance counters in the silicon, and FreeBSD has an interface for accessing these and using them to profile application and kernel workloads. You can profile things like: where did the CPU spend most of its time retiring instructions? Where did the cache misses occur? Where did it mispredict branches? That kind of thing.
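(To give a flavour of DTrace, here's the classic one-liner that counts syscalls by process name; it assumes the DTrace kernel modules are loaded, for example with kldload dtraceall:

    # dtrace -n 'syscall:::entry { @counts[execname] = count(); }'

Let it run for a while, press Ctrl-C, and it prints a table of which programs made the most syscalls, which is a quick way of chasing the kind of 700,000-syscalls-per-second problem from earlier.)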
Coming back to those hardware counters: the hwpmc tooling in FreeBSD can either do instruction-level profiling, where it just tells you that some percentage of the time the instruction pointer was at this line of code when the event happened, or it can reconstruct the call graphs of the process, so it can tell you exactly which functions were called to reach that point. So it's a great way of profiling what your user application is doing, or what the kernel is doing. It has very low overhead when it's running, because it uses facilities built into the hardware rather than having to emulate things or do the profiling in software. So it's a really useful tool. And it post-processes into gprof format: if you're familiar with gprof, it accepts the pmcstat output. Again, there's a short how-to about how to use this on FreeBSD.

There's also, as I mentioned in passing, a nifty tool you can use to visualize scheduler activity, so you can see how your processes are being scheduled and where they're blocking; maybe they spend a lot of time blocking on a resource, or yielding CPUs to each other. It shows you graphically exactly what each CPU is doing, and why decisions were made to change the scheduling. FreeBSD 8 has a debugging facility called sleepqueue profiling, which aggregates those wait channels I've mentioned a few times, so it can show you how many times processes blocked on any given wait channel. This was a typical output here, and it may show you that the system is spending more time waiting on certain resources than you thought it should.

OK, so a few words about kernel tuning. FreeBSD is largely self-tuning, so there's not a lot you need to do to a typical system to make it work well out of the box. The defaults are pretty good, and things will auto-size and auto-tune based either on the hardware it sees or, in some cases, on the workload it encounters. The best advice for getting performance out of a FreeBSD kernel is to run a modern kernel: we put a lot of work into FreeBSD 7, and there's a lot of ongoing work on improving performance, so a good first step is to make sure you're running the most recent version. The ULE scheduler was new in FreeBSD 7; well, rather, it was new a few years ago, but the version in FreeBSD 7 was rewritten, and it's now free of the performance problems it had in the past and actually performs very well. It's the default in FreeBSD 7.1, so this is only relevant if you're using 7.0, but it will typically give a performance increase on most workloads. A lot of that comes from the work it does to maintain CPU affinity, which can really help: if you can keep the caches warm between scheduling decisions, you get much better performance than if you keep having to satisfy cache misses.

FreeBSD 8 has a feature called superpages, which is the equivalent of Linux's hugetlbfs, using larger TLB entries. The difference in FreeBSD is that it's all automatic: you don't have to do any manual configuration or make any changes to the application. The kernel will automatically promote 4K pages to larger pages on demand and deal with the fragmentation issues that can arise, and so on. And it's actually on by default now, so if you're running FreeBSD 8, superpages will be on by default.
Superpages can also give, depending on the workload, a 10% or 20% performance increase, especially for very memory-intensive workloads; Java, for example, really benefits from it. If you're running a development version like FreeBSD 8, then kernel debugging is enabled by default, and obviously that's not going to help performance, so make sure you turn it off if you're trying out the development version. Some applications do strange things with the timecounter; for example, Java 1.5 makes an insane number of gettimeofday() calls, for some reason it wants to know exactly what time it is all the time, and that can actually matter if you're using a slow timecounter. So it's somewhat workload-specific, but keep it in mind as well.

OK, so in my last few minutes I want to say a few words about how to go about benchmarking a system. Suppose you've identified what you think is a problem, and you have an idea about how to fix it, or at least what to try. Benchmarking turns out to be one of these annoying things to do properly, so people are tempted to skip steps, and that often bites them afterwards. The key is to identify a self-contained workload that you can repeat as many times as you need to. When you're trying to measure things, the idea is to minimize the number of variables at any stage: you want to keep everything constant and vary only one thing, and if the workload itself is varying, then you're changing more than one thing at a time. So you want the workload to be constant and repeatable, so that you can demonstrate the problem clearly and then make changes one at a time against that workload.

As for measuring: if you have a metric, a number you can get out of the benchmark, and you want to compare the numbers, it turns out humans are very bad at comparing sets of data by eye. We tend to miss patterns, we read our own meaning into things; if we think something should be an improvement, we tend to look only at the data that shows an improvement and ignore the data that doesn't. So you really want to trust statistics to do this, and FreeBSD has a really useful tool called ministat whose only job in life is to take two or three sets of data, say the output of a benchmark before and after you made a change, and do statistical tests to determine whether those data sets are statistically distinguishable, whether they come from different sources. It uses something called Student's t-test, which can distinguish data coming from different sources, for example a kernel with a performance problem and a kernel without one. When you're looking at a set of data where some numbers are higher but some are lower after the change, and you're not sure whether it made a difference, it's very easy to convince yourself the numbers are going up when the data may actually be insufficient, in a statistical sense, to draw that conclusion. ministat will tell you when that happens, and then it tells you that you need to collect more data. So, as a simple example, I ran a MySQL benchmark with the two schedulers, the older 4BSD scheduler and the ULE scheduler that's now the default in FreeBSD 7.1.
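(For reference, the invocation for a comparison like that is just ministat with one file of results per configuration; these file names are made up, and each file contains one benchmark result per line:

    $ ministat -w 60 sched_4bsd.txt sched_ule.txt

)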
The numbers that I got out represent the number of transactions per second the database did on this benchmark, so in this case higher is better. Passing them to ministat, the -w 60 just says width 60, which fits my terminal. It draws a little histogram, which is quite nice, showing that the x's, which are the 4BSD numbers, are down here, and the plusses, which are the ULE numbers, are up here. It shows you the average and also the median, which in this case are on top of each other so we can't see them separately; A is the average and M is underneath it. And it shows you one standard deviation either side of that, so you can see the variance of the data. But the most interesting thing for benchmarking is that it actually tells you whether you can be confident these numbers represent a real change in the data, rather than just measurement fluctuation or some random event. The way you read this is that you can be 95% confident that the second set of numbers, the ULE numbers, are sampled from a distribution whose mean is 29% higher than the first one, so in this case it means we're getting a 29% improvement in transactions per second on this benchmark. So ministat takes your two sets of data, which might be noisy and hard to interpret, and gives you a concrete interpretation of them.

OK, so I mentioned that throwing hardware at the problem should be done only after you've exhausted the actual debugging of the problem. It's tempting sometimes to throw hardware at the wrong problem: if you're adding more CPU cores but your problem is a slow disk, of course that's not going to help. Sometimes adding RAM can help, if you have an application working set that's too large to fit in memory, or if you're doing a lot of reads from disk and could benefit from extra caching but the working set is too large to keep the cached data in memory; you can fix that by adding more RAM. It's interesting to point out that sometimes adding more CPU cores can make a workload slower: if you have a workload that is highly contended on some shared resource, such as a lock, then adding more CPUs that all come in and contend on the same lock is going to slow the other ones down. So you can actually make resource contention worse by adding CPUs, which is sometimes counterintuitive.

OK, so hopefully I've given you at least some approaches to take in investigating problems you might encounter on your systems. If you're still stuck, then hopefully the kinds of techniques and commands I've shown here will at least be helpful to a developer. If you get to this point and still don't know what to do, provide the output of these commands: show what top is showing in a typical snapshot, show whether there's high CPU usage in the kernel, maybe include vmstat output, maybe turn on lock profiling. The more data you can provide to a developer, the easier their job will be, and we really like it when you come to us with all this output, because then we don't have to do more round trips of "please run this, please run that". At the very least, even if you can't get to the bottom of the problem yourself, you'll have gone part of the way.
And in the context of FreeBSD, if you need help with this kind of thing, the best list to ask on depends on how technical your understanding of the issue is: general support questions are best asked on the freebsd-questions list; for more technical questions, if you have some insight into the code, maybe you've already debugged it a little bit and you need more help from somebody who understands the source code, come to the freebsd-hackers list. And as I say, hopefully, given this kind of information, we'll actually be able to help you work things out. OK, so with that, I'll finish, and I'll be happy to take any questions anyone might have. Thanks.

Yes? Sorry? OK, so the question is, what is in the files that I passed to ministat? It's just a raw list of numbers. In this case, the first set of numbers was in the range of roughly 2,137 to 2,161; these were the output of my benchmark. My benchmark did a whole bunch of stuff and spat out a number at the end, and that number represents, in this case, transactions per second on this database workload. But the important thing is just the number.

OK, so the question is, how do you get this data from the benchmark? That really depends on what the benchmark is. You need to have a way of turning your workload into a number, and that number could be the bandwidth the network interface was able to sustain, or the number of queries per second; it really depends on your workload. Whatever it is, turn it into a number, and ministat will tell you whether the two sets of numbers are likely to be sampled from different distributions, meaning that something changed from one to the other. And it will also tell you when they don't differ, which is when you didn't tune the right thing. Any other questions? OK, I'm not seeing any hands, so thank you very much.