Good morning everyone, thank you for attending my talk this morning. I'm very sorry that I wasn't able to attend in person; that was only decided at the last moment.

The software I'll be describing is available in all distributions that are shipping a recent version of LVM2 and device-mapper, so Fedora, RHEL 7, an update on the way for RHEL 6, and most other distributions that keep fairly close to upstream.

So, just to set the theme for today's talk and provide some background, the project I want to talk about today is DM stats, or device-mapper statistics. This is something that exists in two components: a kernel piece that has been around for a couple of years, and the new user space support, which I'll be talking about today. Obviously IO statistics are not a new thing; they've been around for many years, really coming together in the 1980s to form the sort of statistics we see today.

So obviously this is something that needs kernel support, and the device-mapper case is no different. For the basic IO stats that we have on Linux, the block layer was modified to keep track of certain events. When an IO is submitted or when it completes, an event counter is incremented, and you'll see that most of the performance metrics use this very simple counter-based model. As events occur, the counter is incremented, and we can then look at the counter value, or more importantly how it has changed over a particular period of time, to determine the level of activity for different events going on in the system.

The kernel side, as with most things in the kernel, likes to keep things simple, and the counters are all implemented as plain integer values. Now, clearly, when we come to look at this as human beings, or to carry out additional processing and calculation with the data, that may not be the most convenient form to work in. We may want rates expressed in terms of human-readable quantities like megabytes or gigabytes, and to have those rates expressed in familiar units of time, per second, per minute, or whatever is convenient. So we need the user space tooling to provide this quantisation to time and rate conversion, and also possibly to do higher-level tasks, things like aggregation and sorting.
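To make that counter-and-delta model concrete, here is a minimal shell sketch of what the user space tools are doing under the covers: take two snapshots of a raw counter, subtract, and divide by the elapsed time. The device name (sda) and the three-second interval are arbitrary values chosen for illustration, not anything from the talk.

```sh
# Sketch only: turn a raw, ever-increasing kernel counter into a rate.
# Field 6 of a /proc/diskstats line (after major, minor and the device name)
# is "sectors read"; a sector in this interface is always 512 bytes.
dev=sda
interval=3

read_sectors() { awk -v d="$dev" '$3 == d { print $6 }' /proc/diskstats; }

before=$(read_sectors)
sleep "$interval"
after=$(read_sectors)

echo "$dev read throughput: $(( (after - before) * 512 / interval / 1024 )) KiB/s"
```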
If you have used some of the more advanced options for iostat that allow you to group devices together, you may be familiar with the kind of ideas I'm talking about here.

So, to look at Linux in a little bit more detail and what we have today, or rather what we had prior to the introduction of dmstats: the current IO statistics framework has been present since roughly late 2.4 or early 2.5. There are some minor differences between the two, but for today's purposes 2.4 is kind of historical anyway. What we have now is a set of eleven event counters. These are documented in the iostats.txt file in the kernel sources, or in the package documentation that you'll find on your system. They track quantities like the number of reads completed, the number of writes completed, and the number of queue merges, where we see that two IOs are adjacent and stick them together to build a larger IO to send out to the hardware.

These low-level raw values are then consumed by user space tools, historically primarily the sysstat package. That includes things like iostat, as well as sar, the system activity reporter, and its data collector component. Today these tools look a little old-fashioned in some ways. They've been around in more or less their current form for 20 or 30 years, and if you look back to tools that were available on commercial Unix platforms in the 1980s, they're generally very similar. Today there are also more modern alternatives that may suit current usage better. PCP, short for Performance Co-Pilot, is a project started by SGI; it's a fairly large set of tools for everything from data gathering, pulling data into the system, recording and archiving, and then presenting or transmitting that data for further analysis and use. PCP is a large topic, so I won't be going into it any more in today's talk. There are some good talks on YouTube, and it also tends to come up at conferences, largely because it's a relatively new tool that people, at least in the Linux world, are not so familiar with. There are also higher-level management tools, things like the OpenLMI suite, which supports the DMTF CIM data model. These can also read out the statistics from a Linux system and provide them for further processing and use.

So this is the state that we have today, and it's been a good model. It's served us well for a number of years, but it does have some limitations. In particular, there is just one counter set per device, regardless of how large the device is or how it's composed. If you've ever used Linux MD or LVM on a system with more than one physical disk, you're familiar with the idea that a single block device may be composed of multiple component devices. Perhaps it's a RAID array, so we're distributing data and possibly parity information within it. Or it may simply be a composite device, for example a logical volume that has been extended several times, so that it has disjoint data regions, possibly spanning multiple disks. In these sorts of situations it may be useful to have a little bit more insight into particular areas of the disk. Without that, we're just getting very coarse averages over what may be a very large physical device.

These counters are also shared by all users. There's a simple technique used that allows us to effectively share this resource among multiple users without interference: the counters never reset, they just constantly increase. (The raw counter layout is shown in the sketch below.)
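For reference, this is roughly what that classic counter set looks like when read straight out of the kernel. The device name sda is just a placeholder, and newer kernels append extra discard and flush counters after the original eleven.

```sh
# Dump the raw counters for one device (substitute your own device name).
grep -w sda /proc/diskstats
# After the major number, minor number and device name, the classic fields are:
#  1 reads completed    2 reads merged    3 sectors read      4 ms spent reading
#  5 writes completed   6 writes merged   7 sectors written   8 ms spent writing
#  9 I/Os currently in flight            10 ms spent doing I/O
# 11 weighted ms spent doing I/O
```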
This means that anyone using the counters needs to maintain two copies of the data: the last values that we saw and the current snapshot. We then take the difference between the two, and that gives us the observations for the current interval. It's a relatively small overhead, but it is an overhead, and it does complicate code that needs to read and manage these values in user space.

There's also a relatively fixed set of performance counters here. This is partly down to historical reasons. If you remember the introduction in 2.4 and the later 2.5 changes, you'll know that they're in different places. The reason for this is that once we introduce a file in /proc, and in the case of IO stats this is /proc/diskstats, that becomes part of the ABI, part of the interface with user space. That means we can't freely make changes to it to add additional fields. We're somewhat more free in the device-mapper world here, in that we use careful versioning of both the interaction with the kernel and the library in user space, which means that we have a much better path to introduce new additions or changes in a controlled manner without breaking older systems.

Another limitation of the classical disk stats is that they use the kernel jiffy counter. If you're familiar with the kernel's timekeeping, you'll know that we have multiple different sources of timing information now. The jiffy counter is a very coarse, roughly millisecond-resolution counter that traditionally ticked up every time the timer tick went off. Today you may not have a regular timer tick, but in principle it's just the same. There are two real problems here. Firstly, we have a limited resolution: we can't get better than millisecond precision. There's also an accuracy problem. Depending on what's happening on the system, and especially in virtualised environments, the jiffy clock may drift, so this time a millisecond may be a little bit too long, next time it might be a little bit too short. Depending on what you're doing with the data, this kind of variation and jitter may be problematic.

The last major problem is that there's very little insight into the latency that your IO is experiencing. We do have some averages produced, the average service time and the average time that an IO waits before being issued, but these are just coarse averages. If we're seeing very high rates of IO, and remember we have devices today that are capable of a million or more IOs per second, this single average value gives us very little insight into how the latencies are actually distributed. And often in today's performance analysis it's that distribution, at least some overall rough shape, that we're mainly interested in.

So this has been the state of the IO statistics support in Linux for some time, and as I said it has largely served us well. We should examine why this is now becoming a problem, or why we would like some more capabilities to get better insight into what's going on today. A major part of the reason is that the storage stacks have changed. To borrow a phrase, it's not your father's storage anymore. Today we have things like software-defined storage. If you've used things like Ceph or Gluster, you're familiar with the idea here.
Rather than building large proprietary hardware disk arrays and presenting them over Fibre Channel or some other standard protocol, we build much cheaper arrays of commodity hardware with locally attached storage, and we then use software and networking to make that storage available to the client systems.

We also see much more heavily tiered storage architectures today. One of the first ideas in tiered storage is a sort of layered cache, where we have a hierarchical storage model and data is automatically moved between tiers depending on its usage patterns. At the front end we may have in-memory caches, moving through fast local storage like SSD or flash made available over the PCI bus, and eventually we may arrive at something like a tape silo at the slowest, largest end of the hierarchy. Where these tiers are implemented in Linux, we may want better visibility into the path of an IO as it tracks down through those tiers and layers.

Obviously I work on device mapper, and two of the major additions in device mapper over the last few years have been the caching and thin provisioning targets. The thin provisioning target is also responsible for the much better performance we now have with device-mapper snapshots. Again, these involve breaking storage up and assigning certain data to fast, low-latency, high-throughput storage, while other data is moved out to slower back-end storage. With thin provisioning we may have volumes that are initially only partially provisioned, and then become fully provisioned, or more provisioned, as IO is sent down to different regions of the device.

This is somewhat related to the next point. In fact, the example I have here, which is RHEV, predates the thin provisioning capabilities in device mapper, and it uses a different mechanism for multi-tenant storage. This is an interesting use case for device-mapper statistics, because of the way that RHEV operates: it takes a single LV, slices it up into pieces, and each piece may be assigned to a different virtual machine. So we do have this situation where we have one block device containing a number of logical disk devices that belong to different virtual machines. Again, this is somewhat related to the container virtualisation world, where we may have many images all packed onto a single device.

So, this leads us on to what we might wish to monitor and observe on these devices. The first point is the one that the old iostat and disk stats mechanism is incapable of: it's not able to perform monitoring on subregions of devices. We might want to use this to identify hotspots, a region of the disk which receives a large volume of IO and may be becoming a bottleneck. If we were in an environment with cache devices available, that's the sort of region we might consider caching in order to give it a performance boost or to remove a bottleneck. We may also want to carry out object discrimination. One block device, say if it's got a file system or a database stored on it, is going to contain many different objects. If the file system or database can tell us where those objects are, then we can provide statistics specifically for the regions of the disk that are used by each object. As I mentioned earlier, latency characterisation is another important area. One way to make this available is to track some kind of a histogram.
A histogram shows us directly the distribution of a set of values by dividing the value space up into buckets and then accounting each individual IO into whichever bucket it fits, so we build up an overall picture of how the latency is distributed among IOs.

We may also want to grant different users on the machine access to the statistics facility in a way that doesn't interfere with other users. As we said, the old iostat and disk stats approach uses the simple trick of global non-resetting counters, but when we're providing some of these additional mechanisms, like latency histograms, we need something a little bit more sophisticated. We also want to be able to monitor, and potentially respond to, the overheads that our statistics collection is imposing, whether in memory, CPU or other terms. Since device-mapper statistics provides much greater flexibility and more options, it does naturally impose a higher overhead. We'll take a look at how we can track that in a moment.

Pressing on, then, to a brief discussion of the dmstats tool. As I said, this is available now in most recent distributions. The kernel side of things has been available since 2013. It introduced the general dm-stats interface, which uses the device-mapper message facility. It allows us to set up arbitrary regions of devices for statistics tracking, and we can also divide those regions up into chunks. This was updated in 2015 to add two important new features: higher-precision nanosecond-resolution counters, and user-defined latency histograms. What's available now in current device-mapper packages consists of two components: a user space library that application programmers can use to directly access the stats data, and the dmstats command, which allows us to interact with the facility from the terminal. The command allows us to create, delete and monitor regions, as well as to print the current values in converted form as provided by the library.

The actual command interface, if you've ever used the dmsetup program, is very similar. We have the dmstats command followed by any options or switches, and then a sub-command to indicate what we want to do: create, delete, and so on. The reporting functions of dmstats use the existing device-mapper and LVM reporting framework. Again, if you've used that to set the fields that you want to use, or any sorting or selection criteria, that's also available for the statistics. There's a manual page, dmstats(8), with examples and full usage information.

One of the main ideas in dmstats, and I briefly mentioned this on the last slide, is this notion of regions and areas. A region is just a range of sectors that we're tracking statistics for, and we can break that down further into areas. What we mean by an area is that this portion of the disk will have its own independent set of counters, so we can tell whether there's a lot of IO happening in one area, in the adjacent area, or in any other area. You can create an unlimited number of regions of any size that you wish. Obviously these do impose a memory overhead, and you'll be limited by the amount of physical memory available on your machine. There's also a little safety check in the kernel so that the statistics data is not allowed to exceed 25% of available RAM. To create a region, we use the dmstats create command, and we can control the number of areas, or the size of the areas, that we create. One thing to note here: you do have to set all the options you want when you first create a region (a couple of example invocations are sketched below).
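To give a feel for the command flow, here is a hedged sketch rather than the exact demo from the talk: the device name vg00-lv00, the area counts and sizes, and the report interval are all made up for illustration, and the option spellings should be checked against dmstats(8) on your own system.

```sh
# Track the whole device as a single region split into 16 equally sized areas,
# each with its own independent counter set.
dmstats create --areas 16 /dev/mapper/vg00-lv00

# Alternatively, fix the area size rather than the area count, and request a
# latency histogram with 10/20/30 ms boundaries (four buckets: 0-10, 10-20,
# 20-30 and 30+ milliseconds).
dmstats create --areasize 1G --bounds 10ms,20ms,30ms /dev/mapper/vg00-lv00

# List the regions that now exist, then report the counters every 5 seconds,
# 12 times over.
dmstats list /dev/mapper/vg00-lv00
dmstats report --interval 5 --count 12 /dev/mapper/vg00-lv00

# Remove a region (here region 0) when it is no longer needed.
dmstats delete --regionid 0 /dev/mapper/vg00-lv00
```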
If you don't, and you want to change them later on, it's not a problem: just delete the region and then recreate it.

So, if I just switch over to... okay, I'm getting a message that you may have lost audio, but I hope not; if so, hopefully you can read my lips. So, the reporting, as we saw: we can specify a count, and you can also specify an interval. If you specify either one, it will cause the reports to repeat; otherwise, if we just run the report, you'll get a single snapshot of the current values.

Let's very quickly take a look at latency histograms. These are one of the most requested features in dmstats. It took a little bit of time to get the kernel side of the support correctly worked out and merged, but this is now available in the current device-mapper and LVM2 releases. To specify a histogram, we give the bands, or the bin boundaries. So we've got an example here with 10, 20 and 30 milliseconds. What that is doing is creating four buckets in our histogram: one from 0 to 10, one from 10 to 20, another from 20 to 30, and a final one for everything 30 and above. These will then be reported as either a relative or an absolute count when we use the histogram option to the reporting tools. If I just switch back to a terminal for a moment... That allows you to carry on and create new regions with new definitions.

Looking to the future, we've got a number of feature requests being worked on at the moment: grouping and aggregation, so that you can combine different statistics more flexibly; direct integration with the LVM tools; an automated tool to detect hotspots; and hopefully, in the next batch of updates, a sort of real-time top-style display for the statistics information.

So, I'm sorry, I've run on with my talking as usual, so I think we've got about 30 seconds, maybe, for questions. Do feel free to pass those over. They'll be relayed to me via IRC and my fearless video assistant Mark Clizir in our Farnborough office. So please go ahead if there are any questions.

They've lost sound again. They're trying to sort the audio out and collect questions. Should I stop and start? They've not said to stop and start. I don't honestly know what is up with the audio; I'm recording at a good level here. It may have been that, but I'm not sure. I just think everyone's walked out by now, so there won't be many questions. There is a link that Alistair sent out earlier, but I can't connect to it while I'm sitting here with my arms up. I kind of feel like switching the video off. I'm going to switch over to the other camera.