Hi everyone. I'm Bryan Gurney. I'm a senior software engineer at Red Hat, primarily working on block storage things like VDO, the Virtual Data Optimizer, and now Stratis, a volume-managed file system package. During all that storage work I have to do a lot of block storage examination and diagnosis. First, how many people know what a block storage device is? Okay, pretty good. Most people. So a block storage device is a device that's accessed in units of sectors or blocks instead of in a bitwise fashion: hard disk drives, solid state drives, and, heaven forbid, tape drives. Hard disk drives can handle random access better than tape drives can, and solid state drives, because they don't have to move a mechanical arm or anything, have even better random access performance. So what do block devices do? The main goal of any storage is to be able to write something down and then read it later. If it fails at that job, it's not really long-term storage or long-term memory. The main types of accesses to block devices are classically reads and writes, and then, later on, discards, which are a special class for things like thin-provisioned devices that advertise more space above than they actually provide below, and also for devices with some sort of translation layer between what they present and what's on the media: flash drives, NAND-based solid state drives, 3D XPoint, those kinds of drives. When the initiator sends a discard request, that tells the device, "you can free whatever is underneath that block now, because the initiator no longer needs it." Then the device can deprovision that block, or do garbage collection behind the scenes. For solid state drives this is very important for performance reasons. But most applications don't directly access the individual block numbers of a block device.
They would have to keep a database of what data goes in what block, whether it fits in a 512-byte block or needs more of them, all of that kind of stuff. And then you get into things like fragmentation, and how big that mapping database itself gets. So most of the time you have higher-level abstractions, above the kernel's raw device, that store things on a block device. The most common one is file systems: ext, XFS, FAT, and on the Windows side NTFS, things like that, where you can create a file in a folder, in a more human-readable, less-complex-to-think-about hierarchy, and store data without worrying about how many blocks it takes; the file system takes care of that. Similarly, object stores provide a way to say "this blob of data of this arbitrary size has this ID, store it", and the object storage driver stores it and acknowledges. And a logical volume manager means you don't have to provision everything ahead of time. You can transparently add devices to extend a logical volume, without the old migration dance where you had both old and new devices connected to one system and manually shuttled data from one to the other. It makes storage a lot easier to manage. At a basic level, programs access block devices via read and write operations, which I have shown here with a simple dd command writing one zeroed block of size 4096 to /dev/nvme0n1 (I always forget whether it's seek or skip; it's zero in this case anyway). So that's directly writing a block. And the file system keeps a metadata index of files and their block locations, and there are programs that let you find out which blocks actually contain a given file. For ext, that's the filefrag -v command. So you can see logical offset 0 to 4095; in this case the file is 16 megabytes.
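The dd write just mentioned can be sketched like this. A scratch file stands in for the real device here so the example is safe to run unprivileged; against real hardware the target would be the device node itself (e.g. /dev/nvme0n1).

```shell
# Stand-in for /dev/nvme0n1 so this sketch runs without root.
DEV=$(mktemp)

# Write one zeroed 4096-byte block at offset 0 of the output.
# (seek= positions within the output; skip= would position within the input.)
dd if=/dev/zero of="$DEV" bs=4096 seek=0 count=1 conv=notrunc 2>/dev/null

stat -c %s "$DEV"   # the stand-in file is now 4096 bytes
```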
And so that's in 4096-byte blocks; I like how it says right there how big the block is. And you can see it's at physical offset 1167360027695. The file system takes care of that mapping. But if you had something like a driver reporting a problem with one of those blocks, you may be interested in that, and you have to find out how to get from here to there. XFS has a similar way to report those block ranges and allocation groups. But then you get into a case where you have some driver reporting an error at block 11677694. Well, what happened? That's the classic question in system administration and in diagnosing bugs, and sometimes you need to get to the lowest level to find out what's happening with block 12647 or whatever. In the Linux block layer, the way to do that is called block tracing, and there are two basic programs, blktrace and blkparse; we usually just say "block trace" and "block parse". So here are the basics of a block trace. This is the text output from blkparse. The blktrace command runs against a block device and outputs binary data files, and blkparse turns those into these columns. This is a bit of an eye chart, but the basic columns are: which device it is, in this case 8,0, and we can see that's /dev/sda by major:minor; which CPU it happened on (this machine has 24 CPUs in this example); a sequence number, which is per CPU, so it's monotonically increasing for each individual CPU; and what time it happened, in seconds and nanoseconds. The nanoseconds are important for very, very fast devices like non-volatile DIMM (NVDIMM) devices, because they can have latency that approaches hundreds of nanoseconds, well under tens of microseconds, so that extra precision is really necessary these days. And you can also see the process ID of who did it, and some kind of action for what happened.
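Getting "from here to there", from a file system block number to the sector numbers that the block layer reports, is mostly arithmetic. A sketch, with made-up illustration values for the filefrag block and the partition start:

```shell
# filefrag -v reports physical offsets in file system blocks (4096 bytes here);
# blktrace reports 512-byte sectors. Both numbers below are hypothetical.
fs_block=2918400     # physical block from filefrag -v
part_start=2048      # partition start sector, from e.g. fdisk -l

# 8 sectors per 4 KiB block, plus the partition's offset on the whole disk.
sector=$((fs_block * 8 + part_start))
echo "$sector"
```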
These comments are actually in the source code for blktrace: what happened, who did it, and on what CPU it happened. The core of a block trace entry is the action and what's called the RWBS field: was it a read, a write, or a discard, and what stage was the request in? Then there's extended information; the sector location, written as "sector + length", is standard. That indicates the starting sector and the length in sectors. So an offset of 0 would be the lowest block, and "0 + 64" would be 64 sectors long; that would be 32 kilobytes, I believe. Yes, 32 kilobytes for each of these requests. This one happens to be a synchronous write; you can tell from the WS. I have a decoder card for all of those in the slides coming up. As for running blktrace itself, the basic process, for which you'll need root access, is: if the following command fails, run mount -t debugfs debugfs /sys/kernel/debug. After that, run blktrace -d <device> on the specific block device you're measuring. If it's a partition, trace the root block device, because you'll want to track the remap operations from the partition to the block device, or from a logical volume or device mapper device (a bio-based device) to the actual block device. Once that's saved, parse the binary blktrace files by running blkparse -i. There will be one file per CPU in the system, so if you have 128 CPUs, there will be 128 files. Their sizes may vary, because some CPUs will get more requests than others. In the blkparse command, only specify the first one; it'll find the rest with the same common file name, zero, one, two, three, four, up to 23 in the case of my test system. So, that decoder card for all of those actions. These are the most relevant actions in the life of a request.
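The capture workflow just described can be sketched as a script. Tracing needs root and a real device, so /dev/sdb here is a placeholder and the commands are guarded, making the sketch safe to run anywhere:

```shell
# Guarded sketch of the blktrace capture workflow; /dev/sdb is a placeholder.
status=skipped
if [ "$(id -u)" -eq 0 ] && [ -b /dev/sdb ] && command -v blktrace >/dev/null; then
    # Mount debugfs if it is not already mounted.
    mountpoint -q /sys/kernel/debug || mount -t debugfs debugfs /sys/kernel/debug
    # Trace for 10 seconds; writes sdb.blktrace.<cpu>, one file per CPU.
    blktrace -d /dev/sdb -w 10
    # Name only the first file; blkparse finds the sibling per-CPU files.
    blkparse -i sdb.blktrace.0 > trace.txt
    status=traced
fi
echo "$status"
```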
So when a request happens, whether it's a read or a write, if it's on a partition or on a device above another device, there will be a remap (A) event that says what origin sector and offset it came from and what sector it's being mapped to. There may be a chain of them, say for a logical volume on top of a partition on top of a device. After any potential remaps, the request is queued (Q). There's a get-request (G) event, where a request structure is allocated. These all happen before the request is inserted (I) into the queue, and that can be multiple queues: the Linux kernel from 5.0 on is definitely multi-queue, and on earlier kernels it depends on the device type. Then the request is dispatched (D) to the driver, and then completed (C). That's the one where you know the I/O completed, and you'll see a status code as to whether that request succeeded or failed. For bio-based devices, you'll see fewer of these events: A, Q, C. So this progression of actions, A Q G I D C, is usually what you want to see for a single request. It makes it easier to memorize that progression, because it's the standard thing to look for in a block trace. I have some quick real-world examples. So this is what a relatively short block trace looks like. You'll see multiple events happening, and you may see some events in between. You'll see A Q G I D C for sector 22403200, but keep in mind you may see other sector locations interleaved in between, because there may be multiple requests in flight at any one time. And you may have to search for things; you may have gigabytes of trace data and want to find one specific event, so you can use tools like grep to search for a specific block address. Now, the RWBS category: those are the operation flags. Classically, it was read, write, barrier, sync.
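Grepping for one request's sector, as mentioned above, pulls its whole A Q G I D C lifecycle out of a big trace. A sketch, using hand-made lines in the shape of blkparse output, with an unrelated request interleaved at another sector:

```shell
# Hand-made imitation of blkparse output (not from a real trace).
cat > trace.txt <<'EOF'
8,0  3  1  0.000000000  1234  A  WS 22403200 + 8 <- (8,2) 22401152
8,0  3  2  0.000000210  1234  Q  WS 22403200 + 8 [dd]
8,0  5  1  0.000000350  5678  Q   R 10485760 + 64 [fio]
8,0  3  3  0.000000400  1234  G  WS 22403200 + 8 [dd]
8,0  3  4  0.000000600  1234  I  WS 22403200 + 8 [dd]
8,0  3  5  0.000000800  1234  D  WS 22403200 + 8 [dd]
8,0  5  2  0.000090000  5678  C   R 10485760 + 64 [0]
8,0  3  6  0.000032800  1234  C  WS 22403200 + 8 [0]
EOF

# Only the six events for this one request remain.
grep ' 22403200 ' trace.txt
```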
But after kernel 2.6.37, long ago now since we're up to 5.2, block barriers were replaced with flush and FUA. A flush causes the entire device cache to be flushed, so that all writes before it are satisfied. FUA is "force unit access": a specific request can ask to be written directly to the media, not going through any cache. This is commonly used by file systems to write metadata and be sure it's getting written directly to the media, not sitting in a cache, just in case there's a crash in between; it gets to the medium as soon as it can. Those get higher priority than regular writes, and they're more strenuous than regular writes because they can't really go through the cache, but most of the time they're critical, because you want them written as soon as possible. As for the order in which you see the flags: if it's a flush, the F will be at the beginning, and then there will be one of discard (D), write (W), or read (R). If it's none of those, you'll see a capital N, but that's only for things like queue plugging and unplugging. Then, after the read, write, or discard, you may see force unit access (F), read-ahead (A), synchronous (S), or metadata (M). Metadata is a fairly new flag, within the last two years, I think; I forget the exact version it landed in, but it's newer than the other three. Now, the caveats of blktrace. As I said, it's possible for multiple I/Os to be performed on one storage device at any time, so be ready to pipe the output of blkparse to less, or into a text file, and search through it, grep through it. I had a test that produced a 16-gigabyte block trace file, and that's quite a lot; you definitely want to write that to a file before reading through it. And high-volume tracing runs should save to some separate storage, ideally not the device under test, and fast enough to handle the I/O. The demands are not that bad.
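The RWBS flag ordering described above can be decoded mechanically. Here's a small illustrative helper, not part of blktrace itself, that expands an RWBS string following those rules:

```shell
# Decode an RWBS string: optional leading F (flush), one operation letter
# (D/W/R/N), then optional trailing flags F (FUA), A (readahead), S (sync),
# M (metadata). Illustrative only.
rwbs_decode() {
    rwbs=$1; out=""
    case "$rwbs" in F*) out="flush"; rwbs=${rwbs#F};; esac
    case "$rwbs" in
        D*) out="$out discard";;
        W*) out="$out write";;
        R*) out="$out read";;
        N*) out="$out none";;
    esac
    rwbs=${rwbs#?}
    while [ -n "$rwbs" ]; do
        c=${rwbs%"${rwbs#?}"}       # first remaining character
        case "$c" in
            F) out="$out FUA";;
            A) out="$out readahead";;
            S) out="$out sync";;
            M) out="$out metadata";;
        esac
        rwbs=${rwbs#?}
    done
    echo "${out# }"
}

rwbs_decode WS      # prints: write sync
rwbs_decode FWFS    # prints: flush write FUA sync
```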
But when blktrace finishes, there'll be a report at the end that says how many events it dropped, and if that's non-zero, then you need to either get a faster device or adjust the buffer settings of blktrace, which is relatively easy. One other caveat with blktrace: you'll see that the timer starts on the first event; the first event says zero for the timer. That is not the time you started blktrace; it's actually the time of the first I/O. That can be good or bad depending on what you want the time index to be. Most of the time you want the time index to be the first I/O, but if you want it synchronized to a specific time, you may have to coordinate that and find out when the actual request was submitted. So, the remaps, the A records in a block trace. Remaps can occur from a partition, which you'll see as a sub-device; in this case it's 8,2, /dev/sda2. You'll see right here there's a record for 8,2, sector 221903800. If you read from top to bottom, you can follow the sector progression, and you can see that the sector in the A record ending in 3800 is actually for the device below, not for the partition. So you can trace from top to bottom to find the actual record. But the partition's block number, or in this case the device mapper LVM logical volume's block number, 25030, is what you'll want to look at for your application using that specific device. So keep in mind the device stack, whether you have a partition and then a logical volume on that partition, and write down the major and minor numbers for each device. Now, while writing data, the scheduler can actually merge I/Os together to make one larger I/O, say for an I/O-bound device, to get better performance. The most common merge operation is a back merge, where you have, say, a write to block 57520. Lots of numbers there.
But say you have block 0, then block 8, then block 16: the scheduler can merge one back against the other, so two 4-kilobyte writes become one 8-kilobyte write, and if another adjacent write comes in, it can merge that one too, making a longer sequential write. So you'll see a progression of A Q G, then A Q M, A Q M, on and on, and then the request is inserted into the queue and eventually completed, and you'll see a larger sector count at the end. In this example, it's eleven 4-kilobyte write requests merged into one 88-sector request: 45,056 bytes. So you will not see a one-to-one correlation between the A and Q events and the completions, because the scheduler is merging them together. There's another behavior that can happen, called splits. This happens when a device has a non-zero optimal I/O size, say 4 kilobytes; the scheduler has to split a larger I/O into multiple parts. You'll see an X record for that, with a start and an end sector number at the end, separated by a slash, and each of the pieces will have an individual completion record. I have the first one here; there are others afterwards. Now, I/O errors. This is actually very interesting, because since the completion event has an error code at the end, you can find where a device fails. In this simple test, we have a device from the scsi_debug module, which has an option, opts=0x2, to enable failures on a specific sector. That created a device, /dev/sdb, and then I ran a command to write to sector 0x1234, which is 4660, the sector that fails, for 10 sectors. When it tries to write it, dd reports the error "no data available", and you can see it here: it says 65475, which would be -61. And if you look in the errno.h header, you can see 61 is ENODATA, "no data available".
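Recovering the errno from that completion status is simple arithmetic. The 65475 from the example looks like -61 wrapped around as an unsigned 16-bit value, so subtracting 65536 (2^16) gives it back:

```shell
# Completion status from the example trace, printed as an unsigned value.
status=65475
errno=$((status - 65536))   # reinterpret as a signed 16-bit value
echo "$errno"               # -61; errno.h: #define ENODATA 61 ("No data available")
```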
So during a trace, if you have a completion event with a non-zero status code, that may indicate there was an error somewhere. But usually you're coming from the other direction: you'll see your application report some kind of error, usually EIO, an I/O error, or ENODATA for this specific driver, and then you can find the specific event where it happened. And there's an extra option, which I didn't list the command for here: blkparse -t, which lets you track I/O latency. After the "sector + length", in this case 0 + 8, there's an extra field in parentheses: the latency, in nanoseconds, between each of the stages. For the I record, it's the delta since G, the get request; for the D record, dispatch, it's the delta since the insertion into the queue; and for C, it's the completion latency since being dispatched. So you can see here 32,800 nanoseconds, 32.8 microseconds, and this is a 3D XPoint NVMe solid state drive, which I expect to have that kind of very low latency. And you can actually take this data and graph it. If you filter on just those C records, look at those fields with parentheses, and filter them out with some more tools, tr and awk (here I'm listing the first five), you can create an X-Y graph with the time of the trace and the latency of the I/O. Save it into a text file, import it into a graphing program, and create a latency graph like this. So this is not just some average, or a percentile like the 99th percentile; you can actually see patterns here. You can see bursts of latency on specific runs, then the latency going down; you can see time intervals on the drive of the specific I/O pattern.
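The tr-and-awk filtering just described can be sketched like this, extracting "time latency" pairs from C records. The trace lines are hand-made samples in the shape of blkparse -t output, not real data:

```shell
# Hand-made sample lines in the shape of blkparse -t output.
cat > trace.txt <<'EOF'
259,0  0  1  0.000100000  811  C WS 0 + 8 (32800) [0]
259,0  0  2  0.000200000  811  C WS 8 + 8 (30100) [0]
259,0  1  1  0.000250000  812  D WS 16 + 8 (1500) [dd]
259,0  1  2  0.000300000  812  C WS 16 + 8 (35600) [0]
EOF

# Keep only C records, strip the parentheses, and print time and latency,
# ready for import into a graphing program.
awk '$6 == "C"' trace.txt | tr -d '()' | awk '{print $4, $11}' > latency.txt
cat latency.txt
```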
That could actually indicate things like a drive's microcontroller behavior. If you see some strange periodic latency behavior, or a strange 99th-percentile high latency reading in fio, it might be related to some kind of time-related pattern, and this makes it a lot easier to look at thousands or millions of data points. And there's another program called iowatcher, which can create a graph or a video of I/O activity from blktrace data. It shows reads and writes, the locations of reads and writes, throughput, and latency. I have an example here. This is an iowatcher run of a VDO volume that's writing to specific areas. During this activity you can actually see writes in blue. Right here is the first sector of the device, sector zero; the display chases the writes and then goes up one line at a time in a raster fashion, all the way to the end of the device at the top right. In the write activity we can see here, we have the UDS index writing sequentially, we have data slabs writing here, there's the slab summary, the slab journal and reference counts; yes, that's the smaller slab journal. Each of the slabs in this VDO volume, I believe, is two gigabytes, the default setting. And at the very top is a circular log called the recovery journal. When this is played, you can see behind here that the block map is writing in what looks like a checkerboard pattern; that's actually a radix tree that's spaced out at a certain interval, filling with a sequential write. Then those same blocks are written by a random overwrite, and you can see a large burst of activity across a very large area. I know ahead of time that that's actually something called the block map period, where it's updating the entire block map. So you can look at large-scale patterns. I forget how big this trace file was, but it probably would have been millions of trace events.
And instead of trying to correlate those numbers as text, you can represent them visually, in compressed time, and start to associate patterns, and look at things like queue depth and throughput. I want to try to contribute to iowatcher to fix a few bugs here and there. There was actually one of these videos, maybe not this one but one about the same size, a 16-gigabyte set of blktrace data, that originally took 12 hours to build into a video. iowatcher uses SVG and Ogg Theora: it renders multiple SVG graphs, rasterizes them to PNG, and puts them into an Ogg Theora video. It originally took 12 hours to compile, but after a bug fix it went down to 10 minutes. The fix had to do with how it counted events in the blktrace data; I believe it no longer counts Q events. But these are the kinds of scale issues that happen when building the video from so many events. So that's iowatcher; it's a very interesting tool. There's an older one called seekwatcher that was Python; iowatcher is C, I believe, and I believe it was Chris Mason who built it originally. There are other blktrace-related programs that come along in the blktrace repository and in the blktrace package. btrace performs live tracing for block devices; it's a shortcut for running blktrace to standard output and piping that into blkparse reading standard input. There are some caveats to doing that. You can live-trace, but I would not recommend it for heavy, heavy loads, because there may be time discontinuities or dropped events: instead of buffering to a storage device for the trace files, you're going through standard out and standard in, so there may be issues. Most of the time the priority is more to have the log data stable than it is to see it live.
So you can try btrace, but I recommend using regular blktrace to a file for most testing purposes. Then there are some other utilities. verify_blkparse verifies the time ordering of output from blkparse, which is not necessarily guaranteed for some runs; most of the time, I believe, file output is consistent, but I've encountered areas where it's not, which is why I usually graph using the time value in the events themselves. Then there are programs called btrecord and btreplay, where you can actually replay the I/O activity captured by blktrace; I have yet to try them. They won't replay the data itself, because blktrace preserves the time, the type of operation, and the sector and offset, but it doesn't know what the data was; that would take far too much space to save. And btt lets you analyze block I/O traces produced by blktrace for various statistics; it's another accessory program. So, in summary: blktrace and blkparse are the lowest-level programs for finding out what's going on with block devices, and the output is a relatively easy-to-parse text format. It looks very intimidating, but it's very, very greppable. You can use grep, sed, and awk to process the basic text events, export them to other formats, and feed them to graphing programs. There are other programs that can do filtering, and there's a new patch to blkparse that can filter by event type internally, so you don't have to do it with grep, sed, and awk. And large amounts of data can be analyzed; it all depends on how much space and processing time you have, but it's relatively easy. So, any questions? Yes. So the question was, in relation to performance, is this sort of system-level, or... This is the lowest level you can analyze: what is going on at each individual sector. It may be like looking through a microscope at one tree instead of looking at the forest.
So there may be other tools you'll want to look at if it's a general performance problem. You may want to try blktrace if it's happening on specific sectors. That actually happened in a case I had once, where a customer had multiple devices concatenated in an LVM logical volume and they were wondering why the part at the end was slow. They knew the first device was a slower device, but it actually turned out that that device was at the end, and we could have seen that with blktrace, with the latency times there. Though there's another tool called dmstats, device mapper statistics, where you can actually look at specific regions and get a histogram of the latency or the throughput for that specific region rather than the entire device. So you'll want to use a specific tool based on what granularity you're looking at, and blktrace is the finest granularity; it's per-sector. To aggregate those stats, you'd need to take averages of those numbers, which other existing tools already do. But if you're looking at specific I/O from specific programs, or I/O to specific regions, or flush or FUA sorts of activity, then you'll want to catch those flags and watch that specific behavior. So it all depends. You'll have a good hint from the context as to what level of detail you want to check. If it's the whole device, you could use iostat for that kind of thing. Or you could simulate the I/O with fio and use its histogram output for that, or you could use dmstats if it's a device mapper device and have histograms printed out by that. So I hope that helps. Any other questions?