Hello, I'm Bryan Gurney, a senior software engineer. I work mainly on VDO, the Virtual Data Optimizer, but I also had a side project: a block storage test device.

So, quick question here: who knows what a block storage device is? Hands? OK, about half of the people. A block storage device is any device that can read and write data in blocks of a given size. Classically it's 512-byte sectors, or a multiple of 512 bytes for some interesting cases; there are some devices that are 4-kilobyte native, which can cause specific bugs here and there. Examples of block devices are hard disk drives with rotating platters; solid state drives, built on non-moving chips, flash memory, NAND-based or now even 3D XPoint-based, or some other medium; and tape drives, which we will not talk about, but which we will fondly remember for all the trouble they gave us. I used to be a system administrator who had to deal with tape, so I had to put that in there.

So what is a block storage test device? A test device exhibits some kind of behavior simulating a certain failure mode, a performance problem, or some other condition that may instigate bugs in other drivers, software, or applications, or scenarios that a customer might run into.

Here are some examples of block storage test devices in the Linux environment right now. The kernel has scsi_debug, which simulates the behavior of a SCSI disk down to the protocol level of the T10 SCSI behavior, so things like your reads and writes and sense codes. And there are some device mapper targets. One I don't have listed here is dm-error, which returns an error for every block of the device; there are cases where you switch a table to error when you want to make sure you stop I/O to a specific device. dm-flakey alternates between error and linear behavior over a specified time period, and you can specify whether it misbehaves on all reads and writes, just reads, or just writes. There's also dm-delay, which adds latency to reads, writes, or both, with a specified delay.

There are some other, more advanced test devices that the VDO development group has worked on, which I'll reference here. They're internal right now; hopefully someday they can be open sourced. One of them is a very simple target that sets the REQ_FUA bit on write bios at a given frequency, which is given as a table parameter, I believe. So in the map function: if the data direction is write, and the frequency is not zero, and the counter hits the specified frequency, then turn on REQ_FUA; otherwise turn off REQ_FUA; then return DM_MAPIO_REMAPPED. And for extra fun, change that bit to REQ_PREFLUSH and send 200,000 flushes to your device; I did that once for a test to see how the drive would behave.
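As a rough illustration, that FUA target's map function has the following shape. The struct, its field names, and the counter logic here are my own reconstruction of what was described, not the internal code:

    #include <linux/device-mapper.h>
    #include <linux/bio.h>

    struct fua_ctx {
            struct dm_dev *dev;        /* underlying device */
            unsigned int frequency;    /* table parameter: tag every Nth write */
            unsigned int count;        /* writes seen so far */
    };

    static int fua_map(struct dm_target *ti, struct bio *bio)
    {
            struct fua_ctx *fc = ti->private;

            bio_set_dev(bio, fc->dev->bdev);

            if (bio_data_dir(bio) == WRITE && fc->frequency &&
                (++fc->count % fc->frequency) == 0)
                    bio->bi_opf |= REQ_FUA;   /* force unit access on this write */
            else
                    bio->bi_opf &= ~REQ_FUA;

            return DM_MAPIO_REMAPPED;         /* hand the bio back to the block layer */
    }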
There's another device we used, a forgetful device, which honors flushes but simulates the behavior of lost writes after a crash. You can also program a partial forgetful cache, with a bit mask and a modulus. And there's a status field in sysfs for this target that lists which blocks are dirty in the cache at a given time, so you can tell it to stop and then simulate a crash. We've had problems before where drives somehow tear writes, and we don't know where they were torn, or when, until after the fact. This helps because you can simulate that pattern and find bugs that way.

There's another test device, a tracer device. This may look like a foreign language; it's blktrace output. This is the time, that's the process ID, I think, and then queued and completed events. So these are sync writes being queued and completed, but in between there are messages where the tracer reports the data hash of the block, so you can know whether the data has changed from a specific value or not. With the forgetful device and the tracer device together, we were able to find a bug in XFS in late 2017; I believe it was Brian Foster who fixed that bug in the XFS log.

So there was a test I was performing on a VDO volume, and the drive developed a bad sector. That triggered an issue: you can see here the read reports an I/O error on that specific sector, an unrecovered read error. And there was an issue with the index: it got the input/output error and stopped, but its status still reported it was running. So when I filed the bug, the question came back: well, how can you reproduce this? And my answer was: you'll need a drive with a bad sector, and you'll need to know when the bad sector appears. So I asked myself: how can I do that?

With the existing tools, I could reproduce it with scsi_debug, but only if I simulated failures on reads of sector 0x1234, for 10 sectors. Those values are hard-coded in the source, so I would have to recompile the kernel if I wanted to move that region. Also, scsi_debug's storage is RAM-based, and to create a VDO volume you need about five gigabytes for metadata, which is larger than most file systems, where you can create test devices in the tens or hundreds of megabytes. And the medium error flag can only be on all the time or off all the time; you can't set when it starts failing blocks, which means it's not easy to lay down base metadata for a test.

So then I thought: how would I improve on that? I started from the FUA target from before, which is relatively simple, and worked on the idea of a device that simulated the behavior of bad sectors. The general goals were: choose arbitrary blocks to fail; enable that at a chosen point in time; persist data across reboots; emulate a remapped sector after writes; and avoid having to change the table, which may or may not change elements of the test. Admittedly, the table-change mechanics were a little unfamiliar to me, but there was an interface in these test targets where you can send a message to the target as a command input. So I started a proof of concept of this target: basically, if the sector is 42 and it's a read, and the device is enabled and failing on that bad block, then return DM_MAPIO_KILL to simulate a failure.

Later I found help: I sent a message to somebody who was working on one of the other in-tree test targets about the idea. He said: that's great, I know how you can make the bad blocks arbitrary, and I can send you patches for this. I said: yes, please do, because this would have taken me a while to program.
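Reconstructed, that proof of concept boils down to a few lines in the map function, plus a message hook to flip the enable switch; the struct and function names here are my own illustration, not the actual patch:

    #include <linux/device-mapper.h>
    #include <linux/bio.h>
    #include <linux/string.h>

    struct dust_ctx {
            struct dm_dev *dev;
            bool fail_read_on_bb;   /* flipped by "enable" / "disable" messages */
    };

    static int dust_map(struct dm_target *ti, struct bio *bio)
    {
            struct dust_ctx *dc = ti->private;

            bio_set_dev(bio, dc->dev->bdev);

            /* fail reads of the one hard-coded bad block while enabled */
            if (dc->fail_read_on_bb && bio_data_dir(bio) == READ &&
                bio->bi_iter.bi_sector == 42)
                    return DM_MAPIO_KILL;     /* complete the bio with an I/O error */

            return DM_MAPIO_REMAPPED;         /* otherwise pass through */
    }

    static int dust_message(struct dm_target *ti, unsigned int argc,
                            char **argv, char *result, unsigned int maxlen)
    {
            struct dust_ctx *dc = ti->private;

            if (argc == 1 && !strcasecmp(argv[0], "enable")) {
                    dc->fail_read_on_bb = true;
                    return 0;
            }
            if (argc == 1 && !strcasecmp(argv[0], "disable")) {
                    dc->fail_read_on_bb = false;
                    return 0;
            }
            return -EINVAL;
    }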
So there was a list I wanted to build for the bad blocks, to be able to arbitrarily add and remove them through the message interface, where you send a message to the device, for example "dmsetup message dust1 0 addbadblock 60" or "dmsetup message dust1 0 removebadblock 60". The "dust1" is the device, and the 0 is the offset, to indicate which part of the table you're addressing; usually I'm creating one target, so it's always zero. So it's a relatively easy interface to understand.

Normally the device passes I/O through, until you want to enable the point of failure. There's a message to enable it, and the target reports that in a kernel log message, so you know at what time you enabled things. You can read that alongside the messages of your driver or your application, to see when its failure messages appear; that's very important, because the time when things happen can add context to whether a test case passes or fails. And then "dmsetup status" will report the state: "fail_read_on_bad_block", or, I believe, "bypass".

So, failing reads. If we have a dust device with sector 67 in the bad block list, and the device is enabled, then when you try to read from that block, the application will report an input/output error, and you'll see zero records in, zero records out. That's a case where you've successfully emulated a bad sector for a read.

For writes, this device removes blocks from the bad block list when you write to them. That emulates the behavior of a drive remapping a sector, so subsequent reads will succeed. So instead of a simple target that unconditionally fails reads, this emulates what an actual drive does in that scenario. In this example, I add sectors 60, 67, 72, and 87, and then enable; and in the kernel log you can see it report each block as it's removed from the bad block list.

Real drives are not so nice: they won't tell you when they're remapping sectors. SMART will tell you how many, but it won't say where, and it won't say where they were remapped to. The classical knowledge for drives is: when you see the remapped sector count start to increase, you should probably replace that drive. But here you're trying to simulate that condition, and you don't want to be sitting there with an old drive, not knowing where a sector is going to fail, because then it's hard to reproduce the problem. With this message, you can see exactly when, and which block, is removed.

This can be silenced with a "quiet" switch, because I thought: there's a case where you could have many of these blocks, and then you don't want spam in your log. So that quiets down the message, and you can do large tests with thousands of blocks in the bad block list.

And this was sent upstream. Let me see if I can bring it up... there it is. I sent this, I believe, on January 7th; this is the email I sent to dm-devel, the device mapper development list, with the general description of the device and the contributors who helped me build it. Afterwards, I got remaining work to do; that's the back and forth of developing an upstream target. They asked: why not an arbitrary block size for the device? It was setting the minimum I/O size and optimal I/O size to the device's block size, which was 512 bytes or, optionally, 4K; could it be an arbitrary size?
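For context on that feedback: a device mapper target advertises those limits to the block layer through its io_hints hook. Here's a minimal sketch, with dc->blksz as my own name for the configured block size; this is an assumption about the shape of the code, not the actual dm-dust source:

    static void dust_io_hints(struct dm_target *ti, struct queue_limits *limits)
    {
            struct dust_ctx *dc = ti->private;

            /* report the emulated sector size (512 or 4096) to the layers above */
            limits->logical_block_size = dc->blksz;
            limits->physical_block_size = dc->blksz;
            blk_limits_io_min(limits, dc->blksz);
            blk_limits_io_opt(limits, dc->blksz);
    }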
So I'm working through those now to figure out the next steps. When I tried some of those sizes, I found an issue where it was only failing the first sector of the block, and then I said: OK, what do I have to do now? That's the development process: you keep refining, make sure the thing still does what it's supposed to do, but make it acceptable to be brought upstream. The mailing list it's on is dm-devel. Any questions?

[Question: how many bad blocks can the list hold?] As many as the device has, really. The other thing on my mind is how much memory the list takes. It shouldn't be that much, but it's an amount that will grow; there's no limit to it, but you can count how many bad blocks there are.

[Question: can it simulate a failed remap?] Oh, the failure, OK. Yeah, there's no simulation of a failure to remap. There's a question of what that behavior would even be in terms of a device mapper target. It might just leave the block in the list and return EIO on the write, though then there are questions of how we'd simulate that, or what other parts of device mapper might have to change to support it. That would be a good thing for a more advanced target; it may be outside the scope of what I can do. Actually, when I first wrote this, I was hoping I could return ENODATA, but there's no provision for that: the DM_MAPIO_KILL return code was put in to replace returning an arbitrary error from the map function. On an older kernel I was trying to send ENODATA, but it still appeared as EIO, so I backed off and said, OK, I'll just leave it for now. Those are the current limitations, because this may be sort of a new frontier of building more advanced test devices on device mapper, and it's interesting to ask what cases we need to test. That's one of the things I was thinking about: what if a remap fails on a drive? Then what do you do? From the application side, it'll see EIO. Yes? OK, OK. For the microphone: that was Heinz Mauelshagen; he said it may be possible with some sort of a counter.

[Question: what about tape devices?] Yeah. Oh, OK, yeah. That would be a different SCSI device; a real tape device uses the st driver instead of sd for SCSI disk. So there's a question of how you simulate the other parts of that kind of device. That might be something like a "SCSI tape debug", which would be an interesting thing in terms of building a test device for more faithful emulation of that device.

[Question: block sizes?] Yeah, for the original block size, I was targeting the classic block sizes of block devices, 512 and 4096. There are definitely... I have to test to make sure it still works, but there are questions like: could it be 32 kilobytes? Could it be 128 kilobytes? Though then there's the question: do those devices actually exist?

Any others? Yeah. OK. Yeah, using dm-switch to have a linear section, then a dust section, and then maybe an error section. Yeah, yep.

Yes, 512 emulation, yeah. For the microphone: he was saying that a lot of drives do 512-byte emulation now. The access pattern is still 512, but behind the scenes the drive is actually running on 4-kilobyte blocks. So you might have that entire 4K block affected by a remap, which is kind of frightening in terms of what happens with the remapped block, because the layers above may not know; but there would be a sign of at least some problem. I've seen drives with both, yeah.

Any other questions? OK. Yeah, I'd definitely be interested in looking at a limits target. So, thank you.