We have Ken Westerback from Toronto, Canada, my fellow Canadian, presenting "From Blocks to File Systems to Booting: How OpenBSD Makes Bags of Blocks Useful."

Thank you. First I want to say, don't be too scared by that number 64. One, we have lots of time, and two, there are lots of pictures involved, so we should be able to get through in the normal amount of time. What I'm going to talk about today is the various levels of usefulness you can extract from a block device, and the tools OpenBSD uses to achieve those different levels of usefulness. There are basically four parts: a brief introduction; a gradual build-up through the higher levels of usefulness; a brief mention of various ideas that have been suggested as future improvements; and a conclusion.

So what we're going to be discussing is the data structures that are used to tame block devices: in particular the disklabel, which describes the partitions that contain file systems; the GUID partition table, or GPT as it's called; the master boot record, the MBR; and the partition boot record, the PBR. We're not going to talk much about that last one because it's mostly to do with floppy disks.

We'll talk about some of the kernel functions that use those data structures. There are machine-independent, or architecture-independent, ones that apply on all the systems we support: readdoslabel and checkdisklabel are the ones that actually read the disklabel and validate or construct it. There are the machine-dependent readdisklabel and writedisklabel, which read and write the disklabel on each architecture. This talk is mostly about the machine-dependent ones. They are almost all identical, but three or four are different, and we're not even going to bother trying to address the differences. And there are also entry points in the various device drivers that provide block devices, in particular open, getdisklabel, and some ioctls.
And finally, some of the user programs that are necessary to manipulate the various data structures and to sprinkle the pixie dust necessary to actually boot. I'm not going to be talking about extended MBR partitions, other or multiple operating systems, CDs, networks, or DVDs. Or, as I said, the peculiarities of sparc64, macppc, hppa, and alpha, all of which have their own very unique approaches.

A couple of important definitions to keep in mind. Throughout the kernel there's a block, which is DEV_BSIZE, defined as 512 bytes. The kernel sees all devices as a collection of these blocks and uses a data type called daddr_t to address those blocks on the device. Actual devices usually have the same size of block these days, but an increasing number have a larger sector size, the other popular one being 4096 bytes. And a partition is a contiguous sequence of sectors, which can have different meanings at different levels.

So if the kernel detects a block device during its probing process, userland programs can use it without any further configuration. There's a sysctl, hw.disknames, that will list all of the block devices that have been discovered and their unique DUIDs, if in fact there is a DUID on the disk to display. You can use that DUID, which is portable across wherever you happen to plug in the disk, in most places that ask for a device name. The most common place is an fstab entry, where you specify the DUID and the partition (.l in this case; that's not a 1, that's a .l) and where you're going to mount that particular file system.

There are in fact six block devices: cd, CDs, which we're not going to talk about; fd, floppy disks, which we're not going to talk about; and rd, the RAM disk that you use during the install process, which we're also not going to talk about.
Then there's sd, the SCSI devices, which are the vast majority of actual block devices that appear; vnd, the device you can use to create a block device out of a file, which is used to build various boot images; and wd, the old ATA interface, which is, thank God, increasingly rare.

Enough information is provided by the block device for the kernel to construct I/Os: the number of sectors that the device contains and the size of a sector, because the kernel has to translate between sectors and blocks. It thinks in blocks, but the device thinks in sectors. And during the probe process, when getting the initial disklabel, it creates a raw partition, which is traditionally c, and that is just a single partition covering the entire disk, from zero to the last addressable sector. That information is provided in a struct disklabel, which is 512 bytes long and can actually describe up to 16 partitions, which we'll get into later.

In order to obtain that information about the block device, there are three useful ioctls. DIOCGPDINFO (physical, I think, is what the P stands for) returns the basic default information: the number of sectors, the size of the sector, and the raw partition. DIOCGDINFO just gets the information that has been most recently cached by the kernel, and DIOCRLDINFO reloads that cached information. DIOCGPDINFO and DIOCGDINFO are obviously the ones most often used.

If you wanted to actually use that block device, you would use pseudocode something like this. Being an OpenBSD program, you would of course pledge the minimal number of POSIX subsets that you're going to use in your program. You would unveil, which is to say restrict I/O to, the raw device. You'd open it, you'd get the disklabel that tells you the parameters you have to work within, then you'd reduce the pledge to the absolute minimum, and then you'd go into your work process.
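The block-addressed work loop described here can be sketched in Python. This is only an illustration under stated assumptions: Python has no bindings for pledge(2), unveil(2), or the disklabel ioctls, so those steps are omitted, and an ordinary temporary file stands in for the raw device. The helper names read_block, write_block, and demo are hypothetical; only the 512-byte DEV_BSIZE comes from the talk.

```python
import os
import tempfile

DEV_BSIZE = 512  # kernel block size, as stated in the talk


def read_block(fd, blkno, count=1):
    """Seek to block number `blkno` and read `count` whole blocks."""
    os.lseek(fd, blkno * DEV_BSIZE, os.SEEK_SET)
    return os.read(fd, count * DEV_BSIZE)


def write_block(fd, blkno, data):
    """Seek to block number `blkno` and write whole blocks only."""
    if len(data) % DEV_BSIZE != 0:
        raise ValueError("data must be a multiple of DEV_BSIZE")
    os.lseek(fd, blkno * DEV_BSIZE, os.SEEK_SET)
    return os.write(fd, data)


def demo():
    """Use a 64-block temporary file as a stand-in for a raw device."""
    with tempfile.TemporaryFile() as img:
        img.truncate(64 * DEV_BSIZE)      # pretend the "disk" has 64 blocks
        fd = img.fileno()
        write_block(fd, 10, b"\xaa" * DEV_BSIZE)
        return read_block(fd, 10), read_block(fd, 11)
```

On a real system the file descriptor would come from opening the raw device, and the size limits would come from the disklabel rather than from a hard-coded block count.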
You can just lseek, read, do stuff, lseek, write. You have the whole disk available to you. And of course that means you can write a Turing machine, which means you can compute anything. Cumbersome, but theoretically possible.

Eventually that becomes a bit too cumbersome. And when that happens, you can abstract all of those blocks, as the kernel sees them, into a file system. Creating a single file system utilizing all of the sectors on the device is fairly straightforward. You just newfs the raw device. You mount it on some mount point. If you want it to be automatically recognized, you edit your fstab. And then you can do whatever you want. Job done.

Well, as far as accessing the blocks goes, it is. But the device may have a GPT, MBR, or PBR on it with information subdividing that disk into potential file systems that you may want to continue to use from OpenBSD. So the getdisklabel function in the MI area, or the disk driver area, calls an MD function, readdisklabel, which calls the MI function readdoslabel, which is the key function. That checks the device for a GPT, MBR, or PBR and adds to the default disklabel up to eight extra partitions reflecting what it found on the disk. This is useful if you're handed a disk that has various file systems, MS-DOS file systems or whatever, that you want to use and then send on to someone else: you don't have to write anything to the disk. You just use what's there. The OpenBSD partition, which is type A6 if it's an MBR or, much easier to remember, a GUID if it's a GPT, is not spoofed. If there is an OpenBSD partition, that's handled separately.

So what happens is that sector zero is read. The readdoslabel function first checks to see if it is in fact a GPT. It checks for a GPT first because that's the most definitively defined structure. It's got checksums and all this other stuff, which the other options do not.
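To make the point about checksums concrete, here is a minimal Python sketch of validating a GPT header's signature and CRC32. It follows the published UEFI header layout (signature at byte 0, header size at byte 12, header CRC32 at byte 16, computed with the CRC field zeroed). It is not the readdoslabel code, which performs further checks not shown here, and the function names are hypothetical.

```python
import struct
import zlib

GPT_SIGNATURE = b"EFI PART"
GPT_MIN_HEADER = 92  # minimum header size in the UEFI specification


def gpt_header_is_valid(sector):
    """Check the signature and header CRC32 of a candidate GPT header."""
    if len(sector) < GPT_MIN_HEADER or sector[:8] != GPT_SIGNATURE:
        return False
    header_size, stored_crc = struct.unpack_from("<II", sector, 12)
    if header_size < GPT_MIN_HEADER or header_size > len(sector):
        return False
    header = bytearray(sector[:header_size])
    header[16:20] = b"\x00" * 4          # CRC is computed with its field zeroed
    return (zlib.crc32(bytes(header)) & 0xFFFFFFFF) == stored_crc


def make_test_header():
    """Build a 512-byte sector holding a minimal, self-consistent header."""
    hdr = bytearray(GPT_MIN_HEADER)
    hdr[0:8] = GPT_SIGNATURE
    struct.pack_into("<II", hdr, 8, 0x00010000, GPT_MIN_HEADER)  # revision, size
    crc = zlib.crc32(bytes(hdr)) & 0xFFFFFFFF
    struct.pack_into("<I", hdr, 16, crc)
    return bytes(hdr) + bytes(512 - GPT_MIN_HEADER)
```

An MBR offers only a two-byte signature to test against, which is why a structure with a checksum is the safer thing to look for first.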
If it finds a GPT, it just starts reading the partitions and adds up to eight of them to the disklabel, as partitions i to p. If it doesn't find a GPT, it tests whether there's an MBR. An MBR has up to four partitions, so it can add i to l, again ignoring extended MBRs. If it doesn't find an MBR, it lastly asks, does this resemble a PBR floppy format? And if it finds one, it adds a single partition covering the whole disk.

In the spoofing process, it creates partition descriptions using these fields. It pulls a type out of the GPT or MBR, and they have various kinds in either format. It builds two 48-bit sector values, which is to say the offset on the disk and the size of that particular partition. And we have two pairs of defines to get or set the offset and the size of the partition. 48 bits allows many sectors, about 144 petabytes if you're using 512-byte sectors.

Now, we ended up with 48-bit sector values because we scavenged other fields in the disklabel itself to find an extra 16 bits for each of the 16 partitions. There were some areas deliberately left spare. There were some other fields that we got rid of. We found enough to expand the partition sizes and offsets to 48 bits, but not enough to go to, like, 64 bits. The kernel is capable of handling 64-bit block addresses, so it can currently handle anything up to and well beyond what a partition can currently describe.

So to get those values, it reads the partition information from either the MBR or the GPT. In the MBR case, you have a dp_typ field and two little-endian u32 fields, a start and a size. These are sectors, not blocks. And for the GPT you have, again, a type, from a different set of types, and 64-bit little-endian sector values.
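As an illustration of the MBR case, the following Python sketch pulls the type, start, and size out of the four primary entries of a sector 0 image and checks whether values fit in 48 bits. The byte offsets are the standard MBR layout; the function names and the tuple format are hypothetical, and this is not how the kernel code is written.

```python
import struct

DOSPARTOFF = 446            # byte offset of the first partition entry
DOSPART_SIZE = 16           # bytes per entry
NDOSPART = 4                # primary entries only; extended MBRs are ignored
MBR_SIGNATURE = b"\x55\xaa"
LIMIT_48 = 1 << 48          # exclusive upper bound for a 48-bit value


def parse_mbr(sector0):
    """Return (index, type, start, size) for each non-empty primary entry."""
    if len(sector0) < 512 or sector0[510:512] != MBR_SIGNATURE:
        return []
    parts = []
    for i in range(NDOSPART):
        off = DOSPARTOFF + i * DOSPART_SIZE
        ptype = sector0[off + 4]                       # one-byte type field
        start, size = struct.unpack_from("<II", sector0, off + 8)
        if ptype != 0 and size != 0:
            parts.append((i, ptype, start, size))
    return parts


def fits_in_48_bits(start, size):
    """True if both sector values can be stored in 48-bit fields."""
    return 0 <= start < LIMIT_48 and 0 <= size < LIMIT_48


def make_entry(ptype, start, size, active=False):
    """Pack one 16-byte entry; CHS fields are left as zeros."""
    flag = 0x80 if active else 0x00
    return struct.pack("<B3sB3sII", flag, b"\0" * 3, ptype, b"\0" * 3, start, size)


def make_sector(entries):
    """Build a 512-byte sector 0 image from up to four packed entries."""
    sector = bytearray(512)
    for i, entry in enumerate(entries):
        off = DOSPARTOFF + i * DOSPART_SIZE
        sector[off:off + DOSPART_SIZE] = entry
    sector[510:512] = MBR_SIGNATURE
    return bytes(sector)
```

The 144-petabyte figure follows directly: 2^48 sectors of 512 bytes is 2^57 bytes, roughly 1.44 x 10^17. MBR values are 32-bit and always fit; only GPT's 64-bit values can exceed the 48-bit fields.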
So you can, in fact, describe in the GPT a sector that can't be mapped into an OpenBSD partition, but we have yet to encounter a 144-petabyte disk drive, as far as I know.

If you want to initialize a GPT or MBR, you use the program fdisk, with -g for a GPT or -i for an MBR. If you just want to display it, you just say fdisk and then the unit, and if you use -v you get a little more detailed information. In particular, it will show both the primary and the secondary GPT, if they're present, and the MBR, so you can tell exactly how everything has been put together. If you wish to edit, you can use the -e option.

Over the last few releases there's been a fair number of changes and enhancements, for reasons that we'll get into shortly. We recognize and can display the names of more partition types, which have been appearing and confusing people. We've had to create a specific list in fdisk saying, don't touch these partitions, you don't know what they're doing. And recently we had to make even more enhancements there. We've had to relax some overly paranoid GPT validation, in particular around the size of the disk. The GPT has the primary at the beginning of the disk and the secondary at the end of the disk, and we were checking to make sure the secondary really was at the end of the disk. But if you copy the disk image onto larger media, or vice versa, then the secondary is misplaced or cut off. We were rejecting that, but we've had to accept it. We've slightly enhanced the display of the types by saying this partition is Microsoft basic data, instead of just printing the first file system type that matched, which happened to be FAT12 and was confusing people. We did the initial implementation of GPT on amd64 and missed one little-endian to big-endian conversion when we were writing it, so when we started working on the Apple M1, people got a little confused as to why the checksums weren't working.
If we wrote them it was okay, because we did it wrong; if we tried to read someone else's, it didn't work so well. We cleaned up some help functions: the list of partition types was getting a bit too long, so we separated out GPT and MBR partition types. We no longer allow geometry editing, which was an interesting feature that caused a lot of confusion, and complications in the code, which was more concerning to me.

We recognize and display GPT partition attributes. The only attribute present when I implemented the initial GPT support was the equivalent of the DOS active flag, and I did that wrong, but no BIOS or UEFI paid any attention, so that was okay, and we've fixed it. We found that there are several new ones now, so we display the attributes, for debug purposes and nothing else, with the -v option. And we changed the values that the user can input during the creation of a GPT or MBR to be blocks, which made more sense from the kernel side and is a consistent size, instead of sectors, where a script that said create a partition a thousand units long would get one eight times larger than you were perhaps expecting if it ran on a 4,096-byte-sector device.
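The factor of eight is simple arithmetic, sketched here in Python. The function names are hypothetical; the only values taken from the talk are the 512-byte block and the 4096-byte sector.

```python
DEV_BSIZE = 512  # kernel block size in bytes


def sectors_to_blocks(nsectors, sector_size):
    """Convert a count of device sectors into 512-byte kernel blocks."""
    if sector_size % DEV_BSIZE != 0:
        raise ValueError("sector size must be a multiple of DEV_BSIZE")
    return nsectors * (sector_size // DEV_BSIZE)


def blocks_to_sectors(nblocks, sector_size):
    """Convert kernel blocks into device sectors; must land on a sector boundary."""
    if sector_size % DEV_BSIZE != 0:
        raise ValueError("sector size must be a multiple of DEV_BSIZE")
    per_sector = sector_size // DEV_BSIZE
    if nblocks % per_sector != 0:
        raise ValueError("block count is not sector aligned")
    return nblocks // per_sector
```

A script asking for 1000 units gets 1000 blocks on a 512-byte-sector disk either way, but if the unit is sectors it gets the equivalent of 8000 blocks on a 4096-byte-sector disk. Taking input in blocks makes the result the same size on both.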
What we found just recently, last week, was that one of our developers turned on an arm64 device and discovered that its default configuration had 49 GPT partitions, all of which it needed for booting except the 49th, which is the one we wanted to spoof. We noticed that they were in fact kind enough to set the required attribute, so we no longer try to spoof required partitions into those eight slots. The required attribute is meant to indicate that this is needed by the hardware to boot, and we're not in any way, shape, or form knowledgeable enough to go in and change or affect those partitions, so we don't want to waste time spoofing them.

So that's working on a device that has multiple partitions, but we still haven't written anything OpenBSD-specific on the disk. If you want to have something like softraid, RAID 0 through 5 or whatever options, or swap space, or the best file system ever, FFS, you may want to have more than eight partitions. And, remembering our 49-partition friend, you may want a different set of partitions than the default one created during the spoofing process. To do that, you have to write a physical disklabel on the disk.

When OpenBSD was asked to do that, historically it took a very straightforward approach: everything was OpenBSD, and everything else was wiped out. fdisk would create a default GPT or MBR, depending on which one you wanted. It would take all the sectors left outside of the actual partition table, round that to cylinders or whatever it thought was appropriate, and write that to disk, wiping out whatever was there. disklabel then read the default label and initialized a complete set of partitions, either from one of its built-in default tables or from a template file that you could specify with -T, and it would write that disklabel into the DOS_LABELSECTOR (which is actually a block value, not a sector value) of the OpenBSD partition itself. The kernel subsequently used the GPT or MBR to find where the OpenBSD
partition was, read the disklabel from its location in that partition, validated the disklabel, and ignored anything else in the GPT or MBR. Once it could find and read the disklabel, that's what it wanted to use, and it didn't care what else was on the disk.

disklabel is the perhaps unfortunately confusing name of the program that creates, examines, and modifies the on-disk struct disklabel. It uses an additional ioctl, DIOCWDINFO, the W for write, which is also used by newfs and growfs. You can define up to 15 partitions; the 16th is that c partition, which is the responsibility of the kernel, and you can't actually change it. fstab tells you which partitions the kernel mounts, and the disklabel program can generate those entries with the uppercase -F or lowercase -f options: one generates fstab entries using DUIDs, the other using the actual unit name. And there are two more defines that pull out and combine, into a 48- or 64-bit value, the bounds that disklabel is allowed to operate within, and you can change those if it turns out you want to expand your coverage.

Recent changes to disklabel include adding a new keyword to the template files, RAID, so that people can more easily create softraid configurations when they're installing OpenBSD. Two more fields have been garbage collected, d_bbsize and d_sbsize. One is on the way out, d_drivedata, which is not used anywhere. And the default partition sizes are always being changed as disks grow and particular partitions need more space: /usr, /usr/whatever.

Well, the modern world is unfortunately a little more confusing than it was in the past: UEFI booting in particular, and new platforms. The two that sprang to mind when I was writing this were arm64 and riscv64. Now, when you want to support a new architecture, they'll say, here's an image, and it's almost always a GPT-formatted image that has interesting partitions. It assumes that you can
dd this onto any media and it'll just work; they don't care about the sizes. They have also recently gotten into the habit of using the EFI Sys partition, which we created very small, just big enough to hold the actual boot program, to dump things that allow them to do firmware updates and all kinds of other interesting things.

So we had to enhance fdisk. We added a new option, -A, so that it would scavenge the disk for any partitions that were not on that protected list and construct the largest possible OpenBSD partition in the free space it found at the end of that. And the -b option was added, and somewhat enhanced, to create the boot partition, which is again machine dependent: usually EFI Sys in the case of amd64 and other GPT systems, and there are a few other systems that have FFS, I think landisk or loongson, one of those does that.

readdoslabel had to have its validity checks for the GPT relaxed so that we recognized these images. We were more careful about treating the GPT OpenBSD partition more like we treated the MBR OpenBSD partition. In particular, we don't really care when we're processing the GPT how big the OpenBSD partition is; we only want to know that there's at least enough at the front that we can find the disklabel. Like I say, once we find the disklabel, we don't care about what's in the GPT. We also added a feature to prevent writing the disklabel on top of data that was configured in the GPT to belong to another partition, which was problematic when we tried to do Apple disks.

The install scripts now have a somewhat more flexible approach to creating the EFI Sys partition, so it can be larger where we have identified that it is going to need that extra space. There's a lot of work going on to enhance the ability to do an initial installation using softraid. And all the install scripts now use fdisk -b instead of
manually editing, in a strange and delicate way, the creation of the boot partitions.

So again, we've now been able to create multiple partitions and file systems that OpenBSD can use. But the ultimate use of any block device, of course, is to boot OpenBSD. We have three possible paths. I'm not going to talk too much about the PBR; it's a floppy, it knows what it's doing, you know, whatever floppies do. If you're using a legacy BIOS, it goes into the MBR, which tells it how to find the biosboot program, which tells it how to find the boot program, which actually boots OpenBSD. In the UEFI case, it finds the GPT, in which you've defined an EFI Sys partition, in which there is the directory EFI/BOOT, in which there is the BOOT-something-depending-on-your-architecture .EFI file that actually runs in the UEFI space and boots OpenBSD.

For the PBR, I think, we just copy that into place and it runs. fdisk is responsible for putting boot code in the MBR, if your architecture needs that. If you're running a legacy BIOS, the BIOS reads that code and executes it. The boot code loads biosboot from the first sector, or the first block, of the OpenBSD partition, and then that calls boot. Now, biosboot is installed by something called installboot. What installboot does is look at the OpenBSD system that has been installed at that point, find out the inode location, et cetera, et cetera, of the file /boot, patch those values into the biosboot program, and then stick biosboot in the first 512 bytes of the OpenBSD partition, thus letting biosboot know where boot is so that it can run boot. And boot is intelligent enough that it can read file systems and boot the kernel and everything else.

The /usr/mdec/mbr file is where that MBR code is. It used to be installed on almost all the architectures, and it used to include i386 boot instructions, which doesn't make a lot of sense, so we've taken that out. It used to include some partition information
which again has been taken out because that's now done by fdisk. And we removed the interesting mode from ESDI days where you had to hold the shift key down while booting so that you could enter the geometry of the disk. That hadn't happened in a while, so we took it out and made the code simpler.

So, as I said, fdisk -b allocates the EFI Sys partition. It's usually, well, at all times I hope, an MS-DOS formatted file system; installboot does that formatting. We don't write into the NVRAM array that UEFI implementations have for storing boot images. We rely on it defaulting to the final choice, which is BOOT*.EFI, so there's BOOTAA64, BOOTX64, various ones. Enhancing that is an ongoing process, but right now we just rely on it going down the list and finding the default one. That can cause some interesting problems: with some other operating systems, or other changes, you can end up either not booting or getting somewhat confused. I think we already covered that part.

Other changes in the installboot part of that process: again as a result of more and more architectures storing stuff in the EFI Sys partition, it tries to preserve contents that need preserving rather than unconditionally reformatting the partition. We are slowly trying to adopt more EFI smarts, but that's an interesting process that I'm not personally involved in. And, as I said, we're trying to improve the softraid install process. So as you can see, having done that, we have in fact done everything you want to do with a block device.

A couple of important points you have to remember about where the disklabel goes. If you have a GPT, it will go at the first usable LBA address, which is in the header of the GPT as gh_lba_start, plus whatever the DOS_LABELSECTOR block value is. If it's a non-GPT disk, MBR or nothing, it will go in sector 0
plus DOS_LABELSECTOR blocks, unless, as I said, that block is being used by some other partition defined in the GPT or MBR. As of very recently, readdoslabel was enhanced to do that check, so that when writedisklabel says, I want to write a disklabel there, readdoslabel will say, well, I can't find a spot for you. It will only look in those two locations, and only after checking for an OpenBSD partition. As I said, if it finds the OpenBSD partition, it knows where the disklabel has to be, and it doesn't care what the rest of the MBR or GPT has. If you're writing a disklabel to a disk with an OpenBSD partition, it will write it into that particular block, and obviously you don't want to use that block for anything else. FFS file systems have at least BBSIZE, or 8K bytes, reserved for that purpose. That's why it's always best to have the first blocks of your OpenBSD partition be the root partition, which is an FFS file system and has that space reserved.

If you want to kill a GPT, use fdisk -i to initialize an MBR. The reason is that there's a backup GPT at the end of the disk. If you just wipe out the protective MBR, the GPT header, and the table, there are some very determined BIOSes or UEFI implementations that, even though they're not supposed to, will go and look at the end of the disk and say, oh well, I found a GPT, this is what they really want me to use. And confusion results. It has resulted.

The other interesting thing that can happen is this. If you've written a disklabel with no OpenBSD partition present, so the disklabel is stuck in that DOS_LABELSECTOR spot just after the GPT or MBR, and then you add an OpenBSD partition, that doesn't wipe out the previous disklabel. It still exists. The kernel will use the new one you write in the OpenBSD partition, but if you then remove your GPT, or remove the OpenBSD partition, or remove it and create it in another location, that old disklabel will be read again, which can cause some confusion for
people. So you just have to be aware that those disklabels still live on disk, and that they can resurface if you happen to define your OpenBSD partition where an old one was.

So what you can do is create a disklabel with 15 partitions of type RAID. You can configure a RAID 0 device on each of those partitions, which creates a whole bunch of extra sd block devices, and each one of those has its own disklabel, in which you can configure 15 partitions. So in theory you can configure up to 225 partitions on a physical disk, though it will appear as multiple devices.

So, the future, or potential future. Many constraints have been reached. It has been strongly suggested in many quarters that it's really time to take a serious look at moving beyond the current situation. Many suggestions have been made; this is sort of a random selection. More partitions: everyone wants more partitions. 64-bit offsets: I'm not sure how soon we're going to hit the 144-petabyte limit. More spoofed partitions: being able to recognize more GPT partitions, and actually using them, as opposed to just ignoring them, if the OpenBSD partition is there. Further separating the boot code insertion into installboot and moving it out of fdisk. Doing more EFI magic, like writing the boot image name into the NVRAM. Making separate in-kernel and on-disk labels, not just copying the kernel structure to and from the disk. Eliminating more mixing of sector and block values, so that people realize whether they're dealing with blocks, the vision the kernel has of the device, or with the physical information the device uses. Possibly multiple OpenBSD partitions. And now that we've discovered this 49-partition device, we're probably not going to be able to maintain a list of all these partitions and say, don't touch them; we're going to have to say, here are five partitions that we're willing to touch, and just leave all the other ones alone. There are a couple of
expert modes whose usefulness is pretty dubious. And we're probably going to stop supporting the pre-48-bit partition format, which used 32-bit size and offset values. Once we do that, of course, it'll be clear sailing for the rest of time.

So basically, with a little care and meticulous planning, OpenBSD can turn those bags of blocks into whatever type of useful device you happen to need. Thank you for listening. Questions? Although I now have the questions mic, so I'm not sure how that's going to work. There is another one, but which one should I hold, and therefore break?

Okay, questions. There's somebody up there at the very top, getting some exercise.

Thanks very much for the talk, and thanks for all your work on OpenBSD. You know the default value in the installer for the offset at the start: is there, I suppose, a reason why it's defaulted to 64?

Yes, there is a reason. Do you want to know what the reason is, or were you just curious as to whether there was one? When that was done, we wanted to handle, what did they call them, the various disks that pretended they had 512-byte sectors but actually had 4096-byte sectors. We picked 64 as being the maximum sector size that we could ever see ourselves supporting in the future, and we wanted to start there so that it was on a sector boundary, which sped up the I/O. Obviously time has moved on and perhaps that should be even larger, but that's why we picked 64 at the time.

The reason I was asking was because of the boundaries on SSDs, particularly the...

Yes, time has moved on, and that has not been revisited yet. So yes, that's one of the reasons. One of the fortuitous outcomes of choosing 64, and using sector values at that time instead of block values, was that there was a fair bit of space, in GPTs in particular, between the end of the partition table and the actual beginning of data, and we wrote the disklabel inside that space. That was safe to do because we
never allocated any space there. We only ran into problems when we started using GPTs produced by other people, who thought that was a waste of space and put their file systems right after the GPT, and we ruined a number of Apple machines during that discovery process. So yes, we would be interested in revisiting that, right now or not.

Thank you very much.

Any other questions? Yes, over here.

Hi. Any plans for new RAID disciplines in the future?

Sorry, new what?

Like RAID 5 support or RAID 6 support in the... oh yeah, that's way better, sorry. Do you have any plans or news on new RAID discipline support in the future?

Not that I'm aware of, but I'm sure someone is thinking of it.

Other questions? If you don't have questions, I have other exciting slides. This is what a disklabel looks like. There's a C implementation. We can go through all the fields in a GPT header, that's exciting. So, just motivation for questions.

Okay, no more questions. Thank you very much. Thanks, everyone.