My name is Zbyszek. I work at Red Hat on systemd upstream development, I maintain the systemd package in Fedora and a bunch of other packages, and I'm active in FESCo. This talk is about a new approach to how we could generate our initrd images. A short disclaimer: I don't want to say that the stuff we did before is bad. There were very good reasons to do things the way they were done, but the technology landscape has changed a bit, and I think it's good to think about a new approach.

So what is an initrd and why do we need it? When the kernel initially boots, the bootloader can pass an extra file system to it; this file system is unpacked by the kernel and serves as early user space, before the real root is mounted. The main goal of this early user space is to mount the real root file system. The kernel can obviously mount file systems, and can often do this for the root file system on its own. But nowadays we like to have a RAID array, maybe an encryption layer on top of the RAID array, and some LVM structure on top of the encryption layer. This is too complicated for the kernel, and it needs help from user space to assemble such a storage stack. This is what the initrd does: from the initrd we mount the real root file system, transition to it, and then execute the real init in that real root file system.

In this talk, and in systemd documentation, we say initrd even though technically it's an initramfs. An initrd is a slightly older version of this idea, where you have an in-memory block device and read files from that block device. With an initramfs we do away with the block device part and just unpack an archive into a temporary file system, accessing files from memory without a block device.

Yes, I'm using Beamer. I like it a lot. So let me talk about the current approach that we use everywhere in Fedora.
We have dracut, and dracut is a complicated beast: it labels itself as initramfs-generation infrastructure, and I think it's fair to divide it into four parts. First, dracut decides what should go into the initrd: there are dracut modules that provide some functionality, and you select which ones should be available in the initrd image. There are also mechanisms for dependency resolution between modules, automatically pulling in other stuff when appropriate. Second, dracut provides a bunch of helpers to create the image. A very important one is dracut_install, which installs a binary, but doesn't just copy the binary: it looks at the binary, figures out which libraries the binary is linked to, and recursively pulls in those libraries so that we can actually start the binary in the initramfs. There are helpers to install kernel modules, udev rules, arbitrary files, and so on. In general the approach is to construct the image file by file, dependency by dependency, to make it as small as possible. Dracut then creates a CPIO archive from this, compresses it, and the image is ready. Third, once we reboot into this image, dracut has an execution queue: it waits for events, reacts to them, and does things. And fourth, there are helpers to do various things inside the initrd, so dracut has a bunch of scripts and hooks that run in response to those events.

One of the modules we use is the one that installs systemd. So dracut has udev and systemd in the initrd, and of course systemd needs to support running in the initramfs; many of the things that dracut does in the initrd are actually done through systemd. So the question is: why do we need an execution queue provided by dracut to manage an execution queue inside systemd? Can we simplify all that? You might think that the initrd is super special.
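The recursive pulling-in of linked libraries that dracut_install performs can be sketched as a simple transitive closure. This is a toy model over a hypothetical binary-to-libraries map; the real tool inspects the dynamic section of each ELF file (e.g. what `ldd` would report) rather than a hard-coded table.

```python
from collections import deque

def closure(roots, deps):
    """Return roots plus everything they transitively depend on,
    in the spirit of dracut_install pulling in linked libraries."""
    seen = set()
    queue = deque(roots)
    while queue:
        item = queue.popleft()
        if item in seen:
            continue
        seen.add(item)
        queue.extend(deps.get(item, ()))
    return seen

# Hypothetical dependency data, not real ldd output.
deps = {
    "/usr/bin/mount": ["libmount.so.1", "libc.so.6"],
    "libmount.so.1": ["libblkid.so.1", "libc.so.6"],
    "libblkid.so.1": ["libc.so.6"],
}
print(sorted(closure(["/usr/bin/mount"], deps)))
```

Installing a whole RPM instead replaces this per-file closure with the package manager's own dependency resolution.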
To some extent, this used to be true. We, as in Linux in general, I'm not sure about Fedora specifically, used to link binaries against klibc instead of glibc, and some very strange things were done in the initrd; for example, busybox was used, and so on. Nowadays we generally just install normal binaries. So if we take a Fedora image that is used, for example, in a VM and pack it up as an initrd, it will almost work. The kernel expects a /init binary (or symlink) instead of /sbin/init; that's how the kernel knows that it's an initrd image and not a normal operating system, and that's pretty much it as far as the kernel is concerned. And systemd checks for the presence of the initrd-release file instead of os-release, and then behaves slightly differently; for example, it will start a different set of units by default.

As I said, systemd is also present in the initrd, and systemd likes to set up the environment very early on: it will mount /proc, /dev, /sys, so if programs are executed later on in the initrd, they run in pretty much the same environment as they would on the host system, because the exact same systemd code is used to set up this environment. This wasn't necessarily true in the past, it was probably pretty much untrue, but now there isn't that much of a difference.

Oh, and I wanted to mention that we use the initrd twice during the normal lifetime of a system: first during boot, and then we transition back to the initrd for shutdown. The reason is that if we build up this complicated storage stack, or do some other preparation during boot, the root file system is mounted on top of this set of block devices, and we cannot unmount them or properly clean them up until we have unmounted the root file system. So we switch back to an initrd image and undo everything we did during boot-up.
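The detection rule just described, that systemd treats an environment with an initrd-release file as an initrd, can be illustrated with a small sketch. This is a model of the rule, not systemd's actual code, and the `root` parameter exists only so the example is self-contained and testable:

```python
import os
import tempfile

def in_initrd(root="/"):
    # systemd's rule of thumb: an initrd ships /etc/initrd-release,
    # a normal root file system ships /etc/os-release instead.
    return os.path.exists(os.path.join(root, "etc/initrd-release"))

# Demonstrate with a throwaway directory standing in for an initrd root.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "etc"))
    open(os.path.join(root, "etc/initrd-release"), "w").close()
    print(in_initrd(root))  # True
```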
So not only is the initrd image very similar to a real environment, we also switch back and forth regularly between the initrd and the real file system. And I want to say that there is no functionality that we would require in the initrd that we wouldn't also want on the host. All kinds of storage: in particular, if we are able to mount, I don't know, degraded RAID arrays in the initrd, we also want to do this in the real system for debugging, and it would be strange not to be able to anyway. All kinds of file systems, networking stuff: if we need it here, we also need it in the real system. Not to the same extent, but to a large extent, everything that we do on the host we may also want to do in the initrd. As an example, you might say, well, we don't need Bluetooth and sound in the initrd, but I don't think this is true. If we enter emergency mode because our disk doesn't boot and we need to debug stuff, and our keyboard is a Bluetooth keyboard, then most likely we want Bluetooth. Two days ago, during Matthew Miller's talk, there was a question about the state of accessibility in Fedora. I assume that if you have a screen reader that you use during the normal lifetime of the machine, you probably also want it in the initrd, so that you can type in a password or debug things, and this pulls in at least the sound stack and probably various other things. So, at least in principle, we need to be prepared to put arbitrary stuff in the initrd. People commonly ask for SSH to be activated if we enter emergency mode in the initrd, and so on.

This wasn't true in the past, but nowadays we do things through binaries and daemons, not necessarily scripts, so we would use the same binaries in the initrd as on the host. I said dracut covers those four areas, and I want to replace this with a different scheme. The configuration mechanism will be just a list of RPMs.
Let me do this in a hand-wavy fashion for now. I have some list of RPMs, and I create the image by calling dnf install --installroot and zipping this up into a CPIO archive. Once we reboot into this image, we have an execution queue: we don't need the dracut one, we can just use systemd and let systemd do its job. And for all the rest, the solution is plain old ordinary RPMs, which in general provide all the functionality that we need.

People have been talking about an approach more like this for a while, and the objection that comes up the most is that the resulting image would be too large. I hope I'll have time for some Q&A at the end, but let me preemptively answer some possible questions now. If we have RPMs that are too large, the answer is to split them up. This doesn't benefit just this use case; it also benefits container images, cloud installations, live CDs, Flatpaks, embedded devices, and so on. If an RPM has too many dependencies: we actually have a Council objective to minimize stuff, we have dependency graphs and so on, and people try to reduce the dependency tree. Again, if we do it there, it helps here, and vice versa. And if some RPM doesn't work in the initrd, since the initrd looks pretty much like a normal system, I would say that this RPM is broken; it would probably also be broken in a container or in some other minimal installation, and it just needs to be fixed.

So what does this give us? Installation becomes trivially easy, and RPM is very good at installing files. We don't need a separate mechanism for dependency resolution; we just specify some top-level list of packages and the dependencies get pulled in.
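To make the "zip it up into a CPIO archive" step concrete, here is a minimal writer for the "newc" cpio format, the format the kernel unpacks as an initramfs. It is only a sketch: it handles regular files, ignores directories and ownership, and a real build would simply run something like `find . | cpio -o -H newc | zstd`.

```python
import io

def _align4(buf):
    # Both the name and the file data are padded to 4-byte boundaries.
    buf.write(b"\0" * (-buf.tell() % 4))

def _entry(buf, name, data=b"", mode=0o100644, ino=1):
    nb = name.encode() + b"\0"
    # newc header: magic "070701" plus 13 fields of 8 uppercase hex digits:
    # ino, mode, uid, gid, nlink, mtime, filesize,
    # devmajor, devminor, rdevmajor, rdevminor, namesize, check.
    fields = [ino, mode, 0, 0, 1, 0, len(data), 0, 0, 0, 0, len(nb), 0]
    buf.write(b"070701" + b"".join(b"%08X" % f for f in fields))
    buf.write(nb)
    _align4(buf)
    buf.write(data)
    _align4(buf)

def make_cpio(files):
    """files: {path: bytes}. Returns a 'newc' cpio archive
    (sketch: regular files only, no directories or hardlinks)."""
    buf = io.BytesIO()
    for i, (name, data) in enumerate(sorted(files.items()), start=1):
        _entry(buf, name, data, ino=i)
    _entry(buf, "TRAILER!!!", mode=0, ino=0)  # archive terminator
    return buf.getvalue()

archive = make_cpio({"init": b"#!/bin/sh\necho hello from the initramfs\n"})
print(archive[:6])  # b'070701'
```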
And actually, with dracut a common problem is that some new file is added to an RPM: the maintainers of that RPM add the file and take care of the changes that need to be done, and then this needs to be rediscovered by the dracut maintainers, because how would dracut know, unless it's a library that is pulled in through linking? This information has to be added to dracut separately. And when we install things from RPMs, we don't pull files from the host: right now, if there's a corrupted binary somewhere, it will be pulled into the initramfs image. Images become reproducible, because from the same recipe we should get output that is bit-for-bit identical in every case. And in general we simplify things, because instead of having bash scripts that manage other bash scripts, we just use dnf, and we stop wrapping our binaries in shell helpers and a shell execution queue.

The last two items are social things. Right now, if we find some issue during boot, say the new version of package A does not work with the current version of package B, there are at least three parties involved: the maintainers of A, the maintainers of B, and the maintainers of dracut, because the environment is so special that we can't necessarily say what the source of the issue is. By just using normal RPMs in a normal environment, we reduce this to two: it's either a bug in A or a bug in B, and it needs to be solved there. So we stop centralizing the management of initrd bugs and let package maintainers handle them in most cases. And finally, whatever we do to make things better here is also useful in other contexts, in the spirit of open source. So this was my motivation and my overview, and now I want to talk a bit about my approach to implementing this.
And I want to make a disclaimer that, as I said, it's basically dnf install from existing packages, some minimal cleanup, the minor additions that are needed to create an image, and then this is zipped up with CPIO and compressed. So I'll be talking about an implementation, but it's a really minor thing.

mkosi is a Python program that was created for the purpose of testing systemd on different distributions. It's a small helper that will call, for example, dnf to install Fedora packages, or pacman to install packages, and so on. It sets up a temporary area (and a loopback device) to do the installation, but the actual installation is done by the program specific to the distribution. The normal use of mkosi is to then build the program that you want to test, like systemd, in this freshly installed environment, install it there, and boot it. For this initrd use case I'm not interested in that second step, so I'm just using mkosi to call dnf and create an image. I added functionality to mkosi to create CPIO archives and to compress them with zstd. You might ask why the second part is necessary: well, my images are a bit bigger than the images we have currently, and compressing them with xz was noticeably slow, so I thought it best to hide this cost by switching to zstd.

I considered some alternatives. I talked with the kiwi-ng developers and got very good responses, but my use case wasn't covered natively as far as I could see, though I think it could be made to work with maybe even very minimal changes to kiwi. I also looked at osbuild. I think the implementation is not so important, because it's really not that much functionality; even if the scheme is adopted, maybe a different approach will be used in the end, and this doesn't matter so much. And now I want to do a demo,
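The build described in the demo can also be captured declaratively in an mkosi configuration file. This is a hypothetical sketch; mkosi's section layout and option names have changed between versions, and the package list here is illustrative, not the one used in the talk:

```ini
# mkosi.default -- hypothetical settings for an initrd build
[Distribution]
Distribution=fedora
Release=34

[Output]
Format=cpio
Output=initrd.img

[Packages]
Packages=systemd
         udev
         kernel-core
         lvm2
```

With a file like this in place, running `mkosi` (as root, since it sets up loopback devices) would produce the compressed CPIO image.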
but this would mean that I need to share a different screen. So yes, I'm saying to Neal: yes, I think I want to motivate you to provide the necessary functionality; I couldn't figure it out, but I'm sure it's possible.

So I will try to build an image for my machine. The command that I'm calling starts with sudo; I need privileges because mkosi sets up a loopback device, and that requires privileges. Then mkosi; -f means that I want to overwrite the existing output file that is already there; then the output file name. I'll call this first with summary, and summary doesn't actually do anything, it just prints out what would be done. So I'm installing for Fedora version 34, output format CPIO, the output file name, and here's my package list. Well, OK, let's now do this for real. This calls dnf, and dnf refreshes the cache, so let me jump ahead to the list of RPMs. Those were the RPMs that were specified. You'll notice that I specified some file paths; dnf of course turns these into package names. So I specified, for example, the checker for XFS file systems, and so on, and dnf installs all those files. There's some cleanup being done, kernel modules are injected in a slightly different manner, and I get an image. So that's it, more or less. I cannot boot this image right now, because that would break the presentation, so let me return to the previous screen.

So this builds an initrd image for my machine. I could build an initrd image for a different version of the kernel by overriding the version of the kernel package, and I can also specify a release version and so on. So if I want to build an image for a different version of Fedora, of course this works; I just need to pass some arguments to dnf. This all works nicely.

I said that the images are bigger. The image that I built, with LVM and bash and a bunch of other things, is about twice as big as the
image that dracut builds on my machine, and if I unpack it, the difference is more than two: dracut's image is 77 megabytes unpacked, my image is 165 megabytes. If I actually look at the files that are there, the biggest difference is that I have 77 megabytes of kernel modules while dracut selects only 5 megabytes. (Sorry, there's some farm equipment moving around; hopefully it will go away.) This is because I take everything that is in the kernel-core RPM, which has a bunch of modules, while dracut pulls in modules one by one, only the ones that are necessary. I think that the kernel maintainers know more about the kernel, and they could provide some small kernel subpackage with the basic set of modules that are needed for most machines. This would be better than trying to figure out the appropriate list externally, in particular because kernel modules can be provided as loadable modules or be built in, new modules appear over time, and so on. It's like with any other package: it's better for the maintainers of the package to decide what should be split into which subpackage.

Then there are the actual binaries. The difference there is about two times, and this is because those are basically the same binaries, except that my image has more of them. This is kind of easy to fix, because binaries usually have no direct dependencies between them, so it's easy to split an RPM with multiple programs into subpackages. With libraries this could be more complicated, because you have dependencies. In particular, among the binaries present in those images there is a bunch from systemd, and systemd is already being split over time into multiple subpackages for various other reasons; maybe we will need to do it more,
but I think it shouldn't be too hard. Then there is the actual list of libraries and their sizes; the set of libraries that we get is very similar. And I have 10.5 megabytes more in /usr/share. One of the items that surprised me is that there are 3 megabytes of license files. There are 5 megabytes of zone info data; there was actually a proposal to stop pulling in zoneinfo, which right now is pulled in by glibc even though glibc doesn't really need it, so it seems that this will go away on its own. Then there is some stuff like certificates and terminfo, which we might need or not, and some smaller things.

Surprising to me is that dracut does not install hwdb, the binary version of the systemd hardware database. This is not really an error in dracut; it's on purpose, because the idea was that you start with some minimal set of udev rules in the initrd, then transition to the real system, where you have more rules, and redo hardware detection there. I'm trying a simpler approach, where we start with the full set of udev rules specified by the installed packages. This means that we need the hardware database, but it also means that in some cases devices are described, and behave, as expected earlier on.

So that's it about the implementation, and now I will talk a bit about where I want to go with this. We build images on the host because we need to customize the components that go into those images. This also means that the images on every machine are slightly different, and they cannot be signed centrally; you would need your own signing infrastructure. If we want distribution-wide signatures, this is fundamentally incompatible. But if we are building images from RPMs, they are reproducible and they are the same on every machine; well, if we use a specific template they are the same. So I want to be able to create images in
Koji or somewhere in the centralized infrastructure and sign them like we sign the kernel or the bootloader. This is not complicated, but it doesn't answer how to deal with the problem of local customizations. I have an idea how to make this work, so let me do another demo; I have to switch the screen again.

systemd has this feature, it has been around for a while but maybe it's not widely known, called sysext (it's very hard to pronounce). I was testing this before, so let me undo my state: I have no extensions. Also, for the purposes of this demo, I uninstalled clang, so I don't have the clang package installed. What is a systemd extension? It's a small file system image that contains just /usr, and at runtime systemd will mount the contents of this /usr tree on top of our existing /usr directory. It's easier to do than to say. First, let me show the file; you can see that it's less than a megabyte. I will mount it on a temporary directory to look at the contents, and if I run find on it, you can see that it has /usr and a bunch of files that correspond to the contents of the clang RPM. Let me unmount this. So, once again: clang is not present, and now I will tell systemd to merge the extensions. What it does is look in a set of directories, like /var/lib/extensions, and any extensions that are found are merged into /usr. If I now look at how /usr is mounted, it's an overlayfs: systemd is just doing some mount-namespace tricks to make the contents of this extension visible. And as you can guess, I now have clang available. If I unmerge, my /usr is not a mount point anymore, I have my normal LVM on the root drive, and no clang again. And I can have multiple extensions like this. Please let me know in the chat if this explanation is sufficient as a small introduction. So let me continue: we have these extensions; now the important
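The merge semantics shown in the demo, an extension's /usr layered over the host /usr via overlayfs, can be modeled in a few lines. This is a toy lookup over dicts, not the kernel's overlayfs; it only shows that the upper layer shadows the lower one and that a merged listing is the union of both:

```python
def overlay_lookup(path, upper, lower):
    # overlayfs resolution in miniature: the upper layer (the merged
    # extension) wins, everything else falls through to the lower layer.
    return upper[path] if path in upper else lower.get(path)

def overlay_list(upper, lower):
    # A merged directory shows the union of both layers.
    return sorted(set(upper) | set(lower))

host_usr = {"bin/ls": "coreutils"}   # what the host /usr provides
ext_usr = {"bin/clang": "clang"}     # what the sysext image adds
print(overlay_lookup("bin/clang", ext_usr, host_usr))  # clang
print(overlay_list(ext_usr, host_usr))  # ['bin/clang', 'bin/ls']
```

Unmerging is just dropping the upper layer: lookups fall back to the host /usr alone.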
part, the important feature that is useful for the initrd, is that this is done at runtime: we build the /usr directory from multiple parts, not before we boot into the system, but while it is running. So, returning to the problem at hand: I want to be able to customize images, and I want to do this centrally, and the systemd that is running in the initrd can load extensions on its own. The idea is that we would build the kernel, as we do now, in the distribution infrastructure; we would build an initrd that matches the kernel and also sign it; and then build some curated set of system extensions, for example for networking, or iSCSI, or sshd in the initrd, and so on, and each one would be signed too, because it's built from distribution packages, so we trust the contents. Then at runtime the bootloader or the firmware loads the kernel and the initrd, verifies the signatures on them, and passes control to the kernel; the kernel starts the contents of the initrd; and in the initrd we get some list of extensions, verify their signatures, and mount them.

To make this quick, the extensions would provide integrated verification with dm-verity. The way that works is that you have a block device with two partitions: one is the data partition, where you have the files, and the second is a hash partition. Each block in the data partition has a short hash in the hash partition, which gives us a long list of hashes; then we have another set of hashes that hash those hashes, and another layer that hashes those, so we build a tree of hashes. In the end we have one hash that verifies the next layer of hashes, and so on down to the last layer of hashes, which verifies the data partition. So in the end there is one hash that we need to verify externally. This hash would be signed during the build and verified by systemd before loading the
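The hash-tree construction just described can be written down directly. This sketch builds a dm-verity-style Merkle tree with SHA-256 and returns the single root hash that would be signed and verified externally; real dm-verity additionally uses a salt, a superblock, and a fixed on-disk layout, none of which are modeled here:

```python
import hashlib

BLOCK = 4096  # dm-verity typically uses 4096-byte data and hash blocks

def root_hash(data, block=BLOCK):
    """Merkle-tree root over `data`, dm-verity style (sketch only:
    real dm-verity adds a salt and a superblock)."""
    chunks = [data[i:i + block] for i in range(0, len(data), block)] or [b""]
    # Bottom layer: one hash per data block.
    level = [hashlib.sha256(c).digest() for c in chunks]
    while len(level) > 1:
        # Pack as many child hashes as fit into one block, hash each group.
        per = block // len(level[0])
        level = [hashlib.sha256(b"".join(level[i:i + per])).digest()
                 for i in range(0, len(level), per)]
    return level[0]

image = b"\0" * (300 * BLOCK)  # stand-in for an extension's data partition
print(root_hash(image).hex())  # the one hash that gets signed centrally
```

Verification then runs the same computation in reverse: each block read from disk is hashed and checked against the tree on demand, which is why nothing needs to be verified up front.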
extension, and then the kernel would do the rest of the verification. Doing this with dm-verity is nice and quick because we don't need to verify everything up front; we just verify the blocks that are loaded, when they are loaded. All together, this gives us a mechanism to build a full chain of trust for everything that we load in the initrd.

So, current status. I use a QEMU virtual machine for development, and this works nicely. My laptop also boots nicely: I have LVM and full disk encryption with LUKS. Emergency mode also works, and this was actually very simple to do, because I could just say that I want to add the debug-shell service to the emergency target, and without any further work I get this for free. Hibernation and resume also work. I didn't test the fancy stuff, like setup of networking or network file systems and so on, but they work on the host, so in principle they could and should work in the initrd too. Some parts are completely missing. In particular, I mentioned that we switch back to the initrd for shutdown, and this will need to be done in this scheme too, but it can be done in a simpler fashion. Right now, the way this works is that during boot, dracut zips up the initrd image that the system was booted with, stores it in memory for the lifetime of the system, and unpacks it again at shutdown. I want to do just this last step: skip the zipping part and unpack a file from the file system for shutdown. We don't need the version of the initrd that we shut down with to match the version that we booted the machine with.

There are thousands of things still to be figured out. I mentioned that I made some pull requests to mkosi, and they have been merged, and also to systemd; the biggest parts are done, and there are still some minor things to do. A big chunk of the work will be in packaging, if this is to be made really possible. In particular, the kernel RPM needs to be split better,
because right now there is kernel-core, kernel-modules, kernel-modules-extra, and kernel-modules-internal, and kernel-core contains the actual kernel image, which I don't want, plus modules, but all of them in one big chunk. Instead, I think the kernel image should at least be split into a separate RPM. For example, if we do direct kernel boot with KVM, we need the kernel image on the host and the kernel modules inside the guest, and right now the kernel packaging does not make this easy: you have to install both to get either set of files.

And there is plenty of minimization work to be done. In particular, when you look at the list of packages that gets installed, there is dbus, pulled in by systemd. I think it's reasonable to not pull in dbus; this means that certain things will not work in the initrd, but those are mostly higher-level, user-related functionality, so that should be OK. We have a bunch of libraries in two copies, like PCRE and PCRE2, and libcap, which is tiny but still annoying to have twice. shadow-utils is a big package and it's completely unnecessary, because we will never create additional users in the initrd while it's running, and even if we were to create some user, we wouldn't need to set a password for it. util-linux is being split up already. And there are some strange things being pulled in through dependencies that should get fixed.

An important one is the crypto libraries. It seems that we will finally have OpenSSL 3 with a nicer license. In systemd we use libgcrypt in some cases because we don't want to link to OpenSSL everywhere, since Debian does not treat OpenSSL as a system library; with the upcoming version 3 of OpenSSL this stops being an issue, and hopefully we can have just one crypto library required by systemd, and then maybe we could have just one crypto
library in the initrd image, and also in other minimal installations. polkit is probably unnecessary too. If all those things are done, the size of the image should drop nicely, and it will actually be pretty close to the size of the dracut image.

Well, that's all that I have. To summarize: build initramfs images directly from system packages, use systemd, use RPMs, and sign and verify everything in central distribution infrastructure. So, I don't know, questions? Do I have Q&A?

What about using fs-verity instead of dm-verity? I don't know, I will look into this; sorry, I cannot answer this right now at all.

How do we deal with people installing drivers and things afterwards? Basically, if you add an iSCSI mount and you need it in the initrd, you would also pull an additional extension into the initrd that would deal with that. My idea is that right now, when we have multiple initrd images, for example for multiple versions of the kernel, each one is a complete set of things on its own, but the extensions could be shared between multiple versions of the initrd, so maybe we could use the space that we have more efficiently. So basically the idea is to add extensions.

What about live media? I don't know; maybe live media would use a different recipe. I don't know enough about live media.

How are we supposed to deal with third-party drivers and such? First of all, even if this scheme is adopted, the old scheme wouldn't go away, and even if we have centrally built images, you can always build an image locally; it wouldn't be signed, maybe. So one option is to say that it wouldn't be signed, you build it locally, and you're not worse off in any way than right now. Another option would be to say that if those third-party drivers are distributed by somebody else, they could sign them, and the user could add another signature to the trusted set of the local setup. But I don't think this scheme is significantly worse for
third-party drivers: you would probably need to build the image locally, and then you just lose the central signing, but that's it. OK, I hope I have answered the questions, and that's it, I guess. Thank you for attending, and thank you.