Welcome to this, my presentation on read-only root file systems. I'll start off with a quick introduction, if my thing works. This is what happens when you don't, yes. Licences, we'll skip that. Some people may know me already. I'm a trainer and consultant in embedded Linux. I've been doing this for quite some time, and I recently published a book on the subject. You can contact me there at LinkedIn, Google+, and other places.
But this is really about the read-only root file system. Why would you need it? I hope to persuade you that it's a good idea. Then there are the things that go wrong when you try to implement a read-only root file system, and some hints on how to put things right again. This talk has turned out to be a bit of a work in progress. I had hoped to present a complete solution to the problem. Turns out it's more complicated than that, so I'm leaving some loose ends at the end here.
The thing is that in the abstract I promised some demonstrations, some live demos. I've decided to leave those out. So I'm denying you the schadenfreude of seeing me make an ass of myself with a demo that goes wrong. I apologize for that. What I actually did in the end is I ran the demo in my office, where it was all nice and quiet and everything worked perfectly, and I put the screenshots onto the slides. That's called cheating.
So why a read-only root file system? Well, those are the reasons. We're mostly talking about devices running on flash file systems. We want to reduce the wear on flash memory, which increases the length of the lifecycle of our products. It reduces or possibly eliminates file system corruption: if the thing is read-only, in theory, files can't get corrupted. It avoids accidents, both at the user interface level and, more importantly, at the coding level and the application level. Again, if it's read-only, your app can't somehow accidentally, by some unknown mechanism, suddenly delete everything from the root. It also makes it easier to do image updates, because we can just drop a new root file system image in there and there is nothing else to worry about. And in the same kind of vein, it makes factory reset easier, because we can just delete the partition that contains the state.
So for all these reasons, read-only root file systems are a good thing. It's been my opinion for a good many years that every embedded system should employ a read-only root file system. And I hope all of you guys out there who are producing embedded systems are of course making your root file system read-only. Nods?
So it's easy, yes? All you have to do is mount the rootfs with the ro option. Job done. Well, of course, no, it isn't that easy. Of course there are some files that change, some information that we need to retain. Some examples are given here: passwords, random seeds and suchlike. All of this stuff we need to preserve somewhere. So we need to start segmenting the storage areas of our system.
So I divided it into three chunks. The first, on the top there, is the root file system, which is going to be read-only; that's the whole point of this. But we need somewhere to store the stuff that does change as we go through a session. So there's a second area there, which I've marked as user data. Collectively this is non-volatile information, so all of this stuff needs to be stored in flash memory. And then we have the third area, temporary files, stuff that we don't need to preserve across reboots. We can put that into a RAM disk, a tmpfs or something. So we have those three basic areas for any embedded system.
And the current buzzword is stateless. In this context, then, state is anything that changes, any file that changes.
And we can characterize this as being either non-volatile state, stuff that we need to preserve across reboots (passwords, random seeds, SSH keys, that kind of thing), or volatile state, stuff that we only need for one session and don't need when we reboot. So to create a stateless root file system, we need to identify the state, the things that change, and we need to decide whether it's per-session or persistent. The persistent stuff we put into the non-volatile storage area; the non-persistent stuff, the volatile stuff, we put into the volatile area. Good.
If we take this just a little bit further, you will see, particularly in the context of containerized systems, Docker and CoreOS and these things, the idea of having a packaged system, or a componentized system, where we have a base OS and then containers that are loaded into that. The containers themselves are stateless, because it turns out that adding state into a container really messes up the deployment of these things. In that context, these guys will define stateless as being not only read-only, but also that all of the components should contain a default configuration, so that even if you wipe out the /etc directory, it should still know what to do. So this is the nirvana of statelessness.
Once you achieve that state of enlightenment, certain things become easier. A factory reset, for example, is trivial now. You simply need to delete the non-volatile state, and that puts everything back to the default conditions. You should have a system that basically works and can then be reconfigured for a particular deployment. Likewise, system update becomes much easier. Assuming we're talking about image-based updates here, which I am, then deploying a new root file system image is a simple question of overwriting the current file system image. You don't have to worry about preserving the state, because the state is stored somewhere else.
And this is something that is actually quite a hot topic. I think I've counted four talks this week on the subject of system update, one of which is mine. So I hope to see you all again tomorrow at 14:00.
So that's what we're trying to achieve. How do we achieve it? First of all, then, I want to look at how you identify the state. How do you identify the files that are changing? I'm going to introduce you to a couple of simple tools that pretty much every system has: diskstats and block_dump.
diskstats, then, is just a file in the /proc file system. It gives you a breakdown of disk I/O activity per storage device and per partition on that device. Here's a dump of diskstats. Let me come over to this side for a minute. In this case, we can see that on the device, which is an MMC device, we have 2,640 reads as field one. Then if we jump along to field five, 127 writes. And we can see that the writes are mostly on partition two, 118 there, and a few on partition nine. Sorry, partition five. So that tells us where the problems are, where the state has been modified, and we can identify whether we need to investigate this. In this particular device, the root file system was partition two, so that's where we'd want to investigate further. Incidentally, this format is not particularly easy to read just by catting /proc/diskstats. If you want something formatted slightly better and you have vmstat installed, vmstat prints out the same information but formats it slightly nicer.
But it doesn't tell you who made those changes. So the next useful tool is block_dump. block_dump is a trigger you can set to enable logging of accesses at the block I/O level. We just need to write 1 to /proc/sys/vm/block_dump, and from that point onwards it will log all reads and writes. So, a simple example: I write a string to a file and then I look at dmesg. We see that here we dirtied an inode associated with the file world.txt. We do that twice.
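A minimal sketch of reading those two diskstats counters programmatically, in Python. It assumes the usual /proc/diskstats layout of major number, minor number and device name followed by the numbered statistics, so that reads completed is field 1 and writes completed is field 5, as described above. The sample text is invented for illustration: the device name mmcblk0 and every column other than the read and write counts quoted in the talk are assumptions.

```python
from typing import Dict, Tuple

# Illustrative sample only. The read and write counts echo the figures quoted
# in the talk; the device names and all other columns are made up.
SAMPLE = """\
 179       0 mmcblk0 2640 1170 145326 4310 127 85 2072 1370 0 3200 5680
 179       2 mmcblk0p2 2410 1100 140010 3900 118 80 1984 1290 0 2900 5190
 179       5 mmcblk0p5 120 40 2100 210 9 5 88 80 0 200 290
"""


def parse_diskstats(text: str) -> Dict[str, Tuple[int, int]]:
    """Map each device name to (reads completed, writes completed)."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue  # skip blank or truncated lines
        name = fields[2]
        reads = int(fields[3])   # statistics field 1: reads completed
        writes = int(fields[7])  # statistics field 5: writes completed
        stats[name] = (reads, writes)
    return stats


def written_devices(text: str) -> Dict[str, int]:
    """Return only the devices and partitions that have seen writes."""
    return {name: w for name, (_, w) in parse_diskstats(text).items() if w > 0}
```

On a target, the same functions could be fed the contents of /proc/diskstats instead of SAMPLE; any partition that appears in written_devices and is meant to be read-only is a candidate for further investigation.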
I'm guessing that one of those is when it was created and the other is when it's closed and the access time is updated. And because I'm doing this within an ash shell, we also get a couple of changes to the ash history file. Then sometime later the journaling block daemon comes along and actually writes those changes from the page cache to particular blocks on the storage device. So, looking at this, the interesting stuff is, first of all, the word "dirtied", because that gives us the file name. It also tells us which volume the write was on. So vda: I actually did this running on QEMU, so this is using the virtio block device.
To get a complete history of writes to our file system, you simply need to add an early boot script that writes 1 to /proc/sys/vm/block_dump, then boot up, and then just filter out the junk. So I'm grepping for "dirtied" and "on vda", and we see something like this. This is booting a clean, newly installed Yocto, or rather a Poky core-image-minimal. And we see all sorts of things are happening. The password file is updated, the random seed is updated, the SSH keys for Dropbear are written, the udev cache does something, the MOTD gets updated, and actually a whole bunch of other things as well. So this gives us a detailed list of things that are changing.
One slight annoyance when looking at this is that we only get the base name; we don't get the full path name. But that's not a huge problem. It also gives us the inode number, so if we really want to, we can use find -inum or something and get the full path name. If we're not exactly sure where the random seed is stored, we could do a find on that inode and find out exactly which file it is being stored in.
So we have a list of things we need to look at. The kind of problems we're having, then, are, first of all, that on first boot, at least on a Yocto Project system, quite a lot of bits and pieces of the root file system get updated.
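The grep step described above can be sketched in Python as follows. This assumes the block_dump log lines have the form `comm(pid): dirtied inode N (name) on device`, which is how the messages in the talk read; the exact format is an assumption here, block_dump is only available on kernels that still provide it, and the sample log (timestamps, PIDs, inode numbers) is invented for illustration.

```python
import re
from typing import List, Tuple

# Assumed message shape: "<comm>(<pid>): dirtied inode <n> (<name>) on <dev>"
DIRTIED = re.compile(
    r"(?P<comm>\S+)\((?P<pid>\d+)\): dirtied inode (?P<inode>\d+) "
    r"\((?P<name>[^)]*)\) on (?P<dev>\S+)"
)

# Invented sample in the style of dmesg output after enabling block_dump.
SAMPLE_LOG = """\
[   12.100000] sh(642): dirtied inode 1234 (world.txt) on vda
[   12.100500] sh(642): dirtied inode 1234 (world.txt) on vda
[   12.200000] sh(642): dirtied inode 1301 (.ash_history) on vda
[   17.300000] jbd2/vda-8(97): WRITE block 4210 on vda (8 sectors)
[   18.000000] sh(642): dirtied inode 77 (scratch) on tmpfs
"""


def dirtied_files(log: str, device: str) -> List[Tuple[str, int, str]]:
    """Return (process, inode number, base name) for each dirtied inode on device."""
    hits = []
    for line in log.splitlines():
        match = DIRTIED.search(line)
        if match and match.group("dev") == device:
            hits.append(
                (match.group("comm"), int(match.group("inode")), match.group("name"))
            )
    return hits
```

Keeping the inode number alongside the base name is deliberate: it is what a later find by inode number would need in order to recover the full path.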
So we saw that Dropbear writes its SSH keys to /etc/dropbear. We see that udev writes its snapshot to the udev cache, and a whole bunch of other things as well. We would need to resolve those problems, either by making those changes at build time (there's no particular reason why udev-cache.tar.gz has to be generated on first boot; we could create that file as part of the build process and then just put it in there), or we could just do without it altogether. udev still works without it, but it does take a little bit longer to boot up. The other solution would be to take the files that need to be changed and create golden copies of them somewhere, and then, at boot-up, copy them into a suitable storage area and use symlinks or whatever so that they then get picked up.
So first-time boot is a problem, and then we get stuff that legitimately has to change at runtime: network config, random seed, log files, et cetera. I'm going to look at some of those in the next few slides and suggest some obvious solutions.
Essentially, I'm taking a pragmatic approach. I'm taking a standard Yocto build and then fiddling around with it to fix up the problems I find. Typically the way to do this is to take any files within the root file system that are written to and replace them with symbolic links to a user data partition somewhere, which is going to store the non-volatile state. Or, for volatile state, we can choose to symlink to, or otherwise provide, a RAM disk, a tmpfs. You can do something slightly different using an overlay file system such as unionfs. unionfs kind of does the same thing but in a slightly different way. You have your read-only root file system and then you overlay that with a unionfs, so that any changes you make to the read-only root file system are actually made in the unionfs, which is a separate area of storage.
So, first pass. This is kind of the canned demo that I was going to do and I'm not doing.
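The symlink approach described above might look something like the following at image build time. This is a sketch under stated assumptions rather than anything the Yocto build does: the function name relocate, the staging layout, the /data mount point default and the file name random-seed are all hypothetical.

```python
import os
import shutil
import tempfile


def relocate(rootfs_dir, data_dir, rel_path, data_mount="/data"):
    """Move a writable file out of the staged rootfs and leave a symlink behind.

    rel_path must be relative (no leading slash). The golden copy lands in the
    staged data partition; the symlink points at the on-device location.
    """
    src = os.path.join(rootfs_dir, rel_path)
    golden = os.path.join(data_dir, rel_path)
    os.makedirs(os.path.dirname(golden), exist_ok=True)
    shutil.move(src, golden)
    link_target = os.path.join(data_mount, rel_path)
    os.symlink(link_target, src)  # dangles on the build host, resolves on target
    return link_target


def demo():
    """Stage a tiny fake rootfs in a temp dir and relocate one file."""
    with tempfile.TemporaryDirectory() as tmp:
        rootfs = os.path.join(tmp, "rootfs")
        data = os.path.join(tmp, "data")
        seed_rel = "var/lib/urandom/random-seed"  # hypothetical file name
        os.makedirs(os.path.dirname(os.path.join(rootfs, seed_rel)))
        with open(os.path.join(rootfs, seed_rel), "w") as f:
            f.write("seed")
        target = relocate(rootfs, data, seed_rel)
        link = os.path.join(rootfs, seed_rel)
        with open(os.path.join(data, seed_rel)) as f:
            golden_content = f.read()
        return target, os.path.islink(link), os.readlink(link), golden_content
```

Mirroring the original path under the data partition keeps it obvious where each symlink leads, which is the control that the symlink approach is meant to give.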
We need to make the root file system read-only. So we add an ro there, which is fine and fair enough. And I have added in a new partition, vdb in this case, and I'm mounting it on /data. So /data is my non-volatile data storage. Then we use tmpfs for the various bits and pieces that are the volatile state, specifically the /run directory. There should be nothing in /run that needs to be preserved over a reboot. Likewise /var/volatile. And you'll find elsewhere that /tmp is a symbolic link to /var/volatile. So that takes care of our volatile state as well.
So, problem areas. Log files: every damn thing wants to write its own logs. Some stuff writes through syslog. If everything were to write through the syslog daemon, then in some ways that would be easier, because we could just use the BusyBox syslog daemon, which has a log-to-RAM option, the -C option. But in general, things don't play quite so nicely. Everyone wants to write their own log files. Do we really care about these log files? It depends. Many times we don't really care about preserving log files for posterity, so the simplest solution is just to make the log files volatile, so that we have the log for one session but lose it if we reboot. Or you can symlink those log files, or rather that log directory, into your non-volatile storage, and you can keep the complete history, or at least some portion of the history, of your device. In my example here, if we're using Poky core-image-minimal, which is what I'm basing this on, then by default that actually makes /var/log a symbolic link into /var/volatile/log. So if you're using the Poky default parameters, you do end up with volatile log files.
Random seed: no big deal. If you're using the pseudo-random number generator /dev/urandom, which almost certainly you will be, then it's very important to preserve the random seed between boots. Otherwise you start from the same point in the pseudo-random sequence, and you have a less secure random generator. So we can either add in a symbolic link so that /var/lib/urandom points to our non-volatile storage, or we can actually just go and hack the script. It's just a simple shell script which does a dd at start and end. So that's easy to fix.
Likewise with Dropbear, we can do this in two ways. We only actually need to generate the SSH keys once, so we could do this as part of the build system, but it would of course have to be done per device. There's no point having the same SSH keys for every device; that kind of defeats the object. So again, the simplest thing to do in this case is to generate the SSH keys at boot time but, by symlinking or other mechanisms, store them in the non-volatile area.
Doing less on first boot. This is kind of stating the obvious, maybe. The reason I haven't gone into more detail here is that in order to solve these problems, you actually need to delve into the packages and the boot system and fix up a bunch of things. So I'm kind of skipping this. OK.
Looking at some concrete examples of systems that are deployed out there, Android and Brillo are a good example of how to do this properly. One of the nice things about Android is that right from the very start it had a stateless root file system, except they didn't call it root, they called it system, but it's the same concept. All of the state in an Android device is stored in the user data partition, which is mounted on /data. As a result of that, Android devices are able to do factory resets very easily: you just wipe /data. And they have the OTA update mechanism, which allows you to update the system image and therefore update to a new version of Android.
Yocto is kind of edging its way into this area somewhat. Right now, and for some time, there has been an image feature, read-only-rootfs, which does what it says. It mounts the root file system read-only.
The main problem is that it doesn't have anywhere to store the non-volatile state. So if you actually do this, you will find that, for example, the Dropbear SSH keys are stored in a RAM disk, which means that every time you reboot you get a new set of SSH keys, which means that when you try to log on to it you have to re-establish those credentials. Also, it doesn't bother to keep the random seed, so you have a less secure random number generator, and so on, and so on. Now, the reason that the Yocto guys haven't done this is because it's difficult. It means you need to go into every single package and make it stateless-aware.
So I actually got through this slightly quicker than I was intending to, probably because I didn't do the demo. So, conclusion then. Yes, read-only root file systems are a very good idea, as I think we all agree. We have tools such as diskstats and block_dump, which allow us to identify problem areas and to identify individual culprits, so we know who is changing what. And we have some ideas about how to resolve problems: by symbolically linking files to the non-volatile storage, by using unionfs, and by using tmpfs for volatile data. And that's almost literally it.
Since I've gone through that fairly quickly, we have time for a Q&A. The slides themselves, by the way, should be on the conference web page, but I'm not convinced. I haven't actually checked if they're there or not. They're not present? OK. Well, they are definitely present on SlideShare, because I uploaded them last night. But they will, in the fullness of time, appear on the conference website as well. So, any questions? OK, well, you were the first, so.
Correct. As I described it, if you simply add symlinks for every file that gets changed and then you erase the data directory, you will end up with dangling symlinks. So I didn't literally mean that you would wipe the entire partition, but you would wipe the content.
So you need to have a directory structure in that non-volatile /data directory, and you need to erase the individual parts of that.
Yeah, so I'm skating over a couple of things here. In the pragmatic case, you would need to preload the data directory with default configurations for /etc/passwd, for example, /etc/group, and so on. You would need to preload known versions of those. In the ideal case, if, instead of doing that just by prepopulating the non-volatile storage, we were to actually build sensible defaults into every single package, then we wouldn't need to do that. That's the case with Android, for example. With Android, you can literally erase /data, and Android, when it reboots, will regenerate things, will repopulate the Dalvik cache and do all the things it needs to do. We're not really quite there yet with, say, Yocto.
Do we have a microphone anywhere? OK, right. It'd be useful if you could ask questions via the microphone there, because then they will be picked up and recorded for posterity. So yeah, next question. Would you mind? You maybe have to turn it on? OK, now it's on.
Would you mind saying a few words on the advantages and disadvantages of symlinking versus using an overlay file system?
So, the differences between symlinking and overlays, unionfs and suchlike. It's kind of a personal thing, maybe. I prefer the idea of symlinking, because I feel it gives me more control. I can see exactly where the symlinks are, and if you happen to modify a file which doesn't have a symlink, you'll get an error. Whereas if you simply plonk a unionfs on top of your root file system, then any write to the root file system will be allowed, and so you don't have so much control over what ends up in the non-volatile state.
Plus you get a dependency.
Well, I don't know. Explain that a little bit. I would say that, yeah. So the one disadvantage of unionfs is you have to build unionfs into your system. Yes.
It's certainly true, yeah. Over there at the back. Again, it'd be useful if you could use a microphone.
Just a question about the symlinks. A group I've worked with have looked at doing this, and they quickly found that there are certain things that use O_NOFOLLOW on their opens of files in /etc. And all of a sudden you have to start worrying about patching stuff to take that out, because if you symlink into your non-volatile partition, the opens start to fail because they won't follow the symlink.
Yeah. So some clever people have identified symlinks as being a potential attack vector and have coded things not to follow them. Do you have any examples of that?
I thought /etc/group was one of them, actually, but I'm not sure. Yeah. It started to come up when we were discussing how much we'd have to bbappend in our layer to try and patch out the stuff. There was no simple answer to that.
We're changing the behavior, we're changing the definition of certain files. We're moving them from one place to another. So there may be cases where we have to add some patches. Yes, I agree.
They actually decided to use overlayfs instead because of that.
So using unionfs would kind of avoid that problem, because things are apparently in the same place. The path names don't change. I feel that the real solution to this is to actually push this out into the packages and make the packages stateless-aware. But we're not there yet. Thank you. Over there?
My question is still about symlinks. We thought about doing symlinks, but we found out that, as you said, it's a potential attack vector and all that. So we implemented this with bind mounts. Do you have any sort of view on using bind mounting from the non-volatile state to the volatile state, with just simple bind mounting via fstab or systemd mount units?
Yes, so we could use a bind mount, which allows us essentially to mount one directory so that it appears in a different file system. Yep, that would certainly work.
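The O_NOFOLLOW behaviour raised in the earlier question can be reproduced in a few lines of Python. This is only an illustration of the flag, not code from any of the packages discussed; the file names are made up, and the exact error number is platform-dependent (ELOOP on Linux).

```python
import errno
import os
import tempfile


def open_nofollow(path):
    """Read a file the way a symlink-wary program might: refuse a symlink
    in the final path component by passing O_NOFOLLOW to open()."""
    fd = os.open(path, os.O_RDONLY | os.O_NOFOLLOW)
    try:
        return os.read(fd, 4096)
    finally:
        os.close(fd)


def demo():
    """Return (contents read directly, errno raised when going via a symlink)."""
    with tempfile.TemporaryDirectory() as tmp:
        real = os.path.join(tmp, "group.data")  # stands in for the relocated file
        link = os.path.join(tmp, "group")       # stands in for the symlink in /etc
        with open(real, "w") as f:
            f.write("root:x:0:\n")
        os.symlink(real, link)
        direct = open_nofollow(real)
        try:
            open_nofollow(link)
            err = None
        except OSError as exc:
            err = exc.errno
        return direct, err
```

Opening the regular file succeeds, while opening the symlink fails, which is the failure mode described for files relocated by symlink. A bind mount or an overlay leaves a regular file at the original path, so the same open would go through.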
I mean, it only works at the directory level, so you'd have to bind the entire directory.
It works on the file level as well. You can bind on the file.
You can bind on the file. OK, I apologize. It works on the file level. Yeah, that works. I've not actually used that technique. I have a feeling that it's higher overhead, but that's just a feeling; I don't know.
I mean, it is a bit more overhead than the symlinks, obviously. But it is traceable. You know where everything goes, so it's easier to reconstruct the state, because you know what you have to bind. You know what you want to bind, so you can redefine it. You can repopulate your non-volatile state quite easily.
OK, that's a good suggestion. Thank you. Person over there?
A question about systemd. I've seen systemd has some support for doing this kind of stateless system. How much of it is there and how much of it works? Is it complete when you get to the end of it, or is it going to leave me halfway with a half solution?
systemd has a nice set of tools which allow you to do some of the things we've been talking about in a simple way. The problem is that it's not for free. You have to make some changes to the packages to use systemd in the right way instead of the scripts and so on. So it's not a solution in itself, but it is a tool which gets you closer to that solution. So, for example, it has some quite neat little... One particular area which I haven't actually covered in the slides is the question of user IDs, which are stored in /etc/passwd. As you install packages, quite often they have to add user IDs for that particular package, for the daemons that are going to be running within that package. So, therefore, /etc/passwd becomes a file that needs to be updated when you install packages, which is a pain.
So systemd has a nice set of mechanisms, actually, which allow /etc/passwd to be a volatile file, and there are some units you can add into systemd to create user names, user IDs, kind of on the fly. So that kind of thing is in systemd. It is improving as the releases go along. So that's certainly part of the solution, but it's not the complete solution. When I wrote the slides, I tried not to make it systemd-specific, because an awful lot of embedded systems are still using other init programs. systemd is by no means ubiquitous for us embedded guys. Okay?
It's on its way. Since we're mentioning Yocto and Poky, there is actually a mechanism to go along with the read-only root file system. I don't remember the name of it right now. It's volatile paths or something like that. It's a recipe where you can add mount points that you want bind-mounted if the rootfs is read-only. By default it will do that for /var/lib, but it can easily be extended to do it for anything. For example, we've used it to bind-mount /etc if the rootfs is read-only.
Okay, and so is that a class, a bbclass?
It's a recipe, so you can just bbappend to it whatever...
Oh, okay, right.
...bind mounts you want and where they should be mounted.
Cool, and the name of that recipe is?
I think it's volatile paths or something like that. It's definitely volatile something.
volatile-binds, yeah, right. volatile-binds. Fantastic, thank you. Okay, anyone else? If not, we'll call it lunchtime. Okay, well, thank you very much, and I hope to see you around.