Hello. Great. By ancient tradition, I introduce this talk. It's been going for what, three, four, five years now? Probably more like four. This is the traditional kernel talk by Ben Hutchings: what's new in the Linux kernel, and what's missing in Debian. Please welcome him.

Thank you. Let me briefly introduce myself. I am a professional software developer, so I get paid to work on code in the day, and then I do the same thing in the evening for Debian. And I've been working on the Linux kernel in both those roles for about eight years now. My day job is mostly extending and fixing drivers and platform code for various new and interesting devices. And in Debian, it's packaging, fixing bugs, backporting security fixes, and generally making the kernel work better for Debian users. I am also on the long-term support team, so I was previously supporting the Squeeze kernel and am now supporting the Wheezy kernel, for five years. And I work on the upstream stable updates for the Linux 3.2 and 3.16 branches used in the Wheezy and Jessie releases; I work on getting those updated and reviewed on kernel.org.

So, as you probably know, the Linux kernel is a fast-moving project. It has a release cycle of typically nine or ten weeks, so there are about five releases every year. And indeed, since last year, we've had another five releases, 4.2 to 4.6 inclusive. So, as ever, we have new kernel features to take advantage of, but in some cases we haven't yet packaged or integrated the user space that's needed to use them.

So firstly, I would like to look at some of the features I talked about last year and see what's changed. Have we taken advantage of them? Have they been developed further upstream? The first thing is the extended Berkeley Packet Filter (eBPF), which is used for filtering all kinds of things now: network traffic, performance events, tracing events. The eBPF virtual machine has been further extended to support some of those uses.
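To give a rough feel for what a packet-filter virtual machine does, here is a toy interpreter in Python. The three-instruction set is invented purely for illustration; the real eBPF ISA has 64-bit registers, maps, and a verifier, and looks nothing like this.

```python
# Toy packet-filter VM, loosely inspired by (e)BPF semantics.
# The "load"/"jeq"/"ret" instruction set is made up for this sketch.

def run_filter(program, packet):
    """Interpret a list of (op, arg) instructions against a packet (bytes).
    Returns the program's verdict: 0 = drop, non-zero = accept."""
    acc = 0
    pc = 0
    while pc < len(program):
        op, arg = program[pc]
        if op == "load":        # load one byte of the packet into the accumulator
            acc = packet[arg]
        elif op == "jeq":       # skip the next instruction if acc == arg
            if acc == arg:
                pc += 1
        elif op == "ret":       # return the verdict and stop
            return arg
        pc += 1
    return 0                    # falling off the end means drop

# Accept only packets whose first byte is 0x45 (IPv4, header length 5):
prog = [
    ("load", 0),
    ("jeq", 0x45),
    ("ret", 0),      # fell through the comparison: drop
    ("ret", 1),      # jumped here: accept
]
```

The real kernel interpreter works the same way in spirit: a loop over instructions that ends when the program returns a verdict, which is why the verifier's job of proving termination and memory safety matters so much.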
We have the bpf() system call that lets you create a program before attaching it to whatever kernel object it's going to be used with, so it has an independent existence, and a file system for keeping track of the data structures (maps) used by those programs. It used to be that a filter program could only return zero or one, meaning keep or drop the packet. Now you can have these sophisticated data structures providing information back to user space.

One of the problems with extended BPF is that it's so powerful that you could leak interesting information about what's going on in the kernel, which could be useful for exploiting security bugs. So last year this was only available to processes running as root. I'm glad to say there's now a verifier that, at least in theory, is supposed to stop that sensitive information leaking, so it's now considered safe to let all users use this facility, and it is available to all users by default. There was some work to let you write eBPF programs in a subset of C and then compile them with Clang; that's now supported by the version of LLVM in unstable. However, most of the extensions aren't being widely used. I couldn't actually find any user-space programs packaged in Debian that use eBPF directly, but I might be mistaken; correct me if I'm wrong.

And then, yeah, so this is a virtual machine, and the virtual machine code is still interpreted by default. There are just-in-time compilers that will translate the programs to native code for certain architectures, but there are some security concerns about those, so they're not enabled by default. Some changes have gone in to address those concerns, so I hope that we will have a JIT enabled by default soon.

OK, so overlayFS is the new, or not so new any more, union file system that's actually included in the mainline kernel. In Debian, that's effectively replaced AUFS, which we don't include any more, but it has a lot of limitations.
So there are a lot of things AUFS can do that overlayFS can't, or couldn't. One of those limitations has been removed since last year: you can now use NFS as the lower layer, and possibly as the upper layer as well. The common way of using a union file system is to have a read-only file system on the lower layer, imported through NFS, and then an upper layer that's stored locally or is just temporary. That now works. So FAI and LTSP can use overlayFS and don't need AUFS.

Atomic mode setting, which provides a better way of changing display configuration, is supported in many more drivers. But it's not actually used; even the upstream versions of X.org and Wayland aren't using it at all. So that's still unfinished.

Live patching: there's been a fair amount of interest from people who would like live patching to happen; some might even pay for it. No one's working on it yet. There has been some progress upstream, in that last year we had a big major blocker, which was needing to know when it's safe to flip the switch and have a process that was running the old kernel code start running the new, patched, fixed code. You can't do that at just any point: the process might be right in the middle of that old code, and very bad things would happen if it ran a mixture of old and new code. So you need some way of detecting the safe point to switch over, and that turns out to require a whole lot of work to understand where each task is, logically, in the kernel code. That's been done. But no one has actually started working on live patching in Debian. It would be nice if someone did; I certainly don't have the time to do it, I'm afraid.

So non-volatile DIMMs (NVDIMMs) are a new kind of super-fast flash storage which has RAM in front of it. They plug into DIMM slots like regular DIMMs, on computers that support this. And one of the neat things you can do with them is directly map that flash-backed memory into a process.
You can memory-map a file and not have any sort of buffering in system memory. That's called DAX. Last year we had support for that in the ext2 and ext4 file systems; now we have it in XFS as well. We also have some kernel infrastructure for configuring NVDIMMs. You can use part of them in a RAM-like mode and part of them in a more disk-like mode, but you need some way to configure that, to partition the memory, and this is at a lower level than the usual partition labels that you would write with, say, fdisk or parted. So you need the ndctl program to configure this, which is not yet packaged. I opened an RFP bug; I'm not going to work on it myself, as I don't have any NVDIMMs or computers I could plug them into. But hopefully someone wants this and will get it packaged.

And there's the file-system-level encryption in ext4. Someone's waving at me; do you want to talk on the microphone?

Q: I wanted it to be supported in partman, so I was already interested in doing that, the ext4 encryption. Now, I believe Ubuntu has had support for setting up eCryptfs in the installer for quite a while. Is that something that could be built on to do the same thing with the ext4 encryption? Or is it too different, do you think?

A: They're similar. ext4 encryption is better.

Q: I realise that it has advantages. I was interested in whether the process of setting it up would be similar enough.

A: Somewhat. So, you want to work on that?

Q: Yeah.

A: Yeah, great.

So Intel MPX, their Memory Protection Extensions: the main part of that is being able to do array bounds checking very cheaply in hardware, which could allow generating safer code from C, and perhaps more efficient native code from Java and other languages that require bounds checking. We have support for that in GCC and glibc now.
I don't know whether anything else ought to be done there, whether there's somewhere we can actually start generating that code. I suspect not, because the new instructions are just going to crash on older processors. So that's going to be difficult to take advantage of in Debian itself, as opposed to in custom-built programs running on Debian. But we've got the infrastructure there, at least.

Batched network transmit is a nice little optimisation for transmitting at high packet rates. That needs support in each driver, and it's supported in more drivers now. Nothing needs to be done in user space, so that's all good.

And then, year-2038 compliance on the 32-bit architectures. I'm sure most of us will have got rid of our 32-bit systems by 2038, but there are probably going to be embedded systems made in the next few years, using those 32-bit architectures for various reasons, that will still be in use in 2038. Sorry, I'll go back a bit: the usual Unix representation of time on a 32-bit architecture is a 32-bit signed value, and the largest time value that can represent is some time in early 2038. So all kinds of time-handling code is going to break beyond that point if it's still using 32-bit time values. There's been a fair amount of discussion and some ideas tried out, but nothing has actually changed yet. I really hope that's going to get fixed in the next year or two. And I believe it is going to require changes at the application level, or at least in the build process: in the same way that you have to opt in to large file support, you will probably have to opt in to large time support. But as it is, there is no new API to opt in to, so nothing for you to learn to do yet.

So, on to the new features that have been added in the last year. Control groups, or cgroups for short, are the mechanism for limiting resource usage in containers, or even smaller groups of processes.
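To make the year-2038 limit mentioned above concrete, here's the arithmetic in Python, using Python's arbitrary-precision integers to model a 32-bit signed time_t:

```python
from datetime import datetime, timezone

# Largest value a signed 32-bit time_t can hold:
TIME_T_MAX = 2**31 - 1          # 2147483647 seconds since the epoch

last_moment = datetime.fromtimestamp(TIME_T_MAX, tz=timezone.utc)
print(last_moment)              # 2038-01-19 03:14:07+00:00

# One second later, a 32-bit signed counter wraps to a large negative
# number, i.e. a date back in 1901 -- this is the year-2038 problem.
wrapped = (TIME_T_MAX + 1) - 2**32
print(datetime.fromtimestamp(wrapped, tz=timezone.utc).year)   # 1901
```

So the failure mode isn't just a wrong date: signed wraparound means timers, timeouts and file timestamps suddenly appear to be over a century in the past.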
One of the things that we've not been able to do well with containers is to limit what they're doing with writeback. I'm sure you know that when a process writes to a file, it doesn't usually have to wait for the data to be physically written to disk; it goes into a buffer and gets written back later. And sometimes it's possible to build up a large amount of write buffers that take a long time to flush out to disk, and that's bad. It's particularly bad if you end up using a huge amount of your memory for write buffers. One of the problems with controlling this through cgroups has been that, by the time the I/O is done, it no longer has anything to do with the process that generated it. There was a kind of global control of writeback, but that's about it. The block I/O controller couldn't share out writeback bandwidth between different processes, because it didn't know about the processes. And the memory controller couldn't throttle the processes that were writing more than their fair share, because it didn't know anything about what was going on with block I/O. There was, I believe, some code in the memory controller that attempted to do this, but it didn't have enough information to do it properly. Now we have a writeback controller that is properly integrated with the memory and block I/O subsystems, and that should do a much better job. Unfortunately, it needs some help from the file system; that's supported in Btrfs and ext4, but not yet in XFS. So that's something to bear in mind when setting up container hosts, perhaps.

And there's another new controller, the PIDs controller, which does something fairly simple: it limits the number of PIDs you can have in a control group, which essentially means it limits the number of threads and processes that can be running in there. And that's important mostly because we limit the number of PIDs in a namespace to 32,767. It's a tiny number.
We could actually make it much larger, but there are some compatibility reasons why that isn't done by default.

So, another feature that's somewhat useful for dealing with containers, but more immediately for dealing with virtual machines. This needs a bit of an explanation. All memory that's not mapped from a file is called anonymous, and that can be written out to swap. Basically, all the memory used by a virtual machine is anonymous, so if you migrate the VM, all of that has to be moved to the new VM host, because it's not in shared storage. (A virtual disk is different: if you're going to do live migration of a VM, you'll typically need to have its disk in shared storage that can be accessed from both the original and the new VM host.) So all that anonymous memory needs to be copied somehow, and there are two broad families of strategies for doing that.

The first, called pre-copy, is what is normally done now. You start copying pages while the VM is still running, and copy all the data you can; at the same time, you'll need to re-copy any pages that the VM modifies. At some point you have to detect that it's dirtying pages faster than you can copy them; at that point, you freeze it, copy the remaining pages, and resume it on the new VM host. That can actually take a very long time if the VM is writing across a large amount of memory.

The alternative is post-copy: you freeze the VM right away, copy a minimal set of its memory, start it again on the new host, and then copy the remaining memory, prioritising the pages that it needs right away. But in order to do that, you need some way of trapping the page faults for those missing pages and then, instead of trying to read them in from the swap file (because those pages aren't in the swap file on the destination host), intercepting the faults and copying the pages from the original host.
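The convergence problem in the pre-copy scheme described above can be sketched with a small simulation. This is a made-up model measured in pages per round, not anything from a real hypervisor:

```python
def pre_copy_rounds(pages, dirty_per_round, copy_per_round, max_rounds=50):
    """Simulate pre-copy migration in units of pages per round.
    Returns (rounds_while_running, pages_left_for_stop_and_copy)."""
    remaining = pages
    rounds = 0
    # Keep copying while the guest runs, until the remainder is small
    # enough to transfer during a brief freeze (or we give up).
    while remaining > copy_per_round and rounds < max_rounds:
        remaining -= copy_per_round      # pages sent this round
        remaining += dirty_per_round     # pages the guest re-dirtied meanwhile
        rounds += 1
    return rounds, remaining

# Guest dirties 10 pages per round, the link carries 100: converges.
print(pre_copy_rounds(1000, 10, 100))    # (10, 100)

# Guest dirties pages as fast as we can copy them: never converges,
# which is exactly the condition a hypervisor has to detect.
print(pre_copy_rounds(1000, 100, 100))   # (50, 1000)
```

The second case shows why pre-copy needs a give-up heuristic, and why post-copy, which has a fixed, small downtime by construction, is attractive for write-heavy guests.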
So the userfaultfd system call, and some ioctls that you can run on that file descriptor, are the mechanism by which a virtual machine manager can implement post-copy. And QEMU/KVM does now use this. I don't know whether that's done by default or whether you have to opt into it, but it's there, I believe, in unstable. CRIU, the Checkpoint/Restore In Userspace project, which can do live migration of containers, will probably use this in the future, but I believe it needs a few more extensions, because there are slightly different considerations; there are more complexities when copying a group of processes rather than a virtual machine, which is a single process.

And then we have lightweight tunnels. This is a networking feature. Currently, if you create a network tunnel, you usually have to create a tunnel device, which holds the configuration of where that tunnel goes to and what kind of encapsulation is used in the tunnel, and then, separately, you have to create a route on top of that. So the configuration is spread across two different objects in the kernel. Now, the kernel is actually pretty good at handling thousands and thousands of routes, but devices are relatively heavyweight, and the kernel and user-space management tools don't deal so well with having lots and lots of devices. And in certain virtualisation hosting configurations, you actually want to have lots and lots of tunnels. The new lightweight tunnels allow you to configure the tunnel as part of the route: if the tunnel doesn't need a whole lot of state, then all you need to do is specify the destination and the encapsulation as part of the route. There are several different encapsulations supported with this. I don't think you can use this for things like VPNs, but then you wouldn't need thousands of VPNs, hopefully. This needs a newer version of the iproute2 utility to configure it, so Debian is a bit behind there. I might fix that myself.
But if anyone else is interested in this, I believe iproute2 is maintained in collab-maint, so it's open to anyone to update it.

So ARM, or armhf specifically, got a nice bit of security mitigation. There are specific safe functions that the kernel is supposed to use whenever it copies data between user space and kernel space, and if it doesn't do that, it's a bug, possibly a bug with quite serious security impact. This is a known problem; there are mitigations against it in grsecurity, and there are also hardware mitigations. Intel has implemented something called Supervisor Mode Access Prevention (SMAP), and ARM has done it as Privileged Access Never (PAN). The kernel can take advantage of those features, which makes that class of bug less serious: you'll get an oops; possibly the kernel will crash completely, possibly just the offending process will be killed, but it's still not as bad as someone being able to take over the kernel completely. Now, the Debian armhf architecture is built for ARMv7, which does not have this feature; I think it was added in version 8.1 of the architecture, so not even all ARM64 processors have it. However, there was a previously unused feature of memory management domains which turns out to be usable to do the same thing, or something very similar, with a bit of software configuration. So that's enabled by default, and Debian ARM systems are just a little bit more secure.

So, reproducible builds. It turns out that some people had been thinking about this upstream, and if you just set the environment variable KBUILD_BUILD_TIMESTAMP, the kernel build will use that instead of the current time, and you get a perfectly reproducible kernel image and modules. What hadn't been covered was documentation. Mere differences in the documentation from build to build probably don't mean you should be worried about anything being exploited through wrong documentation.
But anyway, several people worked on fixing the reproducibility of the documentation in Debian, and I submitted those changes upstream. They've all been accepted, so that's fixed.

Raspberry Pi. For anyone who hasn't heard of the Raspberry Pi, I don't know where you've been, but it's a series of low-cost development boards that use the VideoCore SoCs made by Broadcom. These are basically meant for doing video and image processing using the VideoCore architecture, which is proprietary, but all of these SoCs also contain one or more ARM cores. The default OS for these dev boards has been the Raspbian Debian derivative. There were a couple of reasons for not using stock Debian, and one of those has been that the mainline kernel didn't support these SoCs; there were a lot of extra drivers and platform code that needed to go into mainline. Thankfully, that's been done over the last four years. The graphics driver has been rewritten to run on the ARM core rather than on the VideoCore, and the Raspberry Pi 2 is supported in Debian testing and unstable. The Pi 1 can't be supported because it only has an ARMv6 processor; or rather, we could support it with armel, but I don't think anyone would be very interested in that. And the Pi 3 has a 64-bit ARM, but the firmware at the moment tries to boot a 32-bit kernel, so I don't think we'll be very interested in that until we can run proper 64-bit code on it.

So, the Linux Foundation has started the Kernel Self Protection Project, which aims to add some security mitigations to the kernel. Because it's written in the C language, there are always going to be bugs that allow memory corruption and potentially compromise the security of the kernel. There's not much we can do to stop that, other than rewriting it in a safe language, which is not going to happen any time soon; and if it does happen, that won't be Linux any more. But it would be an interesting project.
There is actually a project to write a new kernel in Rust. It should be very interesting; maybe we'll have a good Debian port to it someday. Anyway, working with Linux, we need some mitigations against those bugs that will inevitably occur. The PaX and grsecurity projects have done a lot of work in this area, but unfortunately they've had quite strained relations with the mainline kernel developers, so there are a lot of features that have been developed separately and have never gone into mainline Linux. This project is an attempt to get as many of those features as possible, or at least the most important ones, brought into mainline Linux. It's making somewhat slow progress, but there are a couple of things that have already gone in.

Firstly, reducing the amount of writable data. There are a lot of data structures in the kernel that contain function pointers; if you can find a bug that lets you write to one of those, then you can redirect control flow, and that, obviously, would be a bad thing. So more architectures are now write-protecting all the data that doesn't need to be written to, and there's a new option to make more data write-protectable: data that's not statically initialised, but just needs a little bit of code to initialise it once, can be made read-only afterwards.

And then there's the page poisoning feature. This was already available as a debug feature, but it was bundled with some other debugging checks, so it had quite a significant performance impact. There's now an option to enable just what's necessary for security mitigation, which doesn't slow things down as much, and that might be something we could turn on in the Debian kernel configuration, after thinking about it and maybe doing some measurements.

And some of the hardening features that have been implemented by those projects have used GCC plugins.
A GCC plugin can implement new extensions to the C language and then systematically do interesting things with the generated code, or make logical changes to the code before it gets translated into machine code. The plugin infrastructure will be included in Linux 4.7, and after that, hopefully, we'll get some more interesting hardening features building on it in later releases.

The real-time Linux project is another thing that's unfortunately outside of mainline Linux, though it's trying to keep close to mainline. It adds a compile-time option to limit scheduling latency, which is basically what real-time is about. It's important to note that some people think running real-time Linux means they get lower latency, and that's not generally the case; it actually tends to increase the average latency. The important thing about real-time is that the maximum latency is bounded and there's a lot less jitter, so the latency of responding to events is predictable. So yes, it's a long-lived fork, but a lot of the changes that have been made on that fork have gone into mainline. The difference from mainline is something like 300 patches, which sounds like a lot, but a fair number of them are tiny bug fixes to make drivers work with real-time. It's had intermittent funding, I would say; currently the Linux Foundation is paying Thomas Gleixner, who's been one of the main real-time developers for years, and he's now, I think, paid to work on it full time. A couple of changes from real-time have gone into mainline recently. The timer wheel is the structure that keeps track of all the timeouts the kernel looks after (there are huge numbers of timeouts, for things like sockets and so on), and basically this change is about limiting the amount of time that any operation on it will take.
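As a very rough sketch of the data structure being discussed, here is a minimal single-level timer wheel in Python. The kernel's implementation is hierarchical and far more subtle; this only shows the core idea, which is that adding a timer and advancing the clock are both cheap because timeouts are hashed into slots by expiry time:

```python
class TimerWheel:
    """Minimal single-level timer wheel: each slot holds the timers
    that expire when the wheel's cursor reaches it."""

    def __init__(self, slots=8):
        self.slots = [[] for _ in range(slots)]
        self.now = 0

    def add(self, delay, callback):
        # O(1): hash the expiry time into a slot.  (A real timer wheel
        # handles long delays with coarser-grained levels; this toy one
        # simply requires delay < number of slots.)
        assert 0 < delay < len(self.slots)
        self.slots[(self.now + delay) % len(self.slots)].append(callback)

    def tick(self):
        # Advance one tick and fire everything in the current slot:
        # O(1) plus the number of timers that actually expire.
        self.now += 1
        slot = self.now % len(self.slots)
        expired = self.slots[slot]
        self.slots[slot] = []
        for cb in expired:
            cb()

fired = []
w = TimerWheel()
w.add(2, lambda: fired.append("t+2"))
w.add(5, lambda: fired.append("t+5"))
for _ in range(5):
    w.tick()
print(fired)   # ['t+2', 't+5']
```

In the kernel the wheel is hierarchical, so long timeouts land in coarser-grained levels instead of failing the assert above; bounding the work done per operation is what matters for real-time.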
CPU hotplug is actually used for things like suspend and power management, not just for turning off the entire system or for physically unplugging a CPU, so it isn't such an unusual operation. As it is, it can take a very long time, with essentially no bound on the time to complete, and that was a problem for real-time Linux. So it's gradually being reworked, moving towards maybe someday being acceptable for mainline real-time use.

And finally, let's quickly run through the changes that we've made to the packaging in the last year. The binary packages are reproducible, at least on the reproducible-builds infrastructure, where they've modified a few packages. There's a stage 1 build profile, which can be used for architecture bootstrapping; that just builds the linux-libc-dev package. The linux-tools source package was separate, mostly because it didn't support cross-building, whereas the linux source package did. I've now folded those together and added a build profile so that you can do a cross-build of linux; in fact, cross-building of most of the user-space binary packages has also been implemented.

The linux source package has a whole lot of configuration to define which binary packages will be produced and to adjust the kernel configuration in complicated ways. I've now added some options there so you can turn off generating various sorts of binaries, which is useful if you want to build a derivative like linux-grsec that's basically the same as the linux packaging with a couple of switches turned off, so that it doesn't build binaries that would conflict with those built by the linux source package. That might also be useful for derivative distributions if they have reasons for building some extra configurations.

I've done a lot of preparation for supporting Secure Boot.
At the moment the kernel images get signed, although only with my key, not with the archive key, and we have the securelevel patches that block off other ways of inserting unsigned code into the kernel. We're building the user-space lockdep and cpupower packages. And I've changed the way that the drivers are packaged for inclusion in the installer. Instead of having to list all the individual drivers, and maybe leaving some of them out, so that you'd find that, oh, this kernel runs on my machine, but the drivers I need aren't there at installation time, so how am I supposed to install it? That's now hopefully fixed, because the drivers are included by directory name, and the drivers are organised in the kernel source by directory, so all network drivers are together, for example.

And we've unfortunately had to drop support, well, not had to, but we have dropped support for some older processors: 586 isn't supported, the minimum is a 686, for some definition of 686, which I'm not going to go into right now. We're just about to drop support for MIPS R1 processors, which unfortunately means the Loongson 2E and 2F are gone. And I rewrote the horribly complicated Perl maintainer scripts. Well, actually we put a lot of the complexity into a new Perl script in linux-base, but it's still much nicer, I claim; the scripts that are in the linux source package are pretty simple now.

And that's it. Well, there were a lot of other changes that were too small to mention, but that's all I thought was worth noting today. So, any questions? Microphone?

Q: Hello. I missed the beginning, so I don't know if you had a word on this, but: initramfs-tools. What are the plans for it? Moving to dracut, or keeping it? What is the future there?

A: That's not really in the scope of this talk, but I think I'm happier about maintaining initramfs-tools now.
I would also be quite happy if dracut could replace it, but I think initramfs-tools is in fairly good shape now, and there's no urgent need to switch over. Though if we could share dracut with other distributions, that might be good. Unfortunately, I couldn't go to the dracut talk earlier; I would have liked to, but I couldn't. So I can't give a definite answer to your question.