Hello, everyone, and welcome to the annual kernel talk, back by popular demand after last year's hiatus and incredibly poor-quality substitute. Welcome Ben Hutchings, with "What's new in the Debian kernel?"

First, a little bit about myself. I have been working on the Linux kernel and related code, like initramfs and firmware packaging, for Debian and in my paid jobs, for over 10 years now. I'm a member of the Debian kernel team, doing about half of the packaging work now, counting by commits, and I'm also on the long-term support team, doing mostly the kernel updates for the long-term support suites. I also maintain the Linux 3.16 stable update series on kernel.org, and used to do the same for 3.2. Basically, I look after the versions that we are using in older releases, and I'm maintaining a stable branch based on Linux 4.4 for the Civil Infrastructure Platform.

As you might know, Linux releases early and often, about five times a year in fact, with stable updates containing bug fixes even more often than that, every week or so. Sometimes when features appear in the kernel they aren't quite ready; there are bits missing, which may be filled in in a later release. Sometimes they also require support from user-space packages and other changes in distributions.

I did actually give this talk last year, just not at DebConf; I gave it in Cambridge, and at that time I covered changes up to Linux 4.14. Since then there have been another three releases, and Linux 4.18 is, I think, likely to be out next Sunday. I'm going to talk about the changes in those four releases. There are lots of new kernel features coming, and some of them are going to need changes elsewhere in Debian. The kernel team isn't going to be doing all of these things; it needs some help from other developers.

First, a recap of the features that I talked about last year and what's changed there. I talked about schedutil, which is a CPU frequency governor.
This is responsible for part of the power management for your CPU, and schedutil, unlike older governors, is integrated with the Linux task scheduler, which in principle allows it to make smarter decisions. I would hope that at some point we could make this the default governor, but we haven't done that yet. So, sadly, no change there.

I talked about zoned block device support, which is a necessary change for very high-capacity hard drives using what's called shingled magnetic recording (SMR). There are utilities for controlling that, called dm-zoned-tools. I opened a request-for-package bug for that last year, but I haven't seen any movement there. So, if anyone's interested in those very high-capacity SMR drives, please work on that.

I talked about the block layer being replaced. The block layer is a core part of kernel code that manages most access to mass storage: hard drives, SSDs, USB storage devices, memory cards, and so on. There's been a major overhaul of the interface between the block layer and drivers, which should allow for higher performance and other good changes. The changeover has been gradual, because it's simply not possible to convert all drivers at once. The SCSI driver supports both old and new modes of operation. We changed the default mode in Debian, I think with either 4.16 or 4.17, and it doesn't look like anyone's complained about that, so that went well. The MMC subsystem, which is used for most embedded flash and SD cards, has finally been switched over to the new block layer. MD RAID is, I think, the last big kind of storage driver that hasn't been switched over.

There was a little bit of movement on ARM graphics drivers. The kernel driver for ARM Mali GPUs is available under the GPL; in fact, I think this had already happened by the time of my last talk, but I didn't notice. However, the user-space support for this GPU is proprietary.
As a result, that driver has not been accepted into the Linux kernel itself, and it's only packaged in the contrib section.

The statx system call, which is useful for some applications, is now supported by the GNU C library since version 2.28, and a few applications have started using it; it can improve performance for some applications.

So, let's get on to the actual new features. RISC-V is a new-ish instruction set architecture. The RISC-V project started in 2010, but it's only in the last few years that they finalised what the instruction set will look like. It's maintained by an industry consortium, and it's more or less open. The documentation is all public, and, as I understand it, there are no licence fees and no essential patents that you would need to license to implement it. The actual implementations themselves may well be proprietary; in fact, I think most of them are. It supports 32-bit and 64-bit modes, and has room to extend to 128-bit. At this point, adding new 32-bit architectures seems kind of silly, so in Debian we're only looking at the 64-bit version, as riscv64. There are also lots of optional features to allow for that scalability. However, there's a common feature set which has been specified for all processors that are meant to run general-purpose operating systems like Linux. Basic support for that was added in Linux 4.15. There's been some more work since then to add a console driver, so you can actually read output from the kernel, and to support performance monitoring and function tracing.

I've talked several times about security hardening features, which are not really user-visible features, but they help to defend the kernel and reduce the impact of bugs that could otherwise have a very serious security impact. The timer_list structure in the kernel is used to track timeouts: work that's been delayed, or work that needs to be done regularly where the exact timing isn't very important.
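Going back to statx for a moment: a minimal sketch of calling it directly, assuming glibc 2.28 or later and a kernel new enough to provide the system call (4.11+). The helper name is mine, for illustration only.

```c
#define _GNU_SOURCE
#include <sys/stat.h>   /* statx(), struct statx (glibc >= 2.28) */
#include <fcntl.h>      /* AT_FDCWD */

/* Hypothetical helper: fetch only the file size via statx().
 * Asking for STATX_SIZE alone lets the kernel skip filling in
 * fields the caller does not need, which is where the potential
 * performance win of statx() over stat() comes from. */
static long long size_via_statx(const char *path)
{
    struct statx stx;

    if (statx(AT_FDCWD, path, 0, STATX_SIZE, &stx) != 0)
        return -1;
    return (long long)stx.stx_size;
}
```

The mask argument is the key difference from plain stat(): callers declare up front which attributes they care about, and the result's stx_mask reports which ones were actually filled in.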
In the past, this has been a very attractive target for buffer overflows, because if an attacker can find a buffer overflow bug that lets them overwrite a timer_list, they can choose where the code flow is going to go by overwriting the function pointer, and they can also pass an argument to that code. The change has been to always pass the timeout function a pointer to the timer_list structure itself instead, which reduces its usefulness to attackers. In future, it's hoped that the kernel can make use of control flow integrity, which would do a kind of runtime type checking: it would be impossible to replace the function pointer with a pointer to anything that wasn't meant to be a timeout function. Or rather, you could replace it, but the result would only be that the kernel would crash, rather than allowing a privilege escalation.

Usercopy is the code that's used to copy data from user memory to kernel memory, or back, during a system call. This is obviously very security-sensitive; we don't want it to be possible to overwrite arbitrary bits of kernel memory. In Linux 3.7, some range checking was added to this, which would prevent buffer overflows that go beyond the scope of one stack frame or one heap allocation. But there was no finer limitation: if you had a structure with an array in it, and there were system calls that would read and write that array, and there was also some sensitive data after it, the range checking would not prevent a buffer overflow from the array into the following data in the same structure. Now there's an option: if kernel code creates a private heap for allocating a particular type of structure, it can define a whitelist, a sub-part of that structure for which usercopy is allowed, and the range checks will then prevent an overflow into other, sensitive parts of the structure.
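The timer change can be illustrated with a user-space sketch. This is a mock, not the kernel's actual struct timer_list API: the point is that the callback now receives a pointer to the timer itself and recovers its owning object with container_of, so there is no separate attacker-writable argument stored next to the function pointer.

```c
#include <stddef.h>

/* Mock of the kernel pattern; names are illustrative, not kernel API. */
struct mock_timer {
    void (*function)(struct mock_timer *t);  /* new-style callback */
};

struct connection {
    int retries;
    struct mock_timer timeout;   /* timer embedded in its owner */
};

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* The callback derives its context from the timer pointer itself,
 * instead of being handed an arbitrary stored argument. */
static void connection_timeout(struct mock_timer *t)
{
    struct connection *conn = container_of(t, struct connection, timeout);
    conn->retries++;
}

/* Stand-in for the timer subsystem expiring a timer. */
static void fire(struct mock_timer *t)
{
    t->function(t);
}
```

In the old scheme the timer carried both a function pointer and an unsigned long argument; overwriting both gave an attacker a call with a chosen argument. Here the only thing passed is the timer's own address.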
So now I'm going to take quite a long digression to talk about something that came up at the beginning of this year. It's not really a kernel feature, it's not even a kernel bug fix; it's a common flaw in a lot of processors. You've probably seen these logos and the talk about Meltdown and Spectre. These are problems that have arisen from speculative execution in CPUs.

Speculative execution is a common implementation technique that allows processors to avoid waiting for slow operations. These days, a processor can normally run hundreds of instructions in the time it takes to read data from main memory. So in the case where an instruction needs to read from main memory, because the data is not in the cache, you really don't want all the instructions following it to have to wait. The idea of speculative execution is that you predict some of the results, and then later, when you know the real results, you decide whether to keep the result of that speculative execution or discard it. The results are all buffered, and if the prediction was wrong, in theory no one is the wiser.

However, that kind of misprediction can result in bypassing access control, and an attacker may be able to control where the speculative execution jumps. Even though the results are discarded, they can leave a trace in the memory caches. So after a mispredicted speculative execution, an attacker may be able to time how long it takes to access a particular area of memory, one that may or may not have been accessed during the speculative execution. From that they can tell what happened, and they can find out information that they wouldn't otherwise have been able to access: for example, encryption keys that exist in the kernel or in another process. Fixing this properly is going to require quite big changes to processor design.
For the moment, the best we can do is to mitigate it with microcode updates and with updates to software, in particular the kernel. The kernel is particularly important here because, of course, it's the most privileged code, or some of the most privileged code, running on the processor, and therefore a target of attack; it's also the piece of code that's responsible for configuring the CPU. So I'll just briefly go through the issues that we've seen.

Spectre variant 1 was described as bounds check bypass. What happens here is that a test for whether an array index is within the bounds of the array is predicted to be true, because an index out of range is in fact extremely rare, and that prediction usually turns out to be right. However, execution can continue speculatively with the out-of-range index in some cases, and that can leak information about data outside the bounds of the array. A general solution for this could be done in a compiler, but would have quite high performance costs. So what's being done in Linux is to add a mitigation to specific array lookups that are thought to be sensitive. That's an ongoing effort.

Spectre variant 2 is described as branch target injection. The idea there is that you can train the indirect branch predictor in the CPU so that a branch from a particular address will go to another particular address, and that works because the predictor's lookup based on the code address isn't precise. So you can train it by performing jumps in user-space code, and it will apply what's been learned to another piece of code, in the kernel, at an address that has some of its bits the same. Most CPUs on Debian release architectures were affected by this and by the first variant. We have mitigations on x86, POWER and System z, which involve disabling or defeating this branch predictor. That's done in the kernel only at present, and possibly also in Xen.
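The per-lookup mitigation for variant 1 works roughly like the kernel's array_index_nospec(): the index is clamped with a branchless mask, so even under misprediction the CPU never uses an out-of-range index. Here is a user-space sketch of the idea; the real kernel version has per-architecture refinements, and this one assumes arithmetic right shift of signed values (true on all mainstream compilers) and size no larger than LONG_MAX.

```c
/* Branchless mask modelled on the kernel's generic
 * array_index_mask_nospec(): yields ~0UL when idx < size,
 * 0UL otherwise, with no conditional branch the CPU could
 * mispredict. */
static unsigned long index_mask_nospec(unsigned long idx, unsigned long size)
{
    /* If idx < size, both idx and (size - 1 - idx) have a clear
     * sign bit; if idx >= size, the subtraction wraps and sets it. */
    long x = (long)(idx | (size - 1UL - idx));

    return (unsigned long)(~x >> (sizeof(long) * 8 - 1));
}

/* Clamp the index before the load: an out-of-range index collapses
 * to 0, so speculation can only ever touch in-bounds data. */
static unsigned long safe_lookup(const unsigned long *arr,
                                 unsigned long n, unsigned long idx)
{
    idx &= index_mask_nospec(idx, n);
    return arr[idx];
}
```

In the kernel this masking is applied after the architectural bounds check has passed; its job is purely to neutralise the speculative path.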
There are some additional mitigations on x86 that rely on new features added in microcode updates, but those are only available if you enable the non-free section of the archive, or if you're lucky enough to get a BIOS or UEFI update from your hardware vendor.

Meltdown was probably the most serious of these. It affected a smaller set of CPUs: Intel x86, some 64-bit ARM chips, and most of the recent IBM POWER CPUs. On amd64 and arm64 it's been mitigated by page table isolation, which means that kernel memory is no longer even mapped into user-space processes, so there's a switch of the virtual memory page tables whenever a system call or interrupt happens; on POWER the mitigation has been to flush part of the CPU memory cache. This really slows down system calls and interrupt handling, and it's had quite a significant performance impact for some applications. There isn't currently a mitigation for i386; unfortunately, there are some differences in the way interrupt handling is done between 32-bit and 64-bit x86 processors, which means the mitigation on amd64 doesn't transfer across.

And there are yet more issues. Spectre-NG variant 4, called Speculative Store Bypass, turns out to be partly an issue for the kernel but mostly an issue for sandboxing. The issue here is that if you have code that writes to a particular address and then, a few instructions later, loads from the same address, but the addresses are calculated in different ways, the CPU might predict that those are actually different addresses, and therefore the later read will speculatively use the old value stored at that address. In some circumstances that results in leaking sensitive information. The mitigations that were already implemented for Spectre variants 1 and 2 have mostly dealt with that, but not completely. There are some additional mitigations available on x86 now, but they rely on new microcode and they have quite a substantial performance impact.
Currently those are applied by default in processes that use sandboxing, but there is a kernel command line option to adjust that.

There was also a leak of floating-point and vector register contents, which affected only Intel x86 CPUs as far as we know, and that is significant because it can leak encryption keys from one process to another. Thankfully it didn't affect the most recent Intel CPUs, because they have a feature that meant Linux didn't use the optimisation that made this a problem. The mitigation was to turn off that optimisation, which was relatively easy. And there are a couple of new variants of Spectre that I won't go into.

So, the year 2038 problem is something similar to the year 2000 problem. If you are using a 32-bit Linux system, then the time is represented using a 32-bit signed number of seconds since 1970. That will reach its highest possible value in early 2038, then wrap round to a negative value, and everything related to time will go horribly, horribly wrong. Hopefully most of us won't be using 32-bit systems in 2038; however, there will be embedded systems built in the very near future that will need to carry on running past 2038. So we need changes to the Linux kernel APIs and the C library that will allow 32-bit systems to use a 64-bit count of seconds. As far as I can see, all or very nearly all of the kernel-internal interfaces have been updated to use 64-bit time, even on 32-bit architectures, and the 64-bit implementations of most of the time-related system calls can now be built in 32-bit configurations. However, so far no architecture has opted in to building those system calls and assigning numbers to them, so it's not quite in a state where you can actually take advantage of this. Also, the GNU C library still doesn't support this.
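The wrap point is easy to check: 2^31 - 1 seconds after the 1970 epoch lands in January 2038. A quick sketch, using a platform with 64-bit time_t (e.g. amd64) so that INT32_MAX itself is still representable:

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>
#include <stdint.h>

/* Returns the UTC year in which a 32-bit signed time_t overflows.
 * Assumes time_t is 64-bit in this build, so we can inspect the
 * moment a 32-bit counter would max out. */
static int year_of_32bit_overflow(void)
{
    time_t t = (time_t)INT32_MAX;   /* 2147483647 seconds since 1970 */
    struct tm tm;

    gmtime_r(&t, &tm);
    return tm.tm_year + 1900;       /* tm_year counts from 1900 */
}
```

One second after that moment, a 32-bit signed counter becomes negative, i.e. a date before 1970, which is what makes everything time-related go wrong at once.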
It will need to be backwards binary-compatible with old applications that use 32-bit time, so it will be necessary for it to support both 32-bit and 64-bit time interfaces at the same time, which needs quite a lot of intricate changes, and the review of that is going rather slowly. So it looks like this is not going to be ready for Buster, but maybe for Bookworm.

Now, something that I've run into repeatedly is that 32-bit programs on Debian are not being built with large file support. There was a similar issue in the past, where the file access interfaces used 32-bit offsets and sizes, which meant you could not access files larger than 2 gigabytes, which seems absolutely ridiculous today. So something called the Large File Summit defined new 64-bit interfaces for file access, but they're opt-in. There are still binaries being built for i386 and other 32-bit architectures in Debian that can't access files larger than 2 gigabytes. They also don't work on very large XFS file systems, which can have inode numbers larger than 32 bits. So I do wonder whether that should be enabled by default. The connection to 64-bit time is that some interfaces, like stat, deal with both file sizes and times, and the C library developers do not want to implement four different versions of those. Therefore, if you want 64-bit time, you also need to support 64-bit file sizes. So I wonder whether it would make sense to change the defaults in dpkg-buildflags to enable both large file support and 64-bit time support. That's definitely something to do early in a release cycle, because it's probably going to shake out quite a few bugs.

Now, one of the important features that the kernel has for containerisation is user namespaces. With user namespaces, by default, any user can create their own namespace, and they'll be the root user within it, which means they can control everything in that namespace to a great degree, but in principle they still don't gain any privileges outside it.
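Returning to large file support for a moment: the 2 GB limit falls straight out of the types. A signed 32-bit off_t, which is what 32-bit builds get without -D_FILE_OFFSET_BITS=64, tops out at 2^31 - 1 bytes. A small check makes that concrete (the helper is mine, for illustration):

```c
#include <stdint.h>

/* Can a file of this size be represented by a 32-bit signed off_t,
 * as used by 32-bit builds without -D_FILE_OFFSET_BITS=64? */
static int fits_in_32bit_off_t(int64_t size)
{
    return size <= (int64_t)INT32_MAX;   /* 2^31 - 1 = 2147483647 bytes */
}
```

A 3 GB file fails the check, so a non-LFS binary can't even report its size correctly, let alone seek within it; building with -D_FILE_OFFSET_BITS=64 makes off_t 64-bit and the limit disappears.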
In practice, this has exposed a lot of security bugs, so although user namespaces are available in Debian kernel builds, the ability of unprivileged users to create their own is disabled by default. Most Linux file systems are not robust: if you give them a carefully constructed disk image, you can exploit bugs to cause buffer overflows and other kinds of problems in the file system code, and then you can do whatever you want in the kernel. For this reason, the mounting of file systems is restricted in user namespaces: you can only mount specific types of file system, and up until now that's mostly been virtual file systems like proc, sysfs, and so on.

Recently there has been some work to improve FUSE, the file-system-in-user-space code in the kernel. The idea of FUSE is that the main file system implementation runs as a user-space server, and the kernel just takes care of packaging up requests from other processes and sending them over to this file system server process. So there's relatively little code running in the kernel, and in theory that code is now quite robust. So at this point, the root user in a user namespace is allowed to mount file systems using FUSE. In theory, any file system you want could be implemented through FUSE. I had thought I read about a project to allow regular Linux kernel file systems to be rebuilt as FUSE servers, but I can't find any trace of that any more, so either I imagined it or that project has stalled. I do wonder whether it would make sense to start packaging more FUSE file systems, though. This might also be useful to make auto-mounting safer: if we could run the file system code for hot-plugged devices in user space, we could take advantage of the various sandboxing features that are available in user space, and I think that would reduce the danger from hot-plugged untrusted devices.

Another change, back in Linux 4.15, is to SATA link power management.
SATA is the high-speed serial link that is used to connect most hard drives, SSDs and optical disc drives. High-speed serial links tend to draw quite a bit of power as long as they're running, even when there's no data to be transferred, so generally it's important to have link power management, which can switch them into a lower-power mode when there hasn't been any data to transfer for a while. That not only saves power on the SATA controller, but can also save quite a bit of power on some Intel processors. Something that's been tried before is aggressive link power management; I'm not quite sure what makes it aggressive, but it can give much higher power savings. It turns out that some drives simply don't implement it correctly, and that results in data loss, so it has never been enabled in mainline Linux. What changed in 4.15 is the generic SATA controller driver gaining support for a mode where it sets the link power management settings similarly to what Windows uses with Intel SATA controllers. That can save more power than the previous defaults, and it's believed to be well tested, because it's what Windows uses on most laptops. We enabled that power saving mode in the Debian kernel starting in 4.17. It does look like there have been some boot regressions related to that, though, so probably some more work is needed there, maybe to blacklist some drives that don't work so well with it. Sorry, so far as I'm aware there's no data loss here, but there have been boot hangs.

Finally, there have been quite a few changes to the packaging of the kernel in Debian. As you may have seen in the previous talk about Secure Boot, we now build a template source package that will be used by the signing service to produce signed versions of the kernel and its modules. We have a more flexible way to select, in the source package, which binary packages it will build.
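As an aside on the SATA policy just mentioned: the active policy is visible per host in sysfs, typically at /sys/class/scsi_host/hostN/link_power_management_policy, with values such as max_performance, medium_power, or med_power_with_dipm (the Windows-like mode). A small sketch that reads a sysfs-style one-line file, degrading gracefully when it is absent; the helper name is mine.

```c
#include <stdio.h>
#include <string.h>

/* Read the first line of a (sysfs) file into buf; returns buf on
 * success, NULL if the file can't be opened or read. Many machines
 * have no SATA host at all, so callers must handle NULL. */
static char *read_first_line(const char *path, char *buf, size_t len)
{
    FILE *f = fopen(path, "r");

    if (!f)
        return NULL;
    if (!fgets(buf, (int)len, f)) {
        fclose(f);
        return NULL;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';   /* trim the trailing newline */
    return buf;
}
```

Usage would be something like read_first_line("/sys/class/scsi_host/host0/link_power_management_policy", buf, sizeof buf), checking for NULL on machines without SATA.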
That's something that's been requested for use in derivatives, and I also found it useful for backporting Linux 4.9 into Jessie for long-term support. The kernel config files: we provide copies of the configuration we use for the binary packages, to use as a starting point for building custom kernels. Those have been moved into new binary packages, because previously we were including them in the linux-source package, and that turns out to be impossible now: the kernel configuration system wants to run the compiler and probe its capabilities. For that to work, we would have needed to install lots of cross-compilers while building linux-source. I decided against that, so this was the solution.

I've removed all the remaining build dependencies on Python 2. All the Python scripts that we use to build had already been switched over, but I've now also changed the documentation build to use Python 3. The perf documentation build was using AsciiDoc, which is implemented in Python 2; that's now switched over to Asciidoctor, which is implemented not in Python but in Ruby. I think there were one or two other things, but you can now build it all without Python 2. That's a step towards removing Python 2 from the next release.

There's been a config change to add armhf and arm64 packages that are built with real-time support. That hasn't yet really taken effect, because we don't have a real-time patch set for 4.18 yet, but as soon as there is one, those packages will get built.

Finally, all of the packaging of Linux and the other things that the kernel team takes care of: all those repositories have been moved to Salsa. We're open to merge requests; I very much prefer dealing with merge requests to patches attached to the bug tracking system.

That's all I've got. Any questions about these or other changes that have taken place since 4.14? Anyone?

I had a question about glibc.
You said things seem to be moving slowly on the 64-bit time front, but I remember when they first started implementing 64-bit file offsets, there was a whole slew of libc-level calls that had to have 64 stuck on the end of them. Is there any particular reason? I just reviewed the glibc calls I could find in the info docs that take or return a time_t, and it really doesn't seem like that big a job, famous last words. Do you know what the hold-up is?

There are quite a few. It's not just system calls; there are library calls like mktime and difftime, there are the ABI compatibility constraints, and there are issues with which names can be defined without breaking backwards source compatibility.

It just seems like there is a template for doing exactly this kind of transition, from the 64-bit file offsets. If you want to support 64-bit time on 32-bit systems, you can just call the 64-suffixed functions. The idea is, in fact, that with LFS you had the 64-suffixed functions, but you could also define _FILE_OFFSET_BITS to 64, and then all the regular functions would get remapped by macros or whatever. The same thing is going to be true with, I think it's _TIME_BITS, but I can't exactly remember. But you said something about how they don't want to support four different configurations? Right, it'll be three different configurations. Okay, that's bad enough. All right, thank you.

I guess that's it. You said there were packaging changes to do with selecting the different binary packages that were built; which version did those land in? I think it was 4.17. Okay, thanks. Some of those I cherry-picked back to do the Linux 4.9 in Jessie.

Anybody else? That seems like an absence of hands. Thank you all very much, and go get your coffee early. Thanks, bye.