Hi, I'm Lennart Poettering. I'm going to talk a little bit about what's new in systemd in 2018. This talk is basically just a collection of things I found interesting in systemd development over the last year, basically since the last All Systems Go. There's no particular order, and it's completely fine if we don't cover everything I have on the slides here. So if you have any questions about any of the topics I bring up, by all means interrupt me and ask right away; I much prefer making this interactive to saving all the questions for the end. That said, I only have half an hour, so I should probably get started. So yeah, as mentioned, no particular order. The first item is portable services. I'm not going to say much about this, because I have another talk later today, in about an hour, just about that. But it is a biggie, so I put it first. What I actually do want to talk about is boot counting. This is something that isn't merged yet, but it's pending: it's a PR in systemd, and we'll probably have it in the next release. Boot counting matters if you build operating systems that are supposed to be somewhat resilient to failure and able to recover automatically from failed boots: the boot loader needs some way to know whether an update was successful and, if it wasn't, to revert to the old version of the operating system. Various operating systems implement something like this; for example Chrome OS does, and so does CoreOS. But most general-purpose distributions have nothing like it, and the solutions that exist so far have all been local to those specific operating systems. Nobody tried to make this a commodity, to make it generic, so that it's generally useful for general-purpose Linux distributions.
With this feature in systemd, we want to generalize the concepts around it, as well as ship one specific implementation for one boot loader, the one we ship ourselves, sd-boot. But everything this entails is generic enough that you can hook it up to other boot loaders, and there has been work to make it work with GRUB as well. The way this is ultimately intended to be used is: when boot counting is enabled, the boot loader boots one version of the operating system or the kernel, and after the kernel has come up, some tests can run and figure out whether everything is OK. Only when these tests pass is the boot menu item "blessed", and if it's blessed, it will be booted again next time. If it's not blessed, a counter is decreased, counting down from 3 or so towards 0, and when it reaches 0 that entry will not be attempted again. So if you do high-reliability server stuff or embedded stuff, this is interesting to you. If you do desktop stuff it's probably less interesting, because if there's a human sitting in front of the computer, this matters less. But for everything else it is. There was a question somewhere. Do we do the microphone? Or I can repeat the questions if that's quicker. The revert on the next boot. OK, so the question was about failures like broken video drivers, which usually need a human to confirm that everything is OK. That's actually a good point. One of the Fedora people, Hans de Goede, has raised exactly that issue. For the desktop case they want a particular kernel to be blessed only after GNOME Shell has come up and the desktop environment has confirmed that everything is OK as well. The PR that's almost ready to merge does not have that functionality yet.
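The trick the pending work uses, as I understand it, is to encode the counter in the boot entry's filename, so the boot loader can update it with a simple rename. Here's a rough sketch of the convention in shell; the parsing function and the entry names are my own illustration, not the actual sd-boot code:

```shell
# Sketch of the boot-counting filename convention (entry names made up):
#   "fedora+3.conf"   -> 3 tries left, none done yet
#   "fedora+2-1.conf" -> 2 tries left, 1 failed try done
#   "fedora.conf"     -> no counter tag: the entry has been blessed
parse_entry() {
  local name=${1%.conf}                # strip the .conf suffix
  case $name in
    *+*-*)
      local tag=${name##*+}            # e.g. "2-1"
      echo "tries-left=${tag%%-*} tries-done=${tag##*-}"
      ;;
    *+*)
      echo "tries-left=${name##*+} tries-done=0"
      ;;
    *)
      echo "blessed"
      ;;
  esac
}

parse_entry "fedora+3.conf"    # -> tries-left=3 tries-done=0
parse_entry "fedora+2-1.conf"  # -> tries-left=2 tries-done=1
parse_entry "fedora.conf"      # -> blessed
```

On a successful, blessed boot the counter tag is dropped from the name entirely; on each failed attempt the tries-left number goes down until the entry is skipped.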
But most of the general concepts are extensible to that point, and it's very likely that before we do the next release we'll add that functionality too, so that this can be used in Fedora right away. OK, let's talk about the next one. Something that is also pending as a PR: you know, systemd-nspawn is the small container manager that ships inside systemd. It's like chroot on steroids; it's what we originally wrote to test systemd with, but it's actually generally useful now. I have added OCI runtime support to it. OCI is a specification that came out of the Docker container world and is supposed to generically define what containers look like. nspawn already implemented pretty much everything needed for this kind of container, except OCI itself. So given that people can hopefully agree that OCI is the way containers are put together, it made sense for us to support it in nspawn natively. The idea is basically that a basic building block of the operating system can just run your containers, acting as the executor; it's not going to solve how the containers actually got onto the system, that's for other people to solve. But I think the useful long-term goal would be that Kubernetes could use this thing directly, so that the actual execution of a container is no longer something the upper layers have to think about; executing a container just becomes functionality of the operating system itself, as long as it's an OCI container. This is also pretty much ready and just needs a final review and to be merged into systemd, and hopefully it then just works for everybody. Then again, I only tested it with a couple of OCI containers, so before this ends up in the big distributions it will probably need more testing with real-life containers. Any questions on that?
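For context: what an OCI runtime consumes is a "bundle", a directory containing a config.json and a root filesystem. A minimal, purely illustrative layout looks like this; the nspawn invocation in the comment uses the option name from the pending PR, so treat that as an assumption:

```shell
# Build a minimal (non-bootable, illustrative) OCI bundle layout.
bundle=$(mktemp -d)
mkdir -p "$bundle/rootfs/bin"
cat > "$bundle/config.json" <<'EOF'
{
  "ociVersion": "1.0.0",
  "process": { "args": ["/bin/sh"], "cwd": "/" },
  "root": { "path": "rootfs" }
}
EOF
# With the pending nspawn support, one would then (hypothetically) run:
#   systemd-nspawn --oci-bundle="$bundle"
ls "$bundle"
```

The point is exactly that the bundle format is runtime-agnostic: whatever pulled the image and unpacked it there doesn't need to care whether nspawn or some other runtime executes it.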
The next one is an interesting one, and it's already in systemd. If you follow systemd development, you might have noticed that we put a lot of focus nowadays on sandboxing system services. The idea is that most operating systems are still put together mostly out of system services, and if we sandbox those, we can make operating systems a lot more secure overall, because there are so many different services and they tend to be imperfectly written, because they're written by humans. System call filters have been implemented in systemd for a while, but they weren't overly useful, because you had to figure out exactly which system calls to allow and which to deny, and it was mostly a blacklist system: you basically told systemd not to allow Apache to change the clock. In general though, if you do security, you usually prefer whitelisting, where instead of saying that Apache is not allowed to touch the clock, you just list what it is allowed to do. That's not easy, though, because there are so many system calls. Since the latest release we have a new system call group, called @system-service: we sat down and tried to figure out a good default set of system calls to allow for regular system services, the basic set that everybody needs, and we gave that set a name. The idea is that from now on, people who put together system services just enable this group, plus the couple of individual system calls they need that are not in it, for example the right to change the system clock. So the idea is really that we want to push people towards whitelisting system calls by default, and make that easier than it used to be. Questions on that? So the question was which is the most controversial system call in there, the one we had to argue about.
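As an aside, to make the whitelisting concrete: in practice this is just a line or two in a unit file or a drop-in. A sketch, written to a temp directory here; the service name is made up, and whether your service needs extra groups beyond @system-service is something to check case by case:

```shell
# In real life this drop-in would live under
# /etc/systemd/system/foobar.service.d/ (service name hypothetical).
dropin=$(mktemp -d)
cat > "$dropin/10-syscall-whitelist.conf" <<'EOF'
[Service]
# Allow the baseline set every ordinary service needs...
SystemCallFilter=@system-service
# ...plus, for an NTP daemon say, the clock-changing calls as well.
# Multiple whitelist lines like these merge into one allow-list.
SystemCallFilter=@clock
EOF
cat "$dropin/10-syscall-whitelist.conf"
```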
Good question. I mean, this didn't come out of nothing: we have had groups like this for a while, but those groups were all very small, so previously you still had to list a lot of them to run any regular service, Apache for example. So after learning how that played out in real life, we looked at generic services like Apache or nginx, which don't do anything magic: they do very basic stuff, nothing particularly kernel-related. The idea is that @system-service contains everything you need to run nginx, but not more. It's not enough to run an NTP server, because that needs to change the clock; it's not enough to run a network management service, because that needs to be able to change the network; but it is enough to run an HTTP server that does basic file serving or something like that. So no, there wasn't anything controversial; it came from educated observation of how things actually are. No other questions? Let's go to the next one. This one is a minor thing but actually kind of useful. You know, if you do service management with systemd, you always specify Type= something to tell systemd how the service signals that it has finished initializing. Type=exec is actually something we should have added a long time ago. It's basically a hack so that systemd considers a service successfully started at the moment the execve() succeeds: when systemd invokes the binary and the execve() of it completes, that's when the service counts as started up. That sounds unsurprising as a definition, but given the way Unix is built, it's not the obvious way this gets implemented.
Previously, what came closest to this was Type=simple, but in that case systemd considered a service started the instant the fork() completed. If you are a Unix developer, you know that when you start a process you first fork() and then exec() in the child; previously we considered the service ready at the fork(), and now we can optionally consider it ready at the exec(). Why is that interesting? Because it means systemd will no longer consider a service whose binary is absent to be successfully started. Previously, once it reached the exec() it already thought the service was started; if the binary wasn't there, the exec() would fail, but the start-up would still be considered successful. With this, things are a bit more debuggable. Then again, for compatibility we can't make this the new default, so if you want to use it you have to specify it explicitly, but it's incredibly useful and, quite frankly, something we should have had from the start. I don't think the mic works, so I'll repeat that: the question is why we didn't make it the new default, why it's opt-in. The reason is that between the fork() and the exec(), systemd executes a lot of operations, like dropping privileges and so on. A couple of those operations, for example dropping privileges, need to resolve a user name; resolving a user name might need NSS, which might need IPC to another service, and so you suddenly create races, because suddenly systemd as PID 1 would block on an NSS lookup before starting other services, which it previously didn't. So it's the risk of deadlocks we saw there. We couldn't make it the default because we didn't actually verify that everything still boots if we change it, but just knowing that NSS is a major source of deadlocks, we couldn't switch it. I hope you followed that in some way.
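In a unit file the change is one line. A sketch, written to a temp file here just to show the syntax; the binary path is made up:

```shell
# Real units go in /etc/systemd/system/; this just demonstrates the syntax.
unit=$(mktemp)
cat > "$unit" <<'EOF'
[Service]
# With Type=exec, start-up only counts as successful once the execve() of
# the binary below succeeds -- so a missing binary now fails the start,
# instead of being silently considered "started" as with Type=simple.
Type=exec
ExecStart=/usr/bin/my-daemon
EOF
grep '^Type=' "$unit"
```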
DNS over TLS: you know systemd has this resolved component that acts as a local DNS caching server. A recent addition, merged and even released already, is DNS over TLS support. The logic behind it is that this appears to be the way DNS is going to work for the next ten years: everybody is doing it. It basically just adds transport encryption, but the way TLS works, you always make a TCP connection to the central server, and if you do that, you really want a local caching singleton service to handle it, because it becomes too expensive if every process sets up the TCP and TLS connection itself; there would be huge latency involved. So I think this is a major step forward, because it gives you a real reason to use resolved: you actually start needing it if you want to work the way DNS is going to work in the future, simply because you don't want the latency, and resolved gives you the ability to cache locally and keep the connection established, so you don't need to set it up at the moment you actually need it. Any questions about this? The next one looks a little cryptic. This is about service management: when you write a unit file, you already had the ability to extend it with drop-in files. If you had foobar.service, you could create a directory foobar.service.d and drop a file in there called something.conf, and it would be read after the service file itself and could override or extend what the service file did. We have now slightly extended this: we look not just in the directory named after the service with .d appended, for everything with the .conf suffix in it, but also in the directories for all dash-truncated prefixes of the name.
The idea is basically: if you have a service called foo-bar-waldo.service, then we first look in foo-bar-waldo.service.d as before, but now we also look at all files in foo-bar-.service.d, as well as foo-.service.d. If you still follow; I probably should have put that on the slides, I guess. Anyway, long story short: a lot of people, when they put together their systems, have a lot of services that somewhat belong together. Samba, for example, comes with smbd and nmbd and so on; they are related services that are usually shipped together and run together, and that you hence might want to manage together. With this change, systemd allows you to extend all of those service files in one go, as long as you follow a very simple naming scheme: you always name these related services with a common prefix, a dash, and some suffix. So, did that make any sense to anyone here? Wow, surprisingly many who did not understand what I was just talking about. OK, this is a little confusing, so once more: if you have a service called foo-bar-waldo.service, we first look at foo-bar-waldo.service.d; then we remove everything after the last dash, keeping the dash itself, which gives us foo-bar-.service.d; then we remove everything after the dash before that, and so on, so the next one is foo-.service.d. The dash always has to be there, because the dash is what triggers this prefix lookup. We believe it's a very natural extension, because at least in systemd itself all our services were already named like this, and if you go through the Fedora packaged services, you'll see that people implicitly did the same kind of thing.
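The truncation rule above can be sketched in a few lines of shell; this is my own re-implementation of the search order for illustration, not the actual systemd code:

```shell
# Print the drop-in directories searched for a unit, mimicking the
# dash-prefix scheme: truncate at each dash from the right, keep the dash.
dropin_dirs() {
  local unit=$1
  local suffix=${unit##*.}   # e.g. "service"
  local name=${unit%.*}      # e.g. "foo-bar-waldo"
  echo "$unit.d"
  while [ "${name%-*}" != "$name" ]; do
    name=${name%-*}
    echo "$name-.$suffix.d"
  done
}

dropin_dirs foo-bar-waldo.service
# -> foo-bar-waldo.service.d
#    foo-bar-.service.d
#    foo-.service.d
```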
So again, the pattern people used is some common prefix, a dash, and some specific suffix, and the idea is to make it a lot easier to extend them all in one go. Yeah, I'm supposed to repeat the question: the question was what happens if I create a directory called systemd-.service.d and drop something in. Yes, it will change every single service that we ship, automatically, in one go. I'm not sure that's desirable, but knock yourself out. So, next slide. This is super important. We realized that today's graphical terminals all support a special ANSI escape sequence that lets you generate clickable hyperlinks in them, and that's just awesome. So I recently prepared a PR, and it got merged, so that everywhere it makes sense in systemd output we now create clickable links in your terminal. And that is really, really nice, because in the systemctl status output, for example, where it shows you the unit file something is defined in, and all the drop-ins and so on, those are now clickable links: you can just click on one and it opens in gedit or whatever you have configured, so you can have a look at it. It's really nice. I invite everybody to extend your own tools the same way, because, come on, links: it's almost as good as emojis, right? The only problem is that while all the current graphical terminals implement this, less does not; the pager does not. So if you use the pager, which we actually do by default because we do git-style auto-paging in most systemd tools, you're not going to see this. That's a bit of a limitation. But we hope that less will eventually be updated to support this as well, and then it's going to be so much better. Any questions about that?
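For reference, the escape sequence in question is the OSC 8 hyperlink sequence, and emitting one yourself is a one-liner. A small sketch; the file URL is made up:

```shell
# Emit a clickable hyperlink using the OSC 8 escape sequence that modern
# graphical terminals understand; elsewhere it degrades to plain text.
# Shape: ESC ] 8 ; ; URL ESC \  TEXT  ESC ] 8 ; ; ESC \
link() {
  printf '\033]8;;%s\033\\%s\033]8;;\033\\\n' "$1" "$2"
}

link "file:///etc/systemd/system/foobar.service" "foobar.service"
```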
Something we recently did is turn on memory accounting by default. The background: this is basically how systemd exposes cgroups, the various controllers there are. The cgroup controllers always do two things, accounting and resource management: figuring out how many resources a service uses, and putting limits on how much it may use. There are multiple controllers, and they had different qualities in the kernel implementation; some of them were really expensive. Until recently, if you turned on the memory controller, for example to get per-service memory accounting, it slowed your machine down by 10% or so; that was the number Tejun quoted. We have now turned this on by default, because this has changed in the kernel: with current kernels, memory accounting is not completely free, but it's very close to free. That was enough for us to say OK, it's enabled by default now. We also turned on block I/O accounting and task accounting, though not CPU accounting. So three of the really interesting ones are enabled by default. What does that mean in effect? It means that if you type systemctl status on a completely regular default system, you will now see how much memory a service uses and how many processes it has. Unfortunately not yet how much I/O it used, but that's just an omission because we were lazy, not because we didn't want to. I also heard from the guy working on it that we can probably soon enable IP accounting by default the same way, because they managed to get the cost of collecting that information per service so low that it's something we can enable by default.
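These defaults correspond to knobs in system.conf, so you can also flip them yourself. A sketch of the relevant options, written to a temp file here rather than the real /etc/systemd/system.conf:

```shell
conf=$(mktemp)
cat > "$conf" <<'EOF'
[Manager]
# Per-service accounting toggles, now on by default on current kernels:
DefaultMemoryAccounting=yes
DefaultTasksAccounting=yes
DefaultBlockIOAccounting=yes
# Still off, because of its kernel-side cost (see below):
DefaultCPUAccounting=no
EOF
grep 'Accounting' "$conf"
```

Individual services can override any of these with the matching per-unit options (MemoryAccounting=, TasksAccounting=, and so on).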
I'm totally looking forward to that, because I think it makes service management a lot more explorable: without doing anything magical, the system just tells you out of the box how many resources a service takes up. Quite frankly, it's something we should always have had but never did. What's unfortunate is that CPU accounting is still too expensive, and I think it would actually be the most interesting one: how much CPU time a service actually uses. So we're going to have to wait a little longer for that, but as soon as the kernel folks working on this let me know that it's safe now and doesn't cost 10% or whatever of CPU time just to turn on, we'll enable that too, and I'm looking forward to it. Any questions on that? Something we also added is IP accounting and firewalling. We had CPU accounting and management, and block I/O and memory as mentioned; now we also have that for IP packets. If you turn this on, which as mentioned is not entirely free yet, though people are working on making it effectively free, then systemd will track, per service, how many packets have been received and sent by each service, as well as how many bytes that is. I think that's incredibly useful. Related to that there is also firewalling, where you can specify IP address ranges that a service may or may not contact. We actually turned the firewalling on for all the services we ship by default. For example, udev cannot access the network anymore, which is a really good thing, because udev shouldn't be able to access the network; and we went through all our services, so this is enabled everywhere.
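The deny-by-default-plus-whitelist pattern for a single service looks roughly like this; a sketch written to a temp directory, with a made-up service name and address range:

```shell
# Stands in for /etc/systemd/system/foobar.service.d/ (name hypothetical).
fwdir=$(mktemp -d)
cat > "$fwdir/50-ip-firewall.conf" <<'EOF'
[Service]
# Count packets and bytes sent/received by this service:
IPAccounting=yes
# Deny all traffic by default...
IPAddressDeny=any
# ...then whitelist only what the service needs; allow lines accumulate:
IPAddressAllow=localhost
IPAddressAllow=192.168.0.0/16
EOF
cat "$fwdir/50-ip-firewall.conf"
```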
I think this is really an awesome feature, because it allows you to do true service-level firewalling, and it's fully dynamic. If you do traditional firewalling on Linux with iptables or something like that, you work at a level where you look at the individual packets flowing through the network, so you have no local context anymore: the packets stand for themselves, and you don't really know which program they belong to. With this new stuff that's resolved, because it's inherently local: all the accounting and all the access control is something you configure per service. So I think it's a massive step forward, and it's how local firewalling should work. By the way, the guy who's doing the video implemented this part, so say thanks to Daniel. I don't think I have much time anymore; let's spend it on questions if you have any. If nobody has a question... How is the firewalling implemented, do we use eBPF for that? Yes, we use the per-cgroup BPF packet filter hooks that exist in current kernels. This functionality is only available if you enable cgroup v2, by the way, which I think should actually be one of the major reasons why distributions should really look into turning on cgroup v2 now. The container mess is kind of stopping us from that, but that's a highly political, really messy discussion. But yeah, it's all BPF, and if you want to experience how awesome this is, you have to make sure you run a distribution that supports cgroup v2; on Fedora, for example, you can specify that on the kernel command line and it should just work. Another question? Do we also support live reloading of the rules, and is there an API to change them on the fly while the service is still running? Yes, we do. This is exposed the same way as all the other resource management properties of systemd services.
So you can do systemctl set-property on some service name with IPAccounting=yes, and there you go, you have IP accounting; but you can do the same with IPAddressAllow= something, and you can even reset it that way, so it's entirely dynamic and entirely focused on the specific service. The interesting thing is that this is also available for slices, so you can actually build entire trees of this: the firewall you specify for a slice and the firewall specified for a service inside that slice get merged. You can say that for the root slice no traffic is allowed, and then for a leaf service that it shall be able to do traffic to this port, and they get merged, so the blacklist at the top gets masked out by the whitelist at the bottom, which is the behavior you want. So yeah, it's kind of cool, actually. It's all D-Bus APIs if you like that, and from the shell you can do set-property. But my time is probably over. One more question? No more questions. OK, if you have any further questions, meet me in the hallway track. Thank you very much.