Hello everybody, my name is Davide. I'm a production engineer at Facebook, and I'll be talking for a bit about what we've been doing with systemd for the past year or so. I've given versions of this talk before, and hopefully every year there's something new that's interesting. I'll start with a quick recap of the story so far, talk for a bit about what we're doing for deployment and how that ties into the development workflow we use for systemd, quickly go through a few new features, and try to close with some case studies if there's time.

So, as I said, I'm a production engineer on the operating systems team. My team is responsible for maintaining CentOS on the Facebook fleet. We have a lot of machines, as you might imagine; a lot of physical machines. All of these machines run CentOS, and all of them run systemd. We're on CentOS 7 across the fleet; we're starting to prep for CentOS 8, but right now everything is on CentOS 7. By now we've been doing this for a while: we've been running systemd at wide scale for at least three years, and it's gotten to the point where it's pretty much everywhere.

It's been quite interesting seeing internally how things moved and how people reacted to it. When we started doing this, people were fairly skittish, and we had to do a lot of work explaining why we were making the effort to move to systemd as part of going to CentOS 7. Now it's the opposite: people reach out to our team and to other teams fairly frequently with ideas for new features they want to build that might tie into systemd, or with questions about how they might leverage new systemd features for what they're doing. At the same time, we've also started doing a lot of development ourselves around systemd and its ecosystem, contributing both to systemd proper and to tools around it, and I'll go over some of that.

So how do we get systemd onto the fleet? We deploy systemd with Chef, from RPMs. We don't run the systemd that ships in CentOS; we build it from GitHub, because we want to be able to track what upstream is doing. At the end of last year we were on 239. When 240 was released, it was a pretty big release and took us quite a long time to qualify, so we ended up skipping it for deployment. We went from 239 to 241, and then 242, which is what's running as of today on 98-ish percent of the fleet. We've started playing with 243; it's not in wide deployment yet, but we have it running in some places, and it's probably what I'll start working on when I get back from this conference. The backport we use is based on the Fedora packaging; you can find it there if you're interested.

In general, this process works pretty well. We've been doing it for a while and we don't have any major issues with it. The main pain point is that the long tail is annoying. The long tail is pretty small, about 2% of machines, but when you have a lot of machines, 2% is still quite a bit, and there are too many reasons for the long tail. One reason is that sometimes you just have broken machines: machines that don't run Chef, machines whose RPM database is corrupt; things happen, systems don't get updated, and for one reason or another they stick around in production and take a while to go away. I don't care about those that much, because eventually they'll go away, and they're broken, so who cares?
The thing that's more annoying is that sometimes when we do a release, we have to put exceptions in place, because we'll find a change upstream, or a bug, or something that affects a specific customer in a way that means they can't update right now, and at the same time we don't want to stop the whole rollout just for them. So we'll pin them to the previous version and move on. Or sometimes we'll find that something changed, and what a customer was doing was either wrong or doesn't fit the new model very well. You end up having to track these exceptions around; we have four or five of them in place right now, and I think the oldest goes back to 239. We're fairly diligent about cleaning them up, but sometimes you have to deal with that.

Now, as I said, the release process works fairly well, but it does take a while. From the moment upstream cuts a release to when we deploy it in production, the actual manual work of prepping the RPMs for testing is maybe a couple of days, but it can take quite a while to get to the point where we feel safe rolling it out on the fleet. Part of the reason is that when we go from one release to another, we don't generally do much testing of what happens in between: we follow what's going on upstream, but we only do deployments on the fleet between major releases. So quite a lot of changes can accumulate, and that can lead to last-minute surprises. The other thing is that there are basically two people doing this, me and Anita, so if one of us ends up under a bus, that's not ideal.

What we'd really like to be able to do is do development and testing concurrently. Most of the time, when people do feature development on systemd, they'll do it on master, and then they'll end up backporting the patch internally and testing it on whatever release we have deployed. This isn't too bad, but it is friction. We would also like to be able to do more and faster integration testing, to have better ways to find issues early on, and to have a fast feedback loop, both for our developers and for upstream developers.

So I started looking at what we could do there, and we ended up building a little CI/CD pipeline for this. It's not open source, mostly because it ties into internal stuff, and it's also not particularly rocket science. We take the Fedora packaging, and I have a horrifying shell script that replaces the tarball in there with a tarball made from git master. It runs every day at 10 a.m., builds the RPM, and runs the test suite as part of the build. If that passes, it deploys the RPM on a small number of machines, so we have a daily build running on a small number of machines. We're working right now on getting this hooked up with the container testing infrastructure as well, because one of the main customers of systemd internally is the container infra. This way, we can find issues early on.

This is something pretty simple, and yet it already finds a significant number of issues well before a release. When we cut 242, and later 243, this made things a lot faster. We've had it running since March, and it has led to filing maybe about ten things, between GitHub issues and PRs, for various issues we found along the way. One thing I want to add soon is integration testing on bare metal as well.
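To give a sense of the shape of that daily job, here is a minimal sketch of roughly what it does, written in Python purely for illustration; the real thing is an internal shell script, and the work directory, spec location, and canary host name here are all made up:

    #!/usr/bin/env python3
    # Rough sketch of a nightly "build systemd from git master as an RPM" job.
    # All names here (work dir, spec location, canary host) are hypothetical.
    import datetime
    import pathlib
    import subprocess

    work = pathlib.Path("/var/tmp/systemd-nightly")
    stamp = datetime.date.today().strftime("%Y%m%d")
    work.mkdir(parents=True, exist_ok=True)

    def sh(*cmd, **kwargs):
        subprocess.run(cmd, check=True, **kwargs)

    # 1. Grab current git master and turn it into a tarball the spec file can consume.
    sh("git", "clone", "--depth=1", "https://github.com/systemd/systemd.git", str(work / "src"))
    sh("git", "-C", str(work / "src"), "archive", "--format=tar.gz",
       f"--output={work}/systemd-{stamp}.tar.gz", "HEAD")

    # 2. Build the RPM from the Fedora packaging (assumed to be checked out in work/packaging)
    #    with the nightly tarball swapped in; the spec's %check section runs the test suite.
    sh("rpmbuild", "-ba", str(work / "packaging" / "systemd.spec"),
       "--define", f"_sourcedir {work}", "--define", f"_topdir {work / 'rpmbuild'}")

    # 3. If the build (and therefore the test suite) passed, ship the RPMs to a canary host.
    rpms = [str(p) for p in (work / "rpmbuild" / "RPMS").rglob("*.rpm")]
    sh("scp", *rpms, "systemd-canary01:/var/tmp/")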
I've also started looking at a test suite on GitHub that Red Hat uses for CI, based on CentOS and hooked up to the upstream repo, and I want to see whether we can run those tests internally as well, to get better coverage there. So that's the deployment side.

On the development side, as I said, we'd like to be able to do faster iteration and faster development, and we'd also like to be able to leverage the internal tooling we have for code review and CI. Right now, the way people do code review for systemd changes tends to be that they make a paste with the equivalent of their patch and one of us looks at it, which is not ideal. We also already know how to do this, because if you look at it, this is pretty much the kernel development process. So the current plan is basically to do what the kernel team does. We're putting together an internal systemd repo that will just be a read-only mirror of what's on GitHub, with the same branches and the same tags. People will branch off master for feature branches: if they're working on something, they branch off master, work on it, mirror their PR, and hopefully build from it and test it before they make the PR; but either way, at least they can get signal on what's going on. When we make releases, we'll branch off the release tags, cherry-pick from the feature branches, and cut the release. This also has the benefit that we can get rid of the hairy pile of patches we carry right now in the RPM packaging: we'll just have a simple script that grabs the patches from the git tree. This is exactly what the kernel team does. We think it might work and make life easier for us, and hopefully lead to better and faster feedback upstream as well, but we'll see. I'd actually be interested to hear from other folks here who do internal development on systemd: what development process do you use, and have you built tooling around this?

Now, let's quickly go over a few new features that landed recently. I'm not going to spend too much time on this, because there have been a lot of other talks from Facebook people, and I'd rather have them talk about what they work on. One thing that hasn't come up yet, is pretty cool, and ended up in 243 is ExecCondition=. ExecCondition is something that Anita developed. It's kind of a hybrid between the Condition settings and ExecStartPre=: it runs a command before the unit is started, actually before the ExecStartPre commands run, and depending on the exit code of the command it can pass, so the unit keeps starting; it can fail, marking the unit as failed; or it can skip execution, the way a failed condition does.

Now, why would you want to do this? Well, I can tell you why we want to do this. We want to gently nudge people towards continuous deployment of their tools. So we want a check that looks at the binary and, if the binary is too old, just refuses to start the service, and then does a bunch of other things, but that's one of the main reasons, and this is a fairly simple and straightforward way to do it. There's a possible improvement where we could maybe have a percent specifier, so we don't have to copy the name of the binary into the command, but that's just sugar.
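As an illustration, here is a minimal sketch of the kind of freshness check you could wire up through ExecCondition=; the script name, binary path, and age threshold are all hypothetical:

    #!/usr/bin/env python3
    # Hypothetical ExecCondition= helper: refuse to start a service whose binary is too old.
    # The names and the threshold are made up; the exit-code semantics are the documented
    # ones for ExecCondition= (systemd >= 243):
    #   0            -> condition passed, keep starting the unit
    #   1..254       -> condition failed, skip the unit without marking it as failed
    #   255 / signal -> hard error, mark the unit as failed
    import os
    import sys
    import time

    MAX_AGE_DAYS = 30  # assumed freshness policy

    def main() -> int:
        binary = sys.argv[1]
        try:
            age_days = (time.time() - os.stat(binary).st_mtime) / 86400
        except OSError:
            return 255  # can't even stat the binary: something is badly wrong, fail the unit
        return 0 if age_days <= MAX_AGE_DAYS else 1  # too old: skip the start

    if __name__ == "__main__":
        sys.exit(main())

The service would then carry a line along the lines of ExecCondition=/usr/local/bin/check-binary-age /usr/bin/mydaemon in its unit file (again, hypothetical paths); the percent-specifier idea mentioned above would remove the need to repeat the binary path there.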
This is actually a pretty good example, I think, of feature development that goes well. We came up with the idea internally before Christmas; we discussed it and played with a few ideas. Then in February we were in Brno for DevConf, we met with the systemd developers, discussed it with them and brainstormed possible designs, and we ended up with "doing it like this seems simple enough and it could work". It was coded up in the months afterwards and landed in 243. I think this is pretty much the ideal way you want development to go.

On the resource management front, there have already been several talks: Tejun and Dennis's talk covered resource control in general, Daniel and Anita talked about oomd, and Ioannis is going to talk later today about Senpai. Two things I want to raise. DisableControllers= landed for transient units as well; it's quite handy because it allows you to turn off specific controllers without having to rely on kernel command-line flags, so without having to reboot the box. The other thing that landed is a number of OOM-specific controls for the kernel OOM killer, not the userspace oomd, around cgroup2, notably OOMPolicy=, so you can apply OOM settings to specific cgroups.

Something else we've been working on for a while is pystemd; Alvaro did a lightning talk on this yesterday. pystemd is a thin Cython wrapper on top of sd-bus. It wraps the sd-bus API with the idea of making it easier to interact with systemd, but it also lets you poke at D-Bus in general. Right now it supports pretty much all the D-Bus properties exposed by systemd. It's been working quite well, we've been very happy with it, and we've started building quite a lot of tooling around it internally; I'll show a small example in a moment. I would like to see it used more in general, because at least in my experience it's one of the most stable ways to talk to D-Bus from Python. One thing that landed recently is socket support, so you can do fairly neat, or terrifying, things depending on your point of view. There's an example you can copy and try: what it does under the hood is make a transient socket, make a transient service, fork off a little Python web server, and it all ends up being managed as a proper service with a proper socket, which is nice.

I mentioned before that for config management we have a cookbook called fb_systemd for managing systemd; it's on GitHub and has been there for a while. There was quite a bit of work on it in the last half or so, mostly internals. One thing that's interesting: when you write overrides for systemd units, you generally end up doing that using templates, and using templates for managing overrides is what we were doing before. It's really annoying, because you end up writing the same boilerplate every time: make the directory, write the template, then delete the directory, clean up the template, reload systemd; it's obnoxious. So I wrote a little custom resource in Chef that lets you drop in an override; internally it figures out where the file should go, cleans it up when it needs to be cleaned up, and reloads systemd when it needs to be reloaded. This is pretty useful, it's straightforward enough, and the syntax is about the same as the upstream systemd_unit resource in Chef.
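Here is the small pystemd example I mentioned: a minimal sketch of inspecting and stopping a unit over D-Bus. The unit name is just an example, and it's worth double-checking the calls against the pystemd documentation:

    # Minimal pystemd sketch: inspect and poke a unit over D-Bus.
    # The unit name is only an example.
    from pystemd.systemd1 import Unit

    unit = Unit(b"crond.service")
    unit.load()

    print(unit.Unit.ActiveState)  # e.g. b'active'
    print(unit.Service.MainPID)   # main PID of the service

    # Properties and methods map onto the org.freedesktop.systemd1 D-Bus interfaces,
    # so the same object can be used to manage the unit as well:
    unit.Unit.Stop(b"replace")

    # pystemd.run() is the programmatic equivalent of systemd-run, creating transient
    # units on the fly (and, with the recent socket support, transient sockets too).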
Then one more thing Chris Down has been working on lately is a linter for systemd units. When people use systemd, they find out about a lot of the features systemd has, and they start using them. Some of these features are great; some of them we'd really rather they didn't use. One example is people using KillMode=process without really understanding what it does, or people using interesting namespacing settings without really understanding them. All of these are things that are well suited to linting. There's already a bare-bones check built into systemd-analyze, but it's not really a linter, it's more of a consistency checker. This is meant to be more of a general-purpose linting tool, where you can define a policy for the things you care about and it surfaces violations. We have it running internally; it exists and it works, and we'd like to open source it by the end of the year. It's a standalone tool, so there's nothing Facebook-specific in it. Hopefully people will find it useful, and maybe it will help other companies and folks prevent issues there.

Okay, I have a couple of horror stories, both on the theme of implicit dependencies. For the first one, we had a bunch of machines where, after we rolled out a new systemd version, I don't remember if it was 241 or 242, we started seeing that NTP was not starting on boot. After considerable digging, we discovered that the NTP service in CentOS uses PrivateTmp=, which is fine, except that PrivateTmp= internally takes an implicit dependency on tmp.mount, which I did not know and only found out after digging through this. That would be fine, except that on these machines, because of the way they were set up, and people didn't really notice this, they would boot, tmp.mount would start and do its thing, but then we would mask tmp.mount in Chef. So you would end up with a unit that is both active and masked, which is probably not something that's supposed to work. And in fact, while this works in 239, and by "works" I mean it doesn't complain, later versions of systemd will hard fail and refuse to start a unit that happens to have a dependency on something that is both active and masked. So yeah, this was not great. This was actually one of those cases I mentioned before where we had to pin this part of the fleet to 239 for a while while we figured it out. It took us probably a week, working on it on and off, because it wasn't really a showstopper; we could keep it running just fine back on 239. So yeah, implicit dependencies are kind of annoying.

We had another case like that, where we had some hosts where some directories just weren't there after boot. After digging, we found that the tmpfiles were not being created because tmpfiles-setup never ran. tmpfiles-setup depends on local-fs, and that depends on swap. One of my colleagues was working on a cookbook called fb_swap, to do encrypted swap and a bunch of other things, and he added a whole bunch of masks to the units that swap.target depends on, which makes the whole thing fail, so the entire subtree gets pruned, it doesn't exist anymore, and you end up with no tmpfiles.

The way we debugged both of these problems, but this one specifically, was with systemd-analyze. systemd-analyze has two really handy commands, plot and dot: one gives you a bootchart-style plot of the boot, the other gives you the octopus-style dependency graph. The dependency graph is completely useless in our case, because it ends up being gigantic, but you can tell it to only show you a subset of it, which is very helpful. You can also enable debug logging in PID 1, either by sending it a special signal or by passing a command-line flag.
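As a rough sketch of how those two debugging aids can be driven, assuming root and graphviz are available, and with the unit pattern picked purely as an example:

    # Sketch of the two debugging aids just mentioned; the unit pattern is only an example.
    import os
    import signal
    import subprocess

    # systemd-analyze dot accepts patterns, so the otherwise gigantic dependency graph
    # can be limited to the units you care about, then rendered with graphviz.
    graph = subprocess.run(
        ["systemd-analyze", "dot", "tmp.mount"],
        capture_output=True, text=True, check=True,
    ).stdout
    subprocess.run(["dot", "-Tsvg", "-o", "tmp-mount-deps.svg"],
                   input=graph, text=True, check=True)

    # Debug logging in PID 1 can be toggled at runtime: SIGRTMIN+22 switches the service
    # manager to debug logging, SIGRTMIN+23 restores the configured level. At boot you
    # can pass systemd.log_level=debug on the kernel command line instead. (Needs root.)
    os.kill(1, signal.SIGRTMIN + 22)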
That debug logging is actually how we found this out, because we ended up seeing the debug message saying that this tree was getting pruned because that unit didn't exist anymore. And that's all I have. Questions?

You mentioned that you're starting to think of debugging systemd like debugging the kernel. Developing. Developing the kernel, but I guess developing involves debugging too. Can you elaborate a little bit more on that analogy? Are you bisecting things, the whole range of what that would mean?

So I meant that primarily in terms of development workflow, more than specifically on the debugging side. In terms of development, we have a fairly well-honed process: we have an internal kernel tree, we use exactly that feature-branch and release-branch model, we have automated testing for the kernel, we have CI, we have automated deployments, and we have very good ways of understanding, either via CI or via A/B testing, whether a specific change is going to impact things and how it's going to impact them. We would like to bring that to systemd as well, and eventually to other system software we work on. For debugging specific issues, it ends up being very different, actually, from developing the kernel in that sense, and it tends to be very hit or miss. What I personally do is play with the debug logging and with the tools available on the box; surprisingly often, tracing PID 1 ends up being very useful for understanding what's going on. We had a couple of cases where we ended up with PID 1 locked up or in a bad state, I talked about them last year actually, and in those cases tracing was how we found out what the hell was going on there. We've had interesting and tricky interactions between kernel and user space sometimes, especially when there are API mismatches or things that change, but I'd say that doesn't really happen that often nowadays; we've gotten pretty good at that.

I would assume that at Facebook you're logging a lot of stuff. Do you use journald, or do you have a different logging system? Are you logging remotely or locally?

We run journald on every machine in the fleet. By default we run journald with a 10-megabyte volatile journal, and then we run rsyslog on all the machines, and rsyslog ships the logs off somewhere. I don't actually know what happens afterwards; there's a lot of collection infrastructure managed by the security team for that. The reason we do this is that people really like being able to grep /var/log/messages and things like that, and there's some amount of tooling that does automation based on that. We've been trying to move people towards the journal, because usually what happens is that we run training internally on this, people realize the journal is nice and that they would really like to use it, they find out about all the journalctl commands they could use, and then they ask me if they can deploy it. It's kind of difficult to have both the journal and rsyslog running persistently at the same time, because you end up double-writing and causing a lot of extra I/O, and we have applications that can be extremely chatty in terms of logging; if you end up writing three gigabytes a second of logs, that's not great. One thing that would help, and we've been looking at ways to do this, is that right now the journal is a bit of an all-or-nothing endeavor: either it's entirely volatile or it's entirely on disk.
We've been looking at ways we could have a per-unit setting there, which would help transition things over, because in a lot of cases it's one application team that would really like to use the journal, but maybe everybody else isn't quite there yet. This is one of the things I might play with when we start rolling out the next CentOS release, to see if, coupled with that rollout, we can maybe nudge people into using it more.

Have you looked into using syslog-ng? It has modules to slurp up journald entries as well as syslog entries and other things together, so that those who want to grep the syslog files can, those who use the journal can, and you're writing once and storing once.

Well, you can do that, but you still end up writing twice, though, don't you? Sure, but then when you run journalctl you only get the small buffer you have, and that's kind of the problem there. We actually have rsyslog set up that way right now; I think it uses the imjournal module or something. And that's kind of the problem: if you only have a small volatile buffer, then if you actually want to use journalctl, or if you do systemctl status on something, you see no more logs, and that's a bit annoying.

Regarding the linter tool: since you mentioned it's more than just linting, maybe analyzing too, why not integrate it into systemd-analyze? If not, why not? And also, with systemd versions changing and unit files being different, how do you manage that?

So, some of the things we wanted to do in the linter actually ended up in systemd-analyze. One thing specifically was parsing time specs and validating that the time specs, like the ones we use in OnCalendar=, were correct; that ended up being implemented as a feature in analyze itself. I think implementing the whole linter as part of analyze would probably be possible, but it would maybe be out of scope for analyze itself, especially because we specifically want this to be pluggable in terms of policy, and to maybe have the ability to express more complex policies; I don't know how well that would fit there. Chris is actually here and he can answer that more usefully than I can, most likely.

Thank you. So, another reason is that if you look inside the systemd source, one of the things we have is a lot of little bits of unit state, a lot of little bits of "go look for this thing and store it somewhere". You don't really have a way to map that back to "here is how the unit file looks" or "here is how we got there"; you can get something out of systemctl show, but it's hard to map that back to how we arrived at that outcome, and that's often the thing we need to know: how did we get here? So currently it's hard to just make it part of systemd-analyze, because, I mean, you can do it, but you would again have to re-implement a unit parser and do all kinds of other stuff, and I think it doesn't make a whole lot of sense.

With that, thank you, Davide, for your talk. We've got a quick five-minute break.