Hello, everybody. My name is Davide. I'm a production engineer at Facebook, and we'll be talking about what we've been doing with systemd in the past year or so. So, to begin with — once this thing works — the agenda for today: we'll start with a quick recap of the story so far. We'll talk a bit about our progress in tracking upstream internally. We'll discuss a few instances where we were able to leverage systemd to do something especially interesting. And finally, we'll close with some case studies and interesting stories. Without further ado: we have a lot of machines. We have hundreds of thousands of machines and seven-digit numbers of containers. All of these machines run CentOS, and I'm on the operating systems team. My team is responsible for maintaining CentOS on this infrastructure and, in general, everything related to that. We maintain packaging and the configuration management system, and in general we maintain the bare-metal experience: we provide a platform that other services run on, either directly on bare metal or on top of the container platform, which in turn runs on bare metal. We've been on CentOS 7 for a while now. When I came here last year, we still had a few containers running on CentOS 6. Now everything is on 7, both hosts and containers, which also means that by now pretty much everybody has had exposure to systemd for the good portion of two years, if not more. And I mean not just people on my team who work closely with the systems, but pretty much every engineer that deploys a service or deals with a service directly. The other thing we did was that my team and I ended up traveling to pretty much every engineering office, giving trainings and talks, trying to make sure people were aware of what systemd was, how to interact with it, where the documentation was, and how to ask questions.
And in turn this has led to a lot of people reaching out to us, trying to leverage new features they found in systemd — things they thought were interesting, things they read online that they thought they could use — and in general wanting to integrate more tightly. I'm going to try to cover a few of these later. But before I do that, let's talk quickly about our progress so far. So, we run CentOS 7, but we don't run the systemd from CentOS 7, because CentOS 7 ships with 219. We backport systemd from Fedora. Last year we were on a mix of 234 and 235, and I'm happy to say that we've managed to more or less stay on track with upstream: we went through 235, 236, 237, 238, and these days we run 239 on the vast majority of the fleet. Generally speaking, we do these roughly in sync with upstream, so we'd be running whatever is the latest stable, or stable minus one, depending on whether there are pending issues. We don't backport just systemd; we also backport things related to systemd, notably util-linux, which happens to be a runtime dependency, and the build stack, because we need to be able to build systemd. For these backports, we take the source RPMs straight from Fedora, munge them a bit, and then build them internally. We publish the backports on GitHub in that repo — that has been around for a while now — so if you happen to need to run this stack on your systems, you're welcome to use it. Doing this for a while, we've become fairly comfortable with some workflows. From a development standpoint, we found that what works best by far is following the same playbook we follow for the kernel: if engineers need to develop patches or features for systemd, they develop them against master, they send a PR, and they go through the normal PR review process. We can backport the work internally in the meantime to test it, and once it's released, we can backport just the straight commit.
This works a lot better than the opposite process, which would be developing things internally and then upstreaming them, because this way the upstreaming part isn't an add-on — it's just something that comes naturally, that you do right at the beginning. By far, I think this is the best approach when dealing with open source projects and contributions. From a release standpoint, we control the systemd rollout using Chef. We found that the process that works best here, when rolling out a new release, is to start with a small set of hand-picked canary machines from various teams, so these teams get the first exposure and can give us feedback, especially if there are new features that might affect them. Later on, we start rolling from 1% to 2%, then fairly quickly up to 50%. We found that most of the time we find issues either in the initial canary stage or at about the midpoint, because that's when people start actually noticing and get significant exposure; it's not that common to find issues between 1% and the midpoint. The way the rollout works with Chef, it's pretty easy to move back and forth, so we have the ability to roll back if the need arises. Now, I mentioned we are almost everywhere on 239, and as always, there's a long tail. I got these numbers fairly recently: we're at about 94% of the fleet on 239, 4% is between 238 and 235, and a lovely 2% on 234 and 233. This is kind of annoying, especially when I get someone sending bug reports and the answer is like, oh, you're on 233 — yeah, we're not going to fix this, you need to upgrade.
The main reason for this long tail is kernel upgrades. Now, kernel upgrades wouldn't normally be a blocker for systemd, but unfortunately last year we ran into a fairly entertaining issue where both systemd and our container manager were poking the TTY subsystem in the kernel in a way that made it not work very well, and it resulted in PID 1 completely hanging and being useless until we rebooted the machine. (Is the microphone working? Oh, okay.) This has been fixed — in 4.16, that's the commit — and we actually wrote the patch and upstreamed it ourselves. But unfortunately, if you're not running that kernel, or a kernel with that patch included, you're out of luck: you need to update the kernel first. And a lot of the systems on the long tail are systems where updating the kernel requires a reboot, and a reboot means downtime — and if you think about things like, say, network switches: if you reboot a network switch, the whole rack loses connectivity for a while, so this needs to be planned. That's why we still have a long tail. I'm hoping we can get rid of this in the short term; there's generally a lot of effort going into automating kernel upgrades and being able to do them more quickly. The kernel team has actually given talks in the past about this process, if you're interested. Another kernel-related thing we hit was a bug in the networking stack that was triggered by PrivateNetwork=yes. Basically, with PrivateNetwork=yes enabled on some services, you would hit a refcounting bug in the network layer that caused the process using PrivateNetwork=yes to get stuck on exit and hang around forever. Not always — but, unfortunately, systemd enables PrivateNetwork=yes by default for things like hostnamed. hostnamed is spawned every time you run hostnamectl, and we have a plugin that runs hostnamectl at the beginning of the Chef run — it's a plugin in Ohai, actually, not in Chef itself.
So every time we run Chef, which is every 15 minutes, we would run this, and sometimes we would end up with processes in this stuck state. These tend to pile up, and it's not great when you have a lot of them. So we also fixed this: we mitigated it in the beginning by just disabling PrivateNetwork=yes on the affected services, and then we found this was actually fixed upstream already, so we just backported the patch. We had a couple of other minor issues related to upgrades. A fun one was when we accidentally rolled out a new version of systemd on a bunch of machines, because yum was resolving an upgrade transaction in a way that also upgraded systemd. That was interesting, especially because it happened on machines that were also affected by the TTY bug, so we suddenly had to reboot a bunch of boxes that we would rather have not. Finally, when we build systemd in mock, we had to disable a couple of the tests because they were failing. Actually, before writing this talk, we set about fixing these, and we found there was already a PR upstream fixing the tests we cared about — so that's nice, because we don't have to worry about that. Now, that was mostly about systemd running on machines. The other side of the story is that if you write software that wants to integrate with systemd, you need to link to libsystemd itself — if you want to use, say, the sd_notify or sd-bus APIs. At Facebook, software that doesn't run as part of the operating system itself — not system software, but application or Facebook-specific software — is built using our own internal toolchain, so we have our own GCC and glibc and friends, which also meant integrating systemd and libsystemd into the toolchain. We already had a version of libsystemd there, which was 233, but that became untenable when people wanted to actually use things like sd-bus, which were not available in 233.
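Going back to the PrivateNetwork=yes story for a moment: the initial mitigation — disabling PrivateNetwork= on the affected services — is a standard drop-in override. For hostnamed it would look roughly like this (path and mechanism per systemd.unit(5); the file name is just an example):

```ini
# /etc/systemd/system/systemd-hostnamed.service.d/no-private-network.conf
[Service]
PrivateNetwork=no
```

Followed by a `systemctl daemon-reload`; since PrivateNetwork= is a single-valued setting, the drop-in value wins over the shipped unit file.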
Updating this was quite a bit of work, because at the time 233 was still using autotools, so we had to port it to Meson, and we had to make Meson work in this system. And then, for reasons I won't go into in detail, our build relies heavily on static libraries — everything is statically linked together — and the Meson build in systemd did not produce static libraries at all. So we ended up fixing this and sending a PR, and after a bit of work it all ended up working. The benefit, though, was that once all this work was done, going from 233 to 239 was trivial: it was literally five minutes of work and running a build, and it was done. I expect it will be a similar story in the future. A bonus side effect was that we were able to get rid of a bunch of hacks we had around NSS. On a system, /etc/nsswitch.conf is what tells you which NSS modules are used for given operations. Now, if you build things using a separate toolchain, they also fetch NSS modules from that toolchain, and some of these modules were not available in 233. So we had instances where nsswitch.conf was set to use, say, nss-myhostname or nss-mymachines, and we would get errors when running some random Python program, because it would try to load modules that didn't exist. We had workarounds for it — it wasn't a big deal — but it's best to get rid of these hacks when you can. All right. Now, let's talk a bit about some cool stuff we found. I want to start with this, because it's a feature that's not very well known — and, I found, not very well documented — but it's really awesome: this is how to do a zero-downtime restart. Say you have a daemon that you want to be highly available, even during updates: you want to be able to restart or update the daemon in a way that doesn't affect ongoing connections. Here's how you do this with systemd. You have your old process and your new process.
Your old process double-forks and starts the new one. It then uses sd_notify to tell systemd to update the main PID — the main PID being what systemd supervises and treats as the main process of the service it should keep alive. Once you've done this, the two processes can just figure out on their own how to handle the transition: they could use signals; one could keep handling the old connections and then self-terminate — this is up to you. Something else you can do is use the file descriptor store. If all you need is to store some file descriptors and pass them along, you don't even need this in-flight communication: you can just push the file descriptors up into systemd using the FD store facility — systemd will keep them safe for you — and then you can fetch them back with sd_listen_fds, and they will just be available to you. This is a really cool feature that not a lot of people know about, and it works really well. We've used it internally in several cases, and it's been pretty great. A nice side effect is that it makes your service a Type=notify service, which is also something people don't necessarily want to do on its own, because it requires linking to libsystemd. But it is nice, because then you have the guarantee that when your service starts and systemd marks it as started, it actually is started — you can use the sd_notify API again to tell systemd: I'm starting, I'm starting, now I'm ready to take connections. So, all in all, this gives us a much better way to write resilient daemons. Shifting gears a bit, let's talk about resource management. I won't spend a lot of time on this, because Tejun and Johannes gave an awesome talk yesterday about all the work we're doing with cgroup2 — if you didn't see it, you should watch it. You should also check out Daniel's talk on oomd.
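Going back to the zero-downtime restart for a second: the MAINPID/READY handoff is just the sd_notify protocol, which is a datagram sent to the socket named in $NOTIFY_SOCKET. A minimal sketch in Python, without linking libsystemd (the helper name is mine; in C you would just call sd_notify(3)):

```python
import os
import socket

def sd_notify(state):
    """Send a state string to systemd's notify socket, if one is set.

    systemd passes the socket path in $NOTIFY_SOCKET; a leading '@'
    means an abstract-namespace socket. Returns True if a message
    was sent, False if we are not running under systemd supervision.
    """
    addr = os.environ.get("NOTIFY_SOCKET")
    if not addr:
        return False
    if addr.startswith("@"):
        addr = "\0" + addr[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as s:
        s.connect(addr)
        s.sendall(state.encode())
    return True

# In the new process, after the old one has double-forked and started it:
#   sd_notify("MAINPID=%d" % os.getpid())   # tell systemd to supervise us
#   ... take over, finish initialization ...
#   sd_notify("READY=1")                    # now systemd marks us started
#
# The FD store uses the same protocol ("FDSTORE=1") plus SCM_RIGHTS to
# pass the descriptors themselves; see sd_pid_notify_with_fds(3).
```

This is just what goes over the wire; the real API also supports passing file descriptors alongside the message, which is how the FD store push works.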
All of the features they talked about are either already released in systemd, or have been upstreamed and will be released — notably, in 240 we will land support for memory.min and io.latency. Roman is also working on support for the device controller with cgroup2. The device controller was a cgroup1-specific thing — there's no real device controller in cgroup2 — but Roman is working on a BPF-based implementation that will provide the same API from a systemd point of view. There's a PR up for this that is currently in review. Finally, on this subject: if you happen to deal with container managers, or write container managers, I highly recommend you read the cgroup delegation document that was merged into systemd a while ago. This document codifies a lot of the conversations we've had over the years about the best way to do things, and it makes it a lot easier to understand all the tricky points and interactions you might have to deal with if you're writing a container manager — or anything, really, that relies on cgroup delegation with systemd. Something else I've talked about in the past, but that is now finally open source, is pystemd. pystemd is a Python library that wraps the sd-bus API, so you can talk to systemd and interact with the systemd D-Bus object model from Python. This is something Alvaro from Instagram wrote; he's actually giving a talk about it later today, so you should attend that if this sounds interesting. We use this internally quite a bit — a lot of our infrastructure code is written in Python, and because this only depends on systemd and sd-bus, it's very easy to use and very reliable. Something we're working on that uses it directly is a small daemon that fetches service metrics from systemd and feeds them to various monitoring systems. This is something we've been using internally for a while.
I'm hoping to get it open-sourced sometime this year, after it's had a bit of a wider deployment. On the containers front, Lindsey gave a talk yesterday on containers, so I also won't go into detail here, but the short story is that we're trying to leverage systemd more and more in containers, both inside the containers and outside them. Running systemd as PID 1 inside the container gives us the ability to do proper supervision for the services there. It also gives us a PID 1 that is better than busybox and that, in general, can deal more reliably with things like dying children. Using nspawn, both as a container engine and for building container images, gives us a solid platform for dealing with all of the tricky bits of interacting with the system — one we don't have to maintain ourselves and that is in line with what everybody else in the community is doing. And finally, we've started looking at portable services, because portable services give us a facility for composing services together. If you think about it, we can have things like a single Facebook service — say, a daemon for doing service discovery — bundled up in an image, and then the same image can run both on a bare-metal host and in a container, as-is, using portable services. This is something we're starting to explore now; we're hoping to have more done in the future. Finally, a few words about logging. I've said in the past that we don't leverage the journal that much yet, and that's mostly still the case: on most of our fleet we run journald, but we configure it to mostly forward everything to syslog. There's a lot of work being discussed on ways we can make the journal work better for us. The main thing missing right now is per-unit settings, so we can control limits and rates and things like that. We've already sent a PR to control some compression settings for the journal, because the other concern there is the IO usage.
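The forward-to-syslog setup described above is plain journald.conf (see journald.conf(5)). Whether you keep journal storage volatile or persistent is a local choice; this is just a sketch of the shape, not our exact configuration:

```ini
# /etc/systemd/journald.conf (excerpt)
[Journal]
# keep the journal small and in memory; syslog remains the system of record
Storage=volatile
ForwardToSyslog=yes
```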
For services not using the journal, one thing we found a few months ago was that sending standard output to a file would truncate the file by default, and a team that was using this needed append support, so they ended up just fixing it themselves and sending a PR upstream, which was nice. We're hoping to do more work in this space and have more to talk about in the future. All right, now let's talk about some horror stories — or, well, case studies. The first one is something fun that happened on our database fleet. Our database fleet is a bit special compared to other machines at Facebook: it still runs on cgroup1, and it runs with vm.swappiness set to zero — they don't want any swap; if there is swap on these machines, it's bad. And what happened there was that they pinged me showing me this graph. I can't put axes on that thing, but you can figure out: one is time, the other is swap usage, and that is bad. And it happened to correlate exactly with when we rolled out systemd 238 on their machines. If you read the release notes, 238 enables memory accounting by default, and memory accounting means that every service gets its own little cgroup with its own cgroup-specific memory settings. Now, in cgroup1, one of those settings is swappiness. This is a cgroup1-only thing — it's not in cgroup2. In cgroup1, every cgroup has this memory.swappiness setting, which is like vm.swappiness, but for a cgroup. And in this case we found that, on these machines, all of the slices had memory.swappiness set to 60, which is the kernel default, instead of zero, which is what we wanted. So that's why we were getting this nice graph going up. This took a bit of digging to figure out, but it turns out this is, of course, inherited all the way down — and it's inherited from system.slice.
system.slice gets created very early on, and what changes the setting is a sysctl — and sysctls are applied by systemd-sysctl.service, which runs after system.slice is created. So system.slice gets created with swappiness 60, because that's the default; then we change the sysctl, but by then everything has already inherited from it. We filed an issue about this, and we also worked around it: we have a service override for the affected service, which is basically a find over /sys/fs/cgroup to override memory.swappiness. Now, this code doesn't actually work, because you have to do this breadth-first and not depth-first. So this is the actual code that works. I do not recommend doing this, but if you hit this kind of problem, you can maybe use the same solution. Another fun issue we had with cgroup1 was an explosion of zombies on our git master. The git master was also running cgroup1 and was also not using pam_systemd, which means that under sshd.service there were a ton of different processes — it's a git server, so it has a lot of short-lived processes. When we found the box at a very high load, we saw that PID 1 was not reaping children at all, so we had thousands and thousands of zombies on the machine. And it turns out that when that happens, even stuff like ps hangs, because ps these days links to libsystemd and talks to systemd to get information about user sessions. So this was great. It took a fair bit of poking — we initially suspected D-Bus problems — but it turned out in the end to be systemd itself: the way it did waitpid() and SIGCHLD processing, before processing the SIGCHLD it would run an aliveness check on every process in the unit by calling kill() with signal zero, doing this for every process and waiting for the results before processing the SIGCHLD. And if you can't do this fast enough, you keep getting more and more zombies and you never recover, ever.
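Back to the swappiness workaround for a second: "breadth-first" here means updating each parent cgroup before descending into its children. This sketch shows the shape of that walk in Python (the helper is hypothetical — the real fix lives in our Chef code, and the reason depth-first loses is that cgroups created while you walk inherit from a parent you haven't reached yet):

```python
import os
from collections import deque

def set_swappiness(root, value):
    """Walk a cgroup1 memory hierarchy breadth-first and write
    memory.swappiness in every cgroup, parents before children."""
    queue = deque([root])
    while queue:
        cg = queue.popleft()
        knob = os.path.join(cg, "memory.swappiness")
        if os.path.exists(knob):
            with open(knob, "w") as f:
                f.write(str(value))
        for entry in sorted(os.listdir(cg)):
            path = os.path.join(cg, entry)
            if os.path.isdir(path):
                queue.append(path)  # visited only after all siblings

# e.g. set_swappiness("/sys/fs/cgroup/memory", 0)
```

As the talk says: not recommended, but if you hit the same inheritance problem on cgroup1, this is the kind of hammer that works.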
So one of our engineers came up with an also rather terrifying way of fixing this, which was using ptrace — something called ptrace_do — to inject waitpid() calls into systemd and trick it into calling waitpid() every second or so, which fixed the problem. Then Lennart actually fixed the algorithm here. And this is a code path that runs only on cgroup1 — so it's actually a pretty good example of a code path we were not expecting to trigger, that we hadn't been looking at, and that we ended up triggering just because we were running cgroup1. Finally, a much simpler case, but still fairly entertaining. We had machines that mount filers from Lustre, and for reasons I won't go into here, we do this using NFS — but NFS in user space; it's an open source thing. NFS in user space uses FUSE, and the way FUSE works, when you mount something, after you run the mount command you end up with a lingering process that manages the mount — because, well, it's in user space. If you do this from Chef — from Chef you call /bin/mount — well, Chef calls mount, and then the process sticks around. But Chef, in our environment, runs as a systemd service. So Chef runs, completes the run, and then the cgroup stays there with this process in it. Unfortunately, the service has TimeoutStopSec= set to 15 minutes, because we want to run Chef every 15 minutes. So after 15 minutes, the mounts would go away, which is not exactly what you want. This also took a fair bit of poking, but it turns out there's a very simple fix, which is just starting the mount unit instead of calling /bin/mount, because then it gets run in its own cgroup. And you can see the code there — this happens to be in one of our open source cookbooks. But yeah, I hope this gave you an idea of a few entertaining corner cases we hit. I always like hearing about these kinds of stories, so if you have stories like that, please do share them.
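Starting the mount unit means asking systemd for the unit whose name is the escaped mount path. Here's a rough Python sketch of the idea — the escaping is a simplified version of what `systemd-escape --path` does, and the helper name is mine; for real use, shell out to `systemd-escape`:

```python
def mount_unit_name(where):
    """Simplified systemd-escape --path + ".mount" suffix.

    Handles the common cases: '/' becomes '-', alphanumerics plus
    ':' and '_' pass through, a non-leading '.' passes through, and
    everything else is \\xXX-escaped the way systemd does it.
    """
    path = where.strip("/")
    if not path:
        return "-.mount"  # the root filesystem's mount unit
    out = []
    for i, ch in enumerate(path):
        if ch == "/":
            out.append("-")
        elif ch.isalnum() or ch in "_:" or (ch == "." and i > 0):
            out.append(ch)
        else:
            out.append("\\x%02x" % ord(ch))
    return "".join(out) + ".mount"

# Instead of running /bin/mount from the Chef run (which leaves the
# FUSE helper inside Chef's own cgroup), start the mount unit so the
# helper lands in its own cgroup, e.g.:
#   subprocess.run(["systemctl", "start", mount_unit_name("/mnt/data")])
```

So a mount on /mnt/data is managed by `mnt-data.mount`, and stopping the Chef service no longer takes the FUSE process down with it.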
And with this, we can go to questions if there's still time. Yeah, one minute — questions. Yes. I can repeat it, it's fine. Yeah. So the question was about settings like the sysctls: another possible solution is doing this in the unit files, and the question was whether we considered doing that. We thought about it. But in this case, on one side, we needed a fairly quick mitigation for the problem, and on the other, all of these things are, in our case, configured in Chef, and we really didn't want to have two places where they'd be configured, because there's already a fairly detailed API and people are fairly used to setting these things in Chef using the sysctl cookbook. So while we could add this to the unit files as well, that would mean people need to change it in yet another place. I think it's something we could do if the need arose, but I would rather not do it if we can avoid it. But yeah, that is definitely a possible way to mitigate this. One more. Yeah, can you repeat the question? No, I don't have to. So, the question: when you're talking about the switch-over to a new instance of a daemon, for an internal restart — why not serialize the state to a memfd, pass the memfd to the FD store, and then stop the service and restart it from there? Yeah, that's definitely an option. It depends on the type of service: sometimes, if you have, say, in-flight connections that you need to keep handling, it's easier to have the old process finish those and have the new process take the new connections. But yes, that is totally an option too. All righty — looks like that's all the time. Thank you very much.