All right, let's get started. Hello, everybody. My name is Davide, I'm a production engineer at Facebook, and I'll be talking about what we've been doing with systemd for the past year. Before we begin, here's the agenda for today. I'll start with a quick recap of the story so far. We'll talk about how we keep systemd updated in our fleet and how we track upstream changes. We'll focus on a couple of things we've been working on lately around resource management and service monitoring. Then we'll discuss a few case studies that hopefully showcase a bunch of interesting problems we've seen, and close with a couple of words about advocacy.

So I was at systemd.conf about a year ago, and at the time we were moving the fleet from CentOS 6 to CentOS 7. I work on the operating system team, which is responsible for the bare metal experience of the fleet. We maintain what keeps the physical machines at Facebook running: the operating system, which is CentOS, the packaging, and configuration management using Chef, among other things. Our fleet is made of hundreds of thousands of physical machines spread across various data centers around the world. All these machines run CentOS, and that's what runs the website and everything else.

So about a year ago we were moving from 6 to 7. On 6 we had five or six different crazy ways of supervising services; on 7 we have systemd. I'm happy to say that we're now running CentOS 7 and systemd everywhere, which makes me personally very happy, because we managed to get rid of all of those crazy ways of doing service supervision. As part of this we got to migrate a lot of services to systemd, and we saw a lot of people start building their services around systemd and leveraging more and more of its features. One thing that also helped is that we integrated libsystemd into our internal build system, so people can now use features like socket activation directly in our daemons, which makes their lives a bit easier. This talk will focus primarily on bare metal. Jill and Zolton are going to give a talk later today about containers, so if you're interested in systemd in containers, I recommend you attend that talk.

So let's talk a bit about how we manage systemd on the fleet. In general, at Facebook we manage machines using Chef, and we have a system to do package updates in a controlled fashion. We can say: this package is at this version, and then 1% of the fleet gets this other version, then 2%, 5%, and so on and so forth. That's the system we use for managing updates of system packages. About a year ago we were on systemd 231, and since then we went from 231 to 232, 232 to 233, and 233 to 234. Right now we have about half the fleet on 233 and the other half on 234, and I'll explain why in a few slides. We've also started testing 235.

In general we run CentOS 7 because we want a stable base: we want to be able to pull security updates in an automated fashion. But we also want a modern user space, so we backport a lot of core system components from Fedora, or from Fedora Rawhide. Those components are systemd, of course, but also a lot of the ancillary ecosystem, so we backport things like D-Bus and util-linux and procps and a lot of these basic system tools.
So the experience you get using the system is somewhat similar to the experience you would have using a Fedora system, at least from the point of view of a developer. We publish these backports on GitHub, on that GitHub org; they're just the spec files and whatever patches we carry. systemd there is at 235. This is actually something that came out of last year's conference, because people were interested in it, so we made an effort to get them published. And then someone on Reddit was kind enough to build a Copr from these packages, so if you happen to run CentOS 7 and want a modern systemd, you can get binary packages directly from there. Those are also mostly up to date, and if you need earlier versions you can go back in the GitHub history.

Now, of course, with a lot of machines the updates aren't always smooth, and things can happen. I'm going to go over a few interesting things that happen during package updates and make them a little more exciting. The first is not actually systemd-specific; it's just something that happens in general when you're dealing with a large fleet of RPM-based systems and have to update packages on all of them: issues around RPM. Machines can get into bad states for various reasons. You can have power loss, you can have people running kill -9 on things, or processes running kill -9 on things, and that can leave the system in weird and interesting states. Every time we update a major package on the fleet, we always get a sizable number of these. We get issues like duplicate packages, where you end up with a machine that has both systemd 233 and systemd 234 installed at the same time, which is not ideal. You fix that with a tool called package-cleanup, which is part of yum-utils. But of course you don't want to do this by hand, so we have a small shell script that runs at the beginning of every Chef run on the machine. Chef runs every 15 minutes to converge the machine to its steady state, and before Chef runs we want to make sure the machine is not terribly messed up, so we run this package-cleanup wrapper that runs package-cleanup in various ways and tries to resolve the stuck transactions. That's one thing.

The other issue you can have is RPM database corruption, which happens especially if RPM gets killed in the middle of a transaction: there's a very good chance you'll end up with the database in a bad state. There's not really a single recipe to fix this; a lot of the time you have to try various remediations. So we wrote a tool called dcrpm that takes care of this: it tries a few remediations and works on the database until the machine is in a non-terrible state. The package-cleanup wrapper is not terribly interesting because it's just a shell script, but dcrpm is actually somewhat interesting and we're looking at getting it open sourced; we just finished rewriting it in a way that should be more maintainable. With these two together you tend to get the system into a reasonable state, unless it's really broken because of, say, hardware issues.
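As a concrete illustration of that hygiene pass, here's a minimal sketch in Python of the same idea: detect duplicate package versions with package-cleanup and fall back to rebuilding the RPM database if a trivial query fails. It's illustrative only, not our actual wrapper or dcrpm, and it assumes it's safe to run these remediations unattended.

```python
#!/usr/bin/env python3
"""Hypothetical pre-Chef RPM hygiene pass (not the real Facebook tooling)."""
import subprocess


def has_duplicate_packages() -> bool:
    # package-cleanup --dupes (from yum-utils) prints one line per duplicate, nothing if clean.
    out = subprocess.run(["package-cleanup", "--dupes"],
                         capture_output=True, text=True)
    return bool(out.stdout.strip())


def rpmdb_is_healthy() -> bool:
    # A trivial query that touches the RPM database; an error or a hang suggests corruption.
    try:
        subprocess.run(["rpm", "-q", "rpm"], check=True, timeout=60,
                       capture_output=True)
        return True
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return False


if __name__ == "__main__":
    if has_duplicate_packages():
        # Remove the older copy of each duplicate pair.
        subprocess.run(["package-cleanup", "--cleandupes"], check=False)
    if not rpmdb_is_healthy():
        # dcrpm tries several remediations; rebuilding the database is a common first step.
        subprocess.run(["rpm", "--rebuilddb"], check=False)
```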
But then you can end up with other interesting problems. For example, when we did, I think it was the 233 to 234 update, we saw a lot of machines that would do the upgrade and then suddenly basically nothing worked: they would run systemctl and it would fail, and the machine would be really sad. It turns out, for reasons unclear to this day, we ended up with PID 1 running 234 but the systemd libraries still at 233, and they're dynamically linked, so nothing really works at that point. My fix for this, which I'm not proud of but which works, is, in the sequence of remediations, running ldd on the binary behind PID 1, grepping for missing libraries, and forcing a reinstall of all the packages at the right version. Surprisingly, this works. It's a fairly crude solution and it's not great, but these together put us in a state where we can update systemd and other major packages, and we do similar things for other packages. The first two remediations are generic; the last one is specific to systemd. Together they put us in a state where we're OK: we can update systemd and other packages on the fleet in a reasonable fashion.

Then there's the other side of the coin, where we have to track what's going on upstream and upstream changes. One change that happened a few releases ago was the switch of the build system from autotools to Meson. If you're not familiar with it, Meson is a new Python-based build system, and systemd transitioned from one to the other: both were supported in one version and then autotools was dropped. And that's fine, we expected to do a bunch of work. Luckily the Fedora packaging already had all the work done, because it was moved to Meson almost immediately, so we had time to rebuild our backport on top of the new packaging, and it was mostly OK. The annoying bit was that Meson on CentOS didn't actually work at all, so we had to backport Meson and Ninja and a few other ancillary Python things to keep it happy and fed. But that worked well.

One good thing we got out of this, though, was some improvements around the compat libs. The compat libs are something really old and crufty that, unfortunately, we have to deal with. CentOS 7 ships with systemd 219, I believe, a fairly old version from before the library merge, so all the old versions of systemd shipped with split libraries: libsystemd-daemon, libsystemd-login, libsystemd-whatever. Newer versions ship just libsystemd, with all the symbols in it. It happens that packages like, say, Apache or Samba, and a lot of system packages you really don't want to rebuild, link against the old libraries. So if you want to use a new systemd, you need the old libraries in one way or another, and upstream dropped support for them because they're ancient. We used to have a pretty nasty patch to reinstate them and plumb them into the build system. That worked; it was quite a pain to keep forward-porting, but it was fine. With the move to Meson that patch had to be tossed, because it wasn't really workable anymore. So we started looking at alternative solutions, and what we came up with was taking some of the code from that patch, making it a standalone project, and leveraging the subprojects feature in Meson.
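Going back to that PID 1 library mismatch for a moment, the check itself is simple enough to sketch. This is illustrative, not the actual remediation script: the path is where PID 1's binary lives on CentOS 7, and the pinned version is a made-up placeholder that would really come from configuration management.

```python
#!/usr/bin/env python3
"""Sketch of the "is PID 1 linked against libraries that actually exist?" check."""
import subprocess

PID1_BINARY = "/usr/lib/systemd/systemd"  # where PID 1's binary lives on CentOS 7
WANTED_VERSION = "234-1.el7"              # hypothetical pinned version from config management


def pid1_has_missing_libs() -> bool:
    # ldd prints "=> not found" for shared objects it cannot resolve.
    out = subprocess.run(["ldd", PID1_BINARY], capture_output=True, text=True)
    return "not found" in out.stdout


if __name__ == "__main__":
    if pid1_has_missing_libs():
        # Crude but effective: force the whole package set back to one consistent version.
        pkgs = [f"{name}-{WANTED_VERSION}" for name in ("systemd", "systemd-libs")]
        subprocess.run(["yum", "-y", "reinstall"] + pkgs, check=False)
```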
So now, for the compat libs, we have a standalone thing that, when it builds, pulls in a copy of the latest systemd, builds it, builds the compat libraries against it so the old symbols are there, and gives you these .so files, and that works. We published this on GitHub on that org in case you need it. We also publish the RPMs we use in the other repo, the RPM backports one, so you can get spec files directly too. We're going to keep this mostly up to date and in sync with systemd, even if we don't really need to, because these symbols are so old that they're not really expected to change.

Now, OK, then there are interesting things that happen because we have a special environment. As I said before, we're half on 233 and half on 234. The reason is that we were rolling out 234 and it was fine; we would lose a few machines, but you always lose a few machines, that's OK. Then we bumped the shard for the update to go to something like 50%, and we lost a lot of machines, and that's not great. So we started looking at what was happening. You go on these machines and SSH takes five minutes to let you in; you get on the machine and systemd just hangs, like you run systemctl and it stops. These things are not good. You run strace against PID 1 and you see it's hanging trying to open a TTY, trying to open tty0, which is really not good. We pinned this down to the daemon-reexec: it went from 233 to 234, and 234 got stuck trying to open this TTY. After some digging, it turns out our container manager, Tupperware, also happens to fiddle with that TTY, and the code in the kernel that deals with this is pretty old, and there's probably a race there. We don't know yet exactly what it is, we're still trying to figure it out, but the current theory is that there's a bug in the line discipline code in the TTY subsystem that is triggered sometimes if you call daemon-reexec on a machine that's also running our container manager, which happens to do things to tty0. We have an artificial repro of this, but the actual repro is pretty hard to get. So the fix is that Tupperware probably should not fiddle with TTYs at all; that's old code we should not have, and we can fix it by just making it use a PTY. But there's also a bug in the kernel that our kernel folks are trying to figure out. That's an example of something you're unlikely to hit if you have a few machines, or even a lot of machines, but when you have hundreds of thousands of machines these kinds of crazy things unfortunately come up.

All right, now let's talk about interesting things we're doing with systemd. The first is resource management. We're interested in resource management because we run a lot of things on our machines, and we want to be able to control them and make sure the machine is doing what it's supposed to do: that the process that does the actual work, say the web server, has the resources it needs and doesn't get contention from random auxiliary services, say the thing that controls power on the machine. We do this using cgroup v2. I'm not going to go deep into cgroup v2, because my co-worker Chris is giving a talk on it tomorrow, so if you're interested in cgroup v2 you should attend that. For the purposes of this talk, cgroup v2 is a kernel API that lets you set resource limits on processes, and by resource limits I mean things like memory, CPU, and IO.
systemd leverages this: it lets you apply limits to your processes and services, and it also lets you bucket your services, partition them into slices, and apply limits to the slices as a whole. So we use this to bucket the services and apply these limits. That's one part of the picture, the enforcement. You also need a way to tell what's going on, and we do that with a small daemon that runs on the boxes, picks up metrics from cgroup v2, and ships them into our monitoring, so we can see: OK, the limit is set here and the current memory usage is there, this is actually working, or, oh, this is thrashing, there's a problem here. The other thing we have is an API in Chef that lets people define and set these limits without having to go on individual hosts and run systemctl edit and change things. The way it works is that the Chef API translates into systemd override files, so it applies the changes and you can say: change the memory limit for this thing, or move this service from this bucket to this other bucket.

This is something that's very much in progress, and we're still learning what the right things to do are. This is the general hierarchy we're using right now: there's a system bucket, which is stuff that runs on the box and needs to run on the box but isn't critical to what the box is supposed to do; there's workload, which is what the box is supposed to do, say MySQL or HHVM, the web server, or whatever; and then there's another bucket, inventively called TBD, which is meant for stuff we need to run unrestricted, say when the hardware folks have to run a stress test on the box, which shouldn't be limited because it's a stress test. The initial idea was: OK, we'll cap system to four gigabytes of RAM, we'll leave workload unlimited, and it will be great. It really isn't. First because our working set is way larger than four gigs, so if you do that, the machine immediately dies. Also because it turns out it's not that easy to tell what the workload is. In a lot of cases, say your workload is the web server, but the web server gets configuration data from another daemon. If that daemon is in system and gets capped and becomes really slow, the web server gets very angry. So you need to figure something out there. The way we're addressing that is by making a sub-bucket under workload and moving some of these daemons there so we can control them better.

The other thing we're doing is shifting the focus from memory limits, and hard limiting in general, to protection: doing things like MemoryLow in systemd, memory.low in cgroup parlance, which give you a guarantee that this service will have at least this amount of memory available, rather than a hard limit on some other service. This isn't always the solution, and all of this stuff is hard. It requires you to understand in fair detail how your service works, what its dependencies are, how memory management in the kernel works, and how all of these interact with cgroups. So there's a lot of work to do to make this simpler and easier, and we hope to get to the point where we can give people a tool they can run that gives them an idea of: OK, these are sane defaults for your service based on what we see, start with that and iterate.
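As an illustration of the override mechanism, here's a minimal sketch of what "the Chef API translates into systemd override files" boils down to: write a drop-in with the slice and memory settings, then daemon-reload. The unit name, slice name, and values are hypothetical examples; the real system does this through Chef resources rather than a standalone script.

```python
#!/usr/bin/env python3
"""Minimal sketch: turn a resource-settings request into a systemd drop-in."""
import os
import subprocess


def write_override(unit: str, slice_name: str, memory_low: str, memory_max: str) -> None:
    # Drop-ins for a unit live in /etc/systemd/system/<unit>.d/.
    dropin_dir = f"/etc/systemd/system/{unit}.d"
    os.makedirs(dropin_dir, exist_ok=True)
    body = (
        "[Service]\n"
        f"Slice={slice_name}\n"
        f"MemoryLow={memory_low}\n"      # protection: keep at least this much
        f"MemoryMax={memory_max}\n"      # hard cap (cgroup v2)
    )
    with open(os.path.join(dropin_dir, "resource-override.conf"), "w") as f:
        f.write(body)
    # The settings only take effect once systemd rereads unit files.
    subprocess.run(["systemctl", "daemon-reload"], check=True)


if __name__ == "__main__":
    # Example: keep the web server in the workload bucket, protect 8G, cap at 64G.
    write_override("my-webserver.service", "workload.slice", "8G", "64G")
```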
So another thing we're interested in is service monitoring. That stems from the fact that, as I said before, we had five or six ways to do service management on bare metal, but they were all fairly blind: we didn't have a good way to pull metrics from services. But systemd does, because systemd knows a lot of things about your service, since it's supervising it directly. It stores them and, crucially, it exposes them over D-Bus. So you get properties like all the timestamp properties: when did I start, how long have I been running, when was I restarted the last time. 235 added a property that's really awesome: NRestarts, the number of restarts. That's by far the main thing people ask me about: is my service flapping? People are really interested in that, because if your service is flapping, it's very likely that something is wrong. So these kinds of things are useful, and they're easy to get out of systemd over D-Bus. systemd also gives you status events: you can subscribe to systemd on the bus and it will send you a notification whenever a service changes state, so it goes from inactive to active, and so on and so forth. It basically gives you a view of the state machine and what's going on.

Now, the downside is that these are exposed over D-Bus, so you need to talk to D-Bus to get them out, and we started looking at ways to do this. Well, actually, you don't strictly need to talk to D-Bus: you can use systemctl show and it will dump them all. The problem with doing that is that you don't want to run systemctl show in a tight loop on every machine, because that takes up a lot of resources and it's fairly brittle. You really want a programmatic way to do this. If you look online, there are a few tiny projects on GitHub that already do this to some extent, and all of them use either python-dbus or libdbus, which would be great, except that after two weeks of trying to get that to build in our internal build system we essentially gave up. I'm still not quite sure what was going on there, and it's definitely not an issue with D-Bus itself, it's an issue with how we do things at Facebook, but long story short, it was just not producing binaries that would run. So we started looking at alternatives, and the obvious alternative is sd-bus, which has been included in systemd for a while. The downside is that sd-bus is a plain C API and there aren't really any wrappers for it. I really didn't want to write this in C, because then I would have to maintain it, and most people in infra at Facebook are Python people, not C people. There are also C people, but they're harder to come by. And all I wanted so far was a prototype, because I didn't quite know if this would work, and writing a prototype in C is not a great idea, at least from my standpoint. So I started looking at things and found that the CoreOS folks have D-Bus bindings in Go, and those bindings don't rely on libdbus: they're standalone and they work. I don't know Go, but I figured I could pick it up. So I did that and wrote a POC in Go. It's basically a small daemon that runs on the box, hooks up to systemd over the D-Bus API, polls for unit properties and uses subscriptions to get events, then collects this data internally, massages it a bit, and ships it out to a few of our monitoring systems, so we can get pretty pictures and data and things like that. And that was fine for a POC.
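For a sense of what those properties look like without writing any bindings at all, here's a small sketch using systemctl show, the brittle approach mentioned above that you would not want to run in a tight loop fleet-wide. The property names are real systemd properties; NRestarts needs systemd 235 or newer, and the unit name is just an example.

```python
#!/usr/bin/env python3
"""Sketch: pull per-unit health data the cheap way, via `systemctl show`."""
import subprocess

PROPS = ["ActiveState", "SubState", "NRestarts", "ExecMainStartTimestamp"]


def unit_properties(unit: str) -> dict:
    out = subprocess.run(
        ["systemctl", "show", unit, "--property=" + ",".join(PROPS)],
        capture_output=True, text=True, check=True,
    )
    # Output is KEY=VALUE, one property per line.
    return dict(line.split("=", 1) for line in out.stdout.splitlines() if "=" in line)


if __name__ == "__main__":
    props = unit_properties("sshd.service")
    print(props.get("ActiveState"), props.get("NRestarts"))
```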
I was a bit uneasy about having done this in Go, a language I didn't know very well, and basically as a hack. But then Alvaro came along. Alvaro works on Instagram, and they have the same problem, because they use systemd to manage their services. So he started looking at this, and Alvaro is a better coder than I am and he knows how to use Cython, so he wrote a wrapper using Cython on top of sd-bus. If you don't know about Cython, it's this magical Python thing that lets you call a C API from Python and expose what you get back as Python objects. So using this we can talk to systemd through the bus and interact with real Python objects that get translated internally into bus calls, which is kind of neat, especially for prototyping, because you get a nice wrapper where you can directly poke at things, get properties, see how they behave, and see if it makes sense. So we're likely going to use this and rewrite my prototype on top of it. We're going to open source the wrapper; we actually have the repo mostly ready, so it should be out in a couple of weeks. The monitoring daemon itself we still have to figure out how to release, but we expect to release it at some point, ideally this year. So hopefully people will find this useful, and I would love feedback on whether something like this could be useful to you, or ideas on how you deal with this problem, or ways we could do it better.

All right, now let's move on to interesting stories and case studies. By far the main thing that still causes random issues we don't quite understand is D-Bus in production. The problem with D-Bus is that when the daemon gets sad or angry, systemd also gets sad or angry, and problems with the D-Bus daemon cannot really be resolved without rebooting the machine, because D-Bus doesn't really support takeover in place. So most of the time, when bad things happen, your only recourse is rebooting the machine, which is fine for a laptop but not fine for a server. The other problem is that you can get the system into a state where systemd is still mostly working but the connection is severed, so systemctl is not working: either it hangs or it fails outright with "connection to bus failed". For us, that manifests as Chef failing on the box, because Chef cannot manage services anymore, so it will yell at you. Which is a problem, because people then get all the Chef failures, but they're not really actionable for them: they go, oh, what is this D-Bus stuff? The other thing we found is that it's actually surprisingly easy to wedge D-Bus and put it in a bad state. We had a co-worker who was doing tests with user services, and he wrote a thing where, when you log in on the box, your bashrc starts a user service that does some stuff and then starts other things. That semi-reliably managed to crash it, not on every machine, but on a good chunk of machines. Unfortunately, a lot of these problems are hard to pin down, and I would love to be able to say, OK, here's a repro, and file a bug upstream, but I don't have that. What I have is: on x% of machines, sometimes this happens, and it's really hard to track down. So right now we're mitigating this by rebooting machines when necessary and by trying to keep D-Bus up to date to get bug fixes.
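Going back to Alvaro's Cython wrapper for a second: it was later released as pystemd, so this is a sketch assuming the published pystemd API rather than the exact code discussed here. Getting unit properties becomes ordinary attribute access, with no manual D-Bus plumbing; the unit name is just an example and you need to be able to talk to the system bus.

```python
# Sketch using pystemd (the Cython sd-bus wrapper described above).
from pystemd.systemd1 import Unit

unit = Unit(b"sshd.service")
unit.load()  # open the bus connection and bind the unit's D-Bus path

# Properties come back as plain Python values.
print(unit.Unit.ActiveState)   # e.g. b'active'
print(unit.Service.MainPID)    # main PID of the service
```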
We're also looking at alternatives. There's this dbus-broker project, a drop-in replacement for the D-Bus daemon coming out of the bus1 project, that we've started looking at and testing on a small number of machines, and it looks promising, but I don't have any hard data yet on whether it works well at our scale. I'm definitely interested to see how it fares and whether it behaves more reliably, or is at least easier to remediate when there are issues.

Let's move on. RPM macros, yes. This is one that's more of a people problem than a technological problem. A lot of the stuff we build at Facebook isn't large, complicated tools; it's one binary: you have to ship one binary and one systemd unit and that's it. So we have a tool where you feed it your build config and it spits out an RPM: it writes your spec file, it takes your systemd service, adds the right macros, and here's your RPM, which is fine. The macros you get by default from Fedora restart your service on package update, which is also fine. Except a lot of people are used to the old design pattern from the CentOS 5 and 6 days where, in Chef, you do the package upgrade and then notify your service and ask Chef to restart it. If you're doing that and you're also restarting on upgrade, you're restarting twice. Which doesn't seem like a bad thing, except maybe your service takes quite a while to start, or takes up a lot of resources when it's starting, or starts talking to the world. The fix for this was really easy: you add a knob to the tool so you can disable the default restart behavior and you're done. The tricky part was socializing this, understanding that this was actually the problem, and figuring out how to propagate it. There are a few things like this that I'll get to later that are not necessarily technical issues, but more about getting things to be better understood.

Another area of interest for us is the journal and logging. The journal for us is set up to only log to a small in-memory buffer and then forward everything to syslog. We do this because we have a ton of infrastructure based on syslog, including security infrastructure, so we need syslog to keep going, and because people are used to syslog: people want to be able to tail /var/log/messages and see their stuff. On the other hand, we found pretty quickly that when people actually start understanding what the journal is and start playing with it, they really like it. People start using journalctl and go, oh, I can filter by things, this actually works. So people started asking us: can I use the journal? Because I'd like to get more data than just that tiny megabyte buffer. You can, of course: you can enlarge the buffer, you can make it store on disk. When you do that, though, you end up with a double-writing problem, because you're writing both to syslog and to the journal, they both write to disk, and some of our tools are really chatty. So you can end up writing megabytes or more per second, which is not ideal if you're IO constrained. And it's also an all-or-nothing proposition, because you either do this system-wide or you don't do it at all. What we'd really like is some way to do this on a per-unit basis, so people could say: for this application I want the journal data to go on disk and be persistent, and for that one I want it to stay transient. But we haven't found a good way to do that.
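For reference, the baseline setup described above, in-memory journal with everything forwarded to syslog, roughly corresponds to a journald drop-in like the one this sketch writes out. The file path and the buffer size are illustrative, not our exact configuration.

```python
#!/usr/bin/env python3
"""Write a journald drop-in for the "in-memory journal, forward to syslog" setup."""
import os

DROPIN = "/etc/systemd/journald.conf.d/volatile.conf"
BODY = """\
[Journal]
# Keep the journal in memory only, with a small buffer...
Storage=volatile
RuntimeMaxUse=64M
# ...and hand every message to the existing syslog pipeline.
ForwardToSyslog=yes
"""

os.makedirs(os.path.dirname(DROPIN), exist_ok=True)
with open(DROPIN, "w") as f:
    f.write(BODY)
# journald picks this up after `systemctl restart systemd-journald`.
```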
So for now, on the journal front, what we're telling our customers is to either take the hit of the double writing, or set things up so that they silence some of the stuff that ends up in syslog, but that causes other problems. So it's not really a tenable solution in the long run.

Now, another fun, for some value of fun, problem we had was loops. I didn't know this, I discovered it when I hit this problem: if there's a dependency loop between units, systemd will break the loop by removing an arbitrary unit from it, and it tells you, it puts a log line like that. We found this out because a machine would come up and it would be missing temp files and directories that were supposed to be created by systemd-tmpfiles, and then Chef would fail because this directory that's supposed to be here is not here. And you go, what the hell? So you look at the logs and you see that line, which is, OK, that's interesting. Why? The way we debugged this was with systemd-analyze. systemd-analyze gives you both bootchart-style plots, like time zero the system boots, time one systemd starts, time two this service starts, and dependency graphs, nice octopus-style graphs with the dependencies. The full graph is completely useless for us, because if you plot it you'd need something about the size of this room to read it. But you can make it plot a small subset, and that part is useful. In this case, that log line gave us enough of a clue to know that smcproxy, which is a thing we run, was somehow responsible. So we started from there with systemd-analyze, discovered it had something to do with mounts, and found that somebody had added an entry to fstab to mount a network file system and made it require smcproxy. Which is fine, but they didn't say it was actually a network file system. So this ended up creating an ordering conflict, because it's like: you need the file system, which needs this service, which needs the network, but the network needs this, and this doesn't work. Fixing it is just adding _netdev to the fstab entry and then we're done, but the process of finding it was a bit more interesting than that. And that's something I wasn't personally aware was even a thing, so it took a while to dig through the logs and see that this line was the problem and that this is what was going on.

In a somewhat similar vein, another fun problem we had was around transient units and systemd-run. We have these machines that do builds and testing for mobile apps, and the way they do that is that there are a lot of phones plugged in over USB, essentially, and then we run stuff that does things on the phones. The way the team that manages this does it, they use systemd-run to start these processes, and they run them with DeviceAllow= policies so they can only talk to the device they're supposed to talk to. And we run a lot of these, because we do a lot of tests and we have a lot of phones. Every time this runs, it enqueues a unit, runs the unit, does its thing, disposes of the unit, all good. But USB cables aren't great, sometimes stuff breaks, so this can also fail. When a unit fails, systemd lets it stick around: it stays there in the failed state. Not a problem if it's one, two, three, five units. Wait a bit, though: if you end up with something like 10,000 units, there's something in PID 1 that reads through these and ends up eating a ton of CPU. We saw that with 10K failed units, this sits around 50% CPU usage on the box.
With 30K it goes to around 100%, and that's when people started noticing, because it was kind of interesting. So the fix is putting in a cron job that calls reset-failed. Yes. Yes, that's also what I told people to try. But for now, the way this was fixed, because people wanted to get a sense of how bad the problem was, was that they added a monitoring counter that tells them how many failed units there are, and then they added a remediation: if it's past the threshold, call reset-failed. But yeah, ignoring the failure is likely the better option; I need to check if there's a way to do that with systemd-run directly or if they need to write a unit. I just thought about that, that we maybe want to improve systemd-run so that systemd-run can still surface the failure, but then automatically makes sure the unit goes away. Oh yeah, that would be awesome, I think that would take care of the case here. Because in these cases the failures are not really actionable: if it failed, it's not because of something we can control.

On to more fun, this time cgroup-related. If you have a service, by default, when you terminate it, systemd kills everything in the cgroup. You can change this: you can tell it, don't kill everything in the cgroup, only kill the main process and leave everything else to fend for itself. We try to discourage people from doing that, but people are stubborn and sometimes do their own thing. When that happens and you end up with leftover processes, the cgroup stays around, because of course there are still processes in it. So what happens if you then change the configuration of that service, say it was in this slice and you move it to this other slice, and then reload systemd? Ideally you'd expect it to apply the change. Turns out it doesn't, if there's already a cgroup with processes in it. Which makes sense. And fixing this is easy: you kill everything that's left in the control group, or you move things to the new one. But this was kind of surprising, and it caused a bit of consternation for people, because they really wanted to use KillMode=process, for reasons not particularly interesting here, and the fact that the old unit was sticking around wasn't at all evident. You basically find out because you run systemctl status and it shows the service as stopped, but it also shows the cgroup is still there, and then you poke at the cgroup and see there are processes running, and the only resort is to just kill everything. Yes. Yes, in this case, yes. The question was, is there a reason you're not using KillMode=mixed? In this case, the reason was that the thing we were starting was not actually the process doing the work: it was a thing that spawned off a bunch of children and did hand-wavy supervision of those children, and they wanted to be able to replace the supervisor but not the children, something like that. So the real solution is a redesign, to not do that, because you shouldn't do supervision in your application. But yeah, I don't know if we tried mixed specifically; I remember the folks who had this system were fairly adamant that they wanted to use KillMode=process the way they had designed it. But yeah, for this specific thing there's not much we can do, because the current behavior actually makes sense, so it's not a problem; it's just something that's not evident at all.
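Going back to the failed transient units for a moment, the remediation described above is simple enough to sketch: count the failed units and call reset-failed once a threshold is crossed. The threshold value and the idea of printing the count for a monitoring counter are illustrative, not the team's exact tooling.

```python
#!/usr/bin/env python3
"""Sketch of the failed-unit remediation: count failed units, reset past a threshold."""
import subprocess

THRESHOLD = 1000  # illustrative; pick something well below where PID 1 starts to hurt


def failed_unit_count() -> int:
    out = subprocess.run(
        ["systemctl", "list-units", "--state=failed", "--no-legend", "--plain"],
        capture_output=True, text=True, check=True,
    )
    return len([line for line in out.stdout.splitlines() if line.strip()])


if __name__ == "__main__":
    count = failed_unit_count()
    # In the setup described above, this count also feeds a monitoring counter.
    print(f"failed units: {count}")
    if count > THRESHOLD:
        # Drops all failed units from PID 1's view; the underlying failures were
        # not actionable anyway (broken USB cables and the like).
        subprocess.run(["systemctl", "reset-failed"], check=False)
```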
Finally, the most stupid and yet irritating bug I found: systemd has this escaping logic that translates device names and paths into unit names, and the escaped names contain backslashes. So /dev/something becomes dev-something, with the special characters escaped. Now, backslash is also a shell control character. So if you take that name and then run, I don't know, systemctl status something, and shell out to that without escaping it, you're going to have a bad time. And Chef did exactly that, which made things interesting: we were trying to write a cookbook to manage swap devices in some special way, and it would fail because of weird shell escaping. The fix was really trivial: make sure we shell-escape the name. This was somewhat embarrassing, because it took a while to figure out that this was what was actually going on, and we were like, ah, we should have caught this before. As part of this, we have a cookbook to manage systemd that's open source; we open source all of our cookbooks there on GitHub. So we added a small wrapper around the systemd escape function to get a unit name from a path, so you don't have to shell out for it. And this closes the gallery of horrors.

So I want to spend a few minutes talking about how you do all of this while trying not to make people too angry. Because of course, if you look at things from the point of view of the service or application owner, you'd kind of like your system to be stable and sort of static. On the other hand, you want new features, and you want to know that your system is stable but also getting updates. The approach we found that works reasonably well for us is communicating things as much as possible and making sure people understand why you're doing them: announcing core package updates widely, announcing changes that are happening, why they're happening and what's going on with them, and giving people a venue to comment, whether it's "oh, this feature is interesting, I'd like to know more, I'd like to use it" or "oh, this is going to cause me trouble". And doing these updates in such a way that people can react to them and give you early feedback. When we did one of the systemd version upgrades, it happened to cause issues for the same folks who do the phone stuff, because they were doing other secret things, and it was fine: we could just pin them to the old version for the time being, sort it out, and then move on. The other thing we found is that documentation is critical. The upstream documentation is great and very detailed, but it is not great if you don't know what you're doing and you're starting from the place of wanting to learn. So we started writing, internally, snippets of "this is the suggested way to do things", how to write a basic unit that does this or that, and we found that writing these with the customer's use case in mind works better. We also encourage people to follow what's going on upstream, because we don't have all the answers: people can read the source code, because this is an open source project, they can go on the mailing list, they can look at things.
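The actual wrapper for that escaping problem lives in the Chef cookbook, but here's a rough sketch of the idea in Python: call systemd-escape with an argument list so no shell is involved, and the backslashes in the result can never be misinterpreted. The --path and --suffix flags do the path-to-unit-name translation; the example device path is made up.

```python
#!/usr/bin/env python3
"""Sketch: get a unit name for a device path without tripping over shell escaping."""
import subprocess


def unit_name_for_path(path: str, suffix: str = "swap") -> str:
    # Passing an argument list (not a shell string) sidesteps the escaping problem:
    # the backslashes in the output never pass through a shell.
    out = subprocess.run(
        ["systemd-escape", "--path", f"--suffix={suffix}", path],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()


if __name__ == "__main__":
    # e.g. "/dev/mapper/fancy-swap" -> "dev-mapper-fancy\x2dswap.swap"
    print(unit_name_for_path("/dev/mapper/fancy-swap"))
```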
And finally, on the advocacy side, we found that talking and giving talks internally, both company-wide and by going to a team, sitting in their team meeting, and talking to them for half an hour about what they're doing, what systemd is, and what they can do with it, goes a long way toward making people happier and more amenable to using it, leveraging it, and then actually liking it and building cool things with it. With this, thank you very much, and I'm happy to take any questions you might have.

Minor clarification: he said all of our cookbooks are open source. They are not, and our lawyers will get angry. Yes. Some of our cookbooks are open source, some of our core cookbooks. Yes, thank you, Phil.

Okay, so I was curious: you were mentioning that one of the things you're having issues with, in terms of Chef integrating with systemd, is systemctl and things hanging. Is that because of daemon-reloads, or something else causing that kind of lockup, aside from D-Bus issues? So, Chef shells out to systemctl for basically every interaction it does with systemd. If I have a unit and I say I want this service to be started, I want this service to be enabled, or I'm adding or changing a unit and need to reload systemd, all of these things end up calling systemctl. So each and any of these interactions, if the system is unhealthy and systemctl either returns an error or hangs, will lead to Chef failing. What I'm curious about is whether you know the root causes, on the systemd, say, PID 1 side, of why those things are hanging for you. Sometimes we know: the kernel TTY race is one example where we found out what it was. In other cases we don't, and some we've pinned down to D-Bus. For most of the unknown ones we're fairly sure it's something with the D-Bus calls, but we haven't managed to repro it in a way that lets us actually poke at it and get some real data out of it.

Yes. Regarding the issues with the RPM database and inconsistent packages, have you looked into deploying using Atomic or OSTree or similar things? No, and I'll tell you why. The main thing is that for system packages specifically, we want to stay as close as possible to how upstream builds and how upstream works, because it makes it a lot easier to keep things not too alien, so people can understand them and so you can interact with people outside. We might well look at that in the future, but I think in the short to mid term we're going to stick with RPM and just deal with it. The other thing we're doing is actually trying to make RPM better: we're engaging with the RPM developers and the yum developers on this, so hopefully we'll have something there soon. Well, my time is up. Thank you very much, folks.