Okay, so this is almost a perennial talk for me at this point, talking about the challenges of maintaining configuration changes in PID 1 in terms of reloading state, especially for unit changes on disk. This has particularly been a problem that we've encountered at my company, Pantheon, because we sometimes have thousands of units deployed to disk. I've personally worked on the problem a little bit, but I've also gotten a lot of feedback from our own ops team on what we've seen with our Chef-based configuration managing all of these units. This is not going to advance. Let me restart that. There we go. So part of the issue is that you have to understand a little bit about the architecture of PID 1. 99% or more of the operations in PID 1 go through one central loop: whether something has been requested over D-Bus, whether a signal has arrived, whether PID 1 is reporting watchdog information, all of that goes through that single loop. It looks like this, and it's in the manager. I'm not really going to go through all the code here, but basically it has some rate limiting in it, it has the watchdog reporting to the hardware watchdog of the system (this is separate from the watchdogs for services themselves), it interacts with D-Bus, and eventually basically everything runs through here. But this is also a single point of blocking. And the way that we do reloading today, whether you're doing a daemon-reexec or a daemon-reload, is that we basically tear down the entire world, with the exception of serializing some of the runtime state; then we rebuild it off disk, and then we deserialize the runtime state. In the meantime, a whole bunch of things are not happening. I don't believe it's able to notify the hardware watchdog. It's not able to invoke scheduled timers.
It's not able to receive watchdog notifications from supervised services. It's not able to provide unit status to monitoring tools; this is the case where you have something querying unit status on the system to report failures, and those queries could time out. It can't process socket notifications or socket activations. It can't automount file systems when they're being accessed. And in terms of things run by Chef or Puppet or another administration tool, it can't stop, start, or reload services. So it's actually pretty disruptive to have a really long reload. The reason it's worked pretty well so far is that the vast majority of deployed systems have few enough units that this doesn't take long enough to become a problem. But as soon as it does, it becomes a pretty big thorn in the side of the infrastructure, because it interrupts all of these things. And we've done work so far to mitigate the problem. Internally in the systemd code, we now have the ability to set some options on units without actually reloading them. This doesn't include changes to dependencies or to a lot of the key configuration of units, but there are some things you can do with it. And we've certainly put in place a lot of optimizations for the actual reload process itself, where even though it's blocking, we've shortened the time. And externally, at least at my company, we've worked on changes to reload less often. For example, one thing that I can't emphasize enough to anyone running into this problem: you only need to run daemon-reload when you change a unit. If you drop a new unit file, you don't need to run it right away. In fact, this has been the case for years: as long as you're dropping something new, you can just do a start, it'll load up the unit, add it to the tree of things, and go on without running the reload.
So one thing that we do in Chef at my company is, when we're dropping a file, even though Chef is designed to be idempotent in the way that it drops configuration, you can have it use a different strategy for reloading depending on whether it dropped a new file or altered an existing one. We've also moved a lot of the configuration, somewhat regrettably, into environment files, so that it's read at runtime and doesn't require a reload when it changes. And then we've looked at things like running fewer units in the PID 1 systemd, moving some units to child instances and things like containers. And honestly, one of the biggest impacts we've seen is from using SSDs for this. If you still use spinning disks on your servers, you're taking a huge penalty on your reload time, because of the random access for loading all these files, and all the symlinks for enabled units, off disk; I think it sped some of these operations up for us around a hundred- or a thousand-fold. So it's a useful thing to look at if you're running into this. But these are mitigations. They don't change the fact that the reload process blocks PID 1 and the central event loop for whatever duration it happens to take. And there's still work on the horizon. I was talking to Lennart yesterday about what was going on, and apparently it's really close in the pipeline to be able to reload individual units and not have to reload everything all at once. So at least that should mitigate things further. And in terms of the effects, one effect we've seen is not just in PID 1, but in things getting spammed on D-Bus with events for units, especially around reloads, because basically everything gets destroyed and everything gets recreated. Scaling work like the dbus-broker should help with that as well, I expect. That hasn't been a problem for us in a while, though.
But this is also still not going to solve all of the challenges: things like units generated from fstab, or legacy System V init scripts that get converted into units when systemd loads its state. Those do not get refreshed, and there's no means to refresh them through the pinpoint reloads. And ultimately, you have to update systemd, and you may update it frequently if you're using it very heavily in your infrastructure; daemon-reexec will still have just as blocking an interaction as it does today. Another impact you'll run into is that even if we have pinpoint reloads, there will still be a bunch of systems and configuration management tools out there that assume they have to reload all of the state when something changes. It'll be a while before things actually target their reloads more surgically. So, to go over what happens today in a little more detail than just "tear down the world and rebuild the world": for daemon-reload — oh, I had a third bullet point on here — for daemon-reload, it basically serializes the runtime state, deletes the data loaded from disk, reloads that data from disk, and then deserializes the runtime state. As a convention for each of these slides, for today and for the ideas going forward, anything in italics is blocking: nothing else gets done in the main event loop while it's happening. And generally, these steps are not amortized in any way that would bound how long they block the loop; each of these operations will just take as long as it takes. The biggest benefit of the current design is simplicity. And I don't want to underrate that, because simplicity generally means something is robust: it's easy to make correct, it typically works, and the design doesn't have a lot of challenges around that.
The biggest problems with it today are heavy blocking and high-privilege parsing. Heavy blocking is the main thing I'm talking about: when things are reloading, there's a pretty extended duration where things are unresponsive. High-privilege parsing is a little more debatable. This is the idea that we're using PID 1 to parse the data off disk. Whether the content on disk is malicious or there are defects in PID 1's parsing, it introduces some risk to the system, from either a security or a reliability standpoint, that such a critical process can't recover from a failure during this sort of loading. And then we also have daemon-reexec, which operates pretty similarly: serializing state, handing over control, loading data from disk, deserializing. The amount of blocking time for daemon-reexec and daemon-reload is not massively different in my experience. And this supports an important thing, which is updates without reboots — something that you can't do with something like dbus-daemon, for example. The cons are pretty much the same as for daemon-reload. So I have about five ideas for improving this. They're not necessarily mutually exclusive, in the sense that some of them could be combined, and we could do partial elements of some of them. They're based on some of my experience totally outside the systemd world: my experience from the PHP world, where we basically have a multi-process runtime and work with shared memory segments for some of the storage, and from working with Varnish, where Varnish exposes its statistics through a shared memory segment to a program that has been compiled with the same data structure library for reading that data.
And that way, they allow you to query the statistics of the web traffic going through the proxy server without doing anything like a remote procedure call, or having anything within the actual main process respond to those queries. But I've also had a lot of problems with some of those architectures, in terms of their complexity or their performance impacts. So I'm going to try and cover a lot of that. This first idea is probably impossible, but it would be really neat if we could do it. By impossible, I mean it would probably require some serious black magic in how PID 1 runs. It would be really, really neat if we could basically use the existing reexec process to start a child or something similar, asynchronously, have it load the state, and then hand off control to it. I don't think this is very likely to be doable, based on conversations I've had. But it would be neat, because it would solve all of the reexec and reload problems in one fell swoop. There's probably no way, though, to actually run a separate process and promote it to PID 1, rather than having it take over in place and block in place. And when I say things on here like "too heavy for embedded", I'm usually referring to this: one of the biggest advantages of the current method for embedded, memory-restricted systems is that, because we destroy the world and then rebuild it from scratch, there's no point at which our memory footprint exceeds the old state or the new state. If you want to look at non-blocking methods for reloading, most of the methods we could use generally involve keeping the old world around while we start up the new world. So in cases like this, that would be a problem: the memory footprint would peak at the combination of the old state and the new state. Anyway, I don't have a diagram for this one, because it's impossible.
But I do have a whole bunch of other ideas for possibly changing this. Another way we could do this is by treating a separate daemon as a library for this information. Rather than having PID 1 maintain the unit data itself, or even have direct access to the data structures, there could be a daemon that is queried over, say, a socket or D-Bus or some other inter-process communication — basically a unit data database. Presumably this would run alongside PID 1, and it would support either starting up a second instance of it to load the new world state, or doing the complex juggling necessary to switch over the state by reading new unit files off disk. For example, this sort of thing, especially isolated outside of PID 1, could run something like RocksDB, an embedded database that Facebook develops, which supports snapshotting: you can basically say, here's all the state that I have, I'm going to snapshot this right now, and all the changes going forward will be ignored by the current PID 1 until I've finished reloading all the state; then I tell PID 1, here's your new state, or switch over in a way where it accesses the new state only after the whole reload has occurred. I would be reluctant to embed a database like that in PID 1 directly, but if it's isolated from PID 1 it's a lot less critical, because you can replace that process much less disruptively on the system, and it can also run with lower privileges. Those are the primary benefits of that sort of design.
You get a low-privilege parser, and you get to do high-level IPC. The reason that matters is that I've found, benchmarking inter-process communication, that sending fully baked data structures around is a lot more efficient than accessing shared memory where you have to jump around a bunch of pointers, because there's a lot of indirection in the kernel for inter-process storage that is directly used by two different processes; and it's reasonably efficient to ship something on the scale of unit data over something like a Unix socket or D-Bus. It also has the potential to be extremely memory efficient: because we could use more complex methods in a second process, we would be able to keep the memory footprint pretty low. With the sort of copy-on-write system I'm talking about, with something like RocksDB, the memory footprint would only peak at the data for the removed or changed units from the old state plus the new replacement units in the new state. It wouldn't peak at twice the data; it would peak at a slight bump. In terms of the ability to juggle old and new state, that's about as efficient as you can get with a copy-on-write system. And one thing that would be really neat is, if we got this sort of storage to work well, we could probably retire daemon-reload. We could probably just do daemon-reexec, and we could simplify PID 1, because with all this data stored in a separate process we wouldn't have to worry about the most efficient way to reload state in PID 1. We could simply say: PID 1, reexec, hand over your runtime state, and continue accessing this daemon.
Anyway, another idea that I think is in some ways more attractive — and I know Lennart has heard at least one other person propose something along these lines — is creating a sort of unit loader that constructs a pre-parsed image file: all of the data already structured, all the dependencies already constructed, everything in as close to the final state as possible, as a file. What can happen there is the unit loader loads the units off disk and creates an image file that gets written to disk — it could be written to a tmpfs for all that matters — and then PID 1 can mmap it, or do a similar operation on it. The nice thing here is that you could do an almost OSTree-style switchover with it: you could have the current parsed data that systemd is using in effect, and based on the way Linux handles mmap, you could actually delete that file off disk, as long as it's still mapped by PID 1, and replace it; and as long as PID 1 remaps the data when it switches over the world, it would be able to switch to the new configuration without waiting to parse a whole bunch of stuff off disk. This would make the switchover — well, it would be atomic, as it is today, in the sense of switching the world, but it would also be nearly instantaneous to switch which world it was talking to, with the exception of any kind of serialization or deserialization. And serialization and deserialization are not what dominate the time of the reload or reexec process; it's the disk access. This has some challenges, though.
When you map memory this way, you end up with complexity around either mapping it at the same base address every time, which can be fragile, or using offset pointers within the memory itself. While those aren't a big performance impact on their own — it's basically just adding in the base address of wherever the memory is mapped — it means that every single thing you do with the data structures in there has to be abstracted through the expectation that the pointers carry an offset. So it can complicate the code. On the upside, even though it complicates the code for accessing that data, I don't think there's a hugely high risk of defects in writing such a system, because when you get it wrong and don't apply an offset, it's very obvious. That's in contrast to some of the other approaches I'll be talking about, where we might do things with threads or other ways of managing the duplication of data: if you get some sort of cross-link between those data structures, where some of the old data refers to the new data or vice versa, you could end up with some very broken data structures as part of the reload, whereas this is that nicely packaged file. One of the other really nice things about this is that it's extremely testable. You can basically turn the loading off disk into a sort of reproducible-build question: for this unit state on disk, you expect this data structure to be created, and you can interrogate that data structure. It's probably a little high-level for a unit test, but it's certainly a nicely compartmentalized test that doesn't require running a daemon to exercise all of the parsing and all of the data structure instantiation within the file, because it wouldn't have to be a daemon that constructs this stuff; it would just have to be a process.
The fourth idea here is something that I know has been talked about in previous years. I don't actually find it super appealing, because of some of its fragility: moving the reload operations to a thread. Right now, PID 1 is a single process and single-threaded for the PID 1 operations, and that means we don't have to worry about concurrent access to data structures, locking, or any of the other concurrency primitives we might need if we adopted something like this. It is probably possible, though, to adopt this without hugely cascading effects on the rest of the design: in the thread, you could rebuild the new unit state off disk in a completely isolated area of memory, where none of the pointers into that memory would be visible to the rest of PID 1 until it was time to promote it to the current world. Then the main loop would switch the root pointer for the tree of all the data structures over, suddenly be using the new world, and we could destroy the old one. But it still has a higher risk of defects than the other approaches, there's no testability benefit, unlike the previous design, and it doesn't help us deprivilege the parsing and data structure construction the way the previous design does. So in a lot of ways, I would prefer constructing an image file over reloading in a thread and keeping it all in PID 1. This last one is also kind of interesting. Unlike all the other ideas, it doesn't solve daemon-reexec, but it does solve daemon-reload: the idea that we could exploit the pinpoint reload functionality that loads individual units to do a daemon-reload incrementally, across the whole unit tree.
This would require a pretty complex implementation for managing the state transitions, because the only way to amortize this and keep everything single-threaded and single-process is to bite off smaller chunks of the blocking work at a time. But we could do amortized scans of the unit configuration on disk, which basically means scanning chunks of it at a time, so we could cap how much interruption each scan causes. We could note the disparities versus the current state, because we're already just walking through all the files on disk anyway, and not locking anything on disk. We already have possible race conditions around reloads, where something you change gets scanned earlier versus later, so this doesn't introduce a new risk in loading the snapshot of configuration off disk, but it does extend the window in which weird disparities can be created. In any case, we could walk through it, identify the disparities, and use an algorithm like Boolean satisfiability, which is actually part of DNF and libsolv as used by package managers — it started off at SUSE and then got adopted for DNF on Fedora. What it answers is this question: what changes can I make that maintain all of the invariants in the state of the system — what changes can I make such that all dependencies are still valid? For example, if A depends on B, and I'm removing B as part of my reload, then I need to make sure to alter A to not depend on B before removing B. I basically need to make sure that I don't create dead ends in the dependency graph as I'm doing these state transitions, because what you want with this sort of amortized change is that each time you take a step, you take the minimum step necessary to maintain a coherent state.
And then in most cases, this would be pretty simple, because most changes don't have any cascading effects. The only time a change would have a larger blocking effect is if it had a huge cascading effect on the dependency tree. But that's probably okay, because it's hard to avoid a huge change to cascading dependencies causing a big interruption to the operation of the system. Anyway, with that, I will open for questions. And we have five minutes. In the back. Hi, it's more of a remark than a question. What you should keep in mind, if you do any of that, is that there are use cases where this isn't an issue. In the embedded world, things usually don't change after deployment, and what you should make sure of there is that the initial startup of systemd is not affected. Yes, that's why I've been referring to the memory footprints of some of these approaches: some of them could double the memory footprint of systemd during a reload, and some would not. And that's why I've referred to things like "too heavy for embedded", because some of these approaches basically spin up an entirely new world of data while maintaining the old one. That said, I don't think the footprint increase for this one is that significant. Lennart? You mentioned you had experience with Varnish, which implements something like this. What precisely do they implement? Do they go for the shared memory and pointer offset thingy, or what do they do? I believe they maintain an inter-process mmapped block of memory, using the mechanism where, when you do the mmap, you flag it to the kernel that it's going to be shared between processes. And the main Varnish process is the one running all the web requests — for people unfamiliar with it, Varnish is a reverse proxy cache for web applications.
What they do is: the main process writes statistics to the shared memory block, but doesn't otherwise interact with any inter-process mechanism on a remote procedure call basis. And then the tool varnishstat, which shows you what's going on with Varnish right now, connects to that same shared memory segment read-only and shows you the current statistics of that Varnish in operation. So they don't actually do that for configuration, but for statistics? Right — it's kind of the reverse of what I would do for systemd, in the sense that the alterations happen within the critical daemon and the read-only access happens on the side of the statistics tool. But I mean, it's so much easier if it's just counters, right? Do you know of any actual daemon that implements something like the relative pointer thingy, the memory mapping stuff? For relative pointers and memory mapping — I've done some extensive work using the Boost libraries for this. Not that we would have to use Boost for this sort of thing, but they have a whole system where you have data structures in shared mapped memory, or just mapped memory, with offset pointers, and everything down the chain that runs those data structures has allocators that are aware of the offset pointers and the region of memory you're working in. That case is a lot more complicated than what I'm talking about here, though, because we'd basically be creating a file. We wouldn't need a real memory manager, because every time we allocated a new structure we would just make the file a little bigger, and ultimately we'd have one file on disk. The other big difference versus the Varnish case is that Varnish is updating truly inter-process shared memory that is changing constantly, whereas this would be writing a file that is completely static once it gets handed over to PID 1.
So there's not as much of a memory management issue, and there's not really as much of a locking or coherency issue: obviously, if you have two things changing the same inter-process memory, you need atomics or locks or something, but we wouldn't need that here. It would be closer to what's happening on Fedora with image creation for containers right now, where they've created a tool where, instead of creating the container, setting things up inside it, and capturing the image, you basically build the image offline and then launch it. It's not quite the same thing in terms of offsets and such, but the idea of creating an offline build of something that then gets loaded for operational use later is definitely in use as a process. A question on that mmapped file as well: how do you prevent it from being exploited? Because you have relative addresses in it, and an address could point inside some other space that you don't expect. So, relative addresses — I don't see how the risk would be substantially higher than... I mean, I'm not actually sure what the exploit path is, you mean? Well, that file could be exchanged before it's passed to systemd. So you're saying that if you have an exploit already, if you're in the system and you have access to the tmpfs, then you could swap that file for something else that contains hostile pointers. We just need to manage the permissions for the files. The same risk path exists for the actual unit files themselves: if you can manipulate arbitrary files on the system, then you can create a unit that does whatever you want when it gets loaded by systemd. So we just need to choose a place to store it that is as safe as the unit files systemd is reading off disk. I don't think we have any more time for questions, but I'll be around. Thank you.