Welcome everybody. My name is Lindsay Salisbury. I work for Facebook. I'm going to talk today about container runtimes, and fun times in this complicated part of the infrastructure. Specifically, I'm going to talk about systemd inside containers and how we are transitioning to that model. But first I want to give you a little bit of background on what Tupperware is. Tupperware is our custom-built container system. We started this around 2011 or so. This was before Docker, before the days of Docker, before containers were an industry-standard thing. There's a bunch of custom stuff that we did, that we had to do, in order to manage our infrastructure. It's pretty big. It runs pretty much most of Facebook. There are a handful of places in the infrastructure it doesn't run, but for the most part, almost all of the important stuff runs on top of Tupperware. We run millions of containers; billions of containers have been created and destroyed using the technologies I'm going to talk about, the components that are part of the system. We haven't talked a whole lot publicly about Tupperware in recent years. We're going to start doing more of that. I've given a few talks about some of these areas, specifically about the container runtime, because the space is pretty interesting. We get a lot of discussions around the various technologies that we use and the various technologies other people use. I think the cool part about this tech is that we haven't figured out and solved all these problems yet. There was a talk earlier today that showed the big container space in the industry, and it's just like a big eye chart with a bunch of different stuff in there. So there are a lot of different options for ways to do stuff. I want to share and talk about how we do some of these things. Specifically, I work on a team called Tupperware Agent. This is the host-level runtime.
This is the thing that actually sits on each machine and does the work of setting up and tearing down all the container components. This would be like rkt or Docker in the rest of the world. It's basically an interface with the kernel and a process manager. That's the simplest way to think about this thing. We try to write it in such a way that it just does what it's told. It's not trying to make decisions. It's being told what to do, and our job on the agent is essentially just to execute those operations as they are given to us. This allows us to create a relatively stable system and also to understand the complexity of the behaviors that we have inside as we set up containers and as we tear them down. It also allows us to create some pretty clean abstractions and some pretty clean, testable components that we can put together. I want to talk a little bit about the components that are used. These are more the conceptual components. There are specifics, but there are a couple of high-level pillars that we're interested in discussing. The first one, obviously, is Linux. This is the kernel. This is a hardware interface, essentially. The Linux kernel runs on the machine and gives you access to the actual physical hardware that you're running on. It gives us a process abstraction. This is the thing that's running work. This isn't rocket science. And then it gives us the namespaces. We mostly use PID and mount namespaces. We don't use network namespaces. There was a talk earlier that touched on the fact that we don't use user namespaces and network namespaces. PID and mount are the big ones that we worry about. That's what most container infrastructures are trying to do: isolate processes, and isolate mounts for file systems. cgroup2 is another big pillar component. This is a hierarchical abstraction. It allows us to have resource management in a nested tree.
This gives us a pretty clean abstraction for managing groups of processes. Mostly in the agent, we think about things in cgroups now. We don't really think about stuff as processes or individual components. We think about them as large chunks of black-box processes. And I'll get into a little bit more detail about why that is in a bit. We also use Btrfs as a file system, pretty much across the majority of the fleet. We support both loopback images and on-host subvolumes. There are some subtleties to the benefits of one or the other that I could get into, but the rule of thumb is that loopbacks are useful for shipping things as atomic units. We don't have to care about what's inside that thing. We can just ship the whole file, mount it as a loopback, and then operate on it. The host subvolume is interesting because it allows us to receive Btrfs send streams that come from the build process. That gives us basically not a full file system; it's just a stream that allows the file system to be constructed as a subvolume on the host. This has some interesting implications for I/O as we're applying those subvolumes, and also for cleanup and deletion mechanisms, which I won't get into in this talk, but if you want to ask questions, you can. So we can layer images as subvolumes. We have a subvolume with, say, a base file system; then we take a snapshot and make a bunch of changes, and then we can ship the base and the new image subvolume on top of it independently of each other, or ship them together as one thing. That's a lot of flexibility in how we manage the deployment of file systems and how we think about sticking those file system pieces together. So when a container is started, it gets a read-only subvolume that is a snapshot of the original image that was deployed. This gives us the ability to have one subvolume that all containers can inherit from.
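To make that layering flow concrete, here is a minimal sketch of how an image might be received as a subvolume and then snapshotted per container using the btrfs CLI. The paths, stream locations, and names are hypothetical, and the helper only constructs the command lines rather than executing them (running them for real requires root and a Btrfs filesystem).

```python
# Sketch of the Btrfs image layering flow described above.
# Paths and names are hypothetical illustrations, not Tupperware's
# actual layout; we only build the command lines instead of running them.

def layer_commands(base: str, image: str, container_id: str) -> list[list[str]]:
    """Return btrfs commands to receive a base + incremental image
    send stream and carve out an ephemeral read-write snapshot for
    one container."""
    volumes = "/var/lib/tw/volumes"          # hypothetical volume root
    ro_image = f"{volumes}/{image}"          # read-only image subvolume
    rw_snap = f"{volumes}/containers/{container_id}"
    return [
        # receive the base and the incremental image send streams
        ["btrfs", "receive", "-f", f"/tmp/{base}.sendstream", volumes],
        ["btrfs", "receive", "-f", f"/tmp/{image}.sendstream", volumes],
        # every container starts from a cheap copy-on-write snapshot,
        # so the original image subvolume is never modified
        ["btrfs", "subvolume", "snapshot", ro_image, rw_snap],
    ]

cmds = layer_commands("base-os", "webserver-v2", "task42")
```

The point of the shape is that the read-only image is shared; only the final snapshot step is per-container, which is what makes starting many containers from one image cheap.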
We create snapshots and make modifications to those, and we aren't actually modifying the original volume of the original image. These are the read-write ephemeral snapshots. And then the other big pillar component we use across the fleet is systemd. On the host, this is the process manager; the agent executes commands through systemd to set up and tear down. We use the D-Bus API for that now, and we're starting to expand that infrastructure more. We're basically using this to replace bash and spawned invocations. We don't really want to have to shell out to stuff. We'd rather just have systemd do the work for us. We send a D-Bus request across, it launches a process, and we can interrogate systemd to determine the status of that thing. We also use systemd-nspawn for Btrfs image building. This allows us to execute commands in a controlled environment, a sort of souped-up or namespaced chroot, and do a bunch of operations. We get a sort of hermetic environment that we can use to make sure that the image build works as we expect. So that's on the host. Inside the container, we basically care about one thing, and that is a process manager. When we start containers, we start PID 1; essentially, we start /sbin/init. And we use a process manager for a number of reasons, because we can run multiple container workloads and multiple container processes. We have services that want to run different things, and some that want to run just one thing. It gives us the ability to group workload services together, but it also gives us the ability to run infrastructure services. So we have a whole bunch of stuff that runs inside the containers that we provide as a platform for the other teams that are running services: SSH, syslog, debug tools, the ability to do security stuff.
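As a rough illustration of launching a container's PID 1 through systemd rather than shelling out, here is what the equivalent `systemd-run` invocation could look like. This is a sketch of the idea, not Tupperware's actual call (which goes over the D-Bus API, e.g. `StartTransientUnit`); the unit name, slice, and limits are invented, and the code only builds the command line since executing it requires a systemd host.

```python
# Sketch: launching a container's PID 1 as a systemd transient unit,
# roughly what a D-Bus StartTransientUnit request expresses.
# Unit/slice names and limits are hypothetical examples.

def transient_unit_cmd(container_id: str, rootfs: str) -> list[str]:
    return [
        "systemd-run",
        f"--unit=tw-{container_id}",        # one transient unit per container
        "--slice=tupperware.slice",         # cgroup placement via a slice
        "--property=Delegate=yes",          # hand the cgroup subtree to the inner init
        "--property=MemoryMax=4G",          # resource limits are just unit properties
        "systemd-nspawn", "-D", rootfs,     # boot the container's own init as PID 1
        "--boot",
    ]

cmd = transient_unit_cmd("task42", "/var/lib/tw/containers/task42")
```

Once the request returns, the process exists with its cgroup and limits already in place, which is the "nearly atomic" property the talk describes.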
This is, you know, getting security tokens and things like that. This is stuff that has to run next to every workload because we need it in order to interface with the system appropriately. And the benefit of this is that, from the agent side, from the runtime side, we just have to worry about one thing. We just start a process. We start PID 1 inside of a cgroup, and we can monitor the cgroup as a whole. We can monitor PID 1, and we assume that the process manager inside that container is doing all the right stuff to manage the processes that are running inside there. So it sort of reinforces the abstraction of a workload, right? We have this big thing that is sort of a black box to us. We don't really care what's inside of it so much; we get a request to start it, we spin the thing up, and we monitor and make sure that that PID 1 process is still running, and we have some additional things to introspect and look inside that. So for a long time, we haven't been using systemd. We've been using BusyBox as our init system inside containers. The way that we put services together is we have a lot of bash that gets sort of munged together in a startup process. There's no real coordinated bring-up or shutdown of these containers. We hope that people write good bash, and we hope that people write things in the right order, starting the things that are needed before the things that need them, but we actually don't have any way to enforce that those things happen correctly. There's a lot of poor signal handling. The way that we have been shutting down containers for a long time is we blast a signal out to basically the entire process group, and we hope that they shut down in the right order. A lot of times they don't. We had to introduce this concept called a kill command, which allows us to basically...
People can give us arbitrary bash snippets that will go kill the right thing, hopefully. This is somewhat problematic, because it's super difficult to debug. It's super difficult to see what's going on inside there. There's really no way to test these things in isolation, and service composition is super difficult, because what we have to do is ensure that people are starting things in the right order, and we can't guarantee that with the current system. And maintenance is really hard. If we want to add infrastructure components, we have to modify core parts of the system that are critical-path code, and it basically makes it super dangerous to start adding things into the container like that. So what do we really want? Well, what we really want out of a process manager is an interface for interacting with kernel abstractions. We want to be able to talk to something that will set up processes, set up the services, set up the things that are running inside there cleanly, and basically do it in a predictable way. We need a contract for defining services. We need a way to allow our users to define how they want to run stuff. We want orderly startup and shutdown so that things get started in the right order. If we have to reorder something, we just change the config, restart the thing, and everything starts up properly. We want to be able to shut down cleanly, so we don't have to just blast signals out, so that things turn off correctly. We want predictable behavior, obviously. We want to see that stuff works consistently each and every time. We don't want a bunch of weird random bugs because somebody introduced a tick mark in a bash script that didn't get escaped properly and ends up breaking a whole bunch of stuff. That's happened a number of times.
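The old "blast a signal at the whole process group" shutdown can be illustrated with a tiny runnable example; this is Python standing in for the bash that did it, with a `sleep` process standing in for a container workload. Every process in the group gets the signal at once, with no ordering or dependency awareness, which is exactly the problem described above.

```python
# Illustration of signal-blast shutdown: signal an entire process
# group rather than asking an init to shut services down in order.

import os
import signal
import subprocess

# stand-in for a container workload: a process in its own session,
# and therefore its own process group
proc = subprocess.Popen(["sleep", "60"], start_new_session=True)

# the blast: SIGTERM the whole group, not one managed service
os.killpg(os.getpgid(proc.pid), signal.SIGTERM)

proc.wait()
# a negative return code means the process died from that signal
died_from_sigterm = (proc.returncode == -signal.SIGTERM)
```

With one `sleep` this is harmless; with dozens of interdependent services, nothing guarantees that, say, the log shipper outlives the processes still writing logs.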
The other thing we really want is flexible service composition. What I really mean by that is we have a bunch of these infrastructure services that we need to put together. Certain jobs require certain services. Some jobs don't require them. Some jobs require a lot of them, some require a few, and we haven't had a great way to put these pieces together so that things behave predictably. So, enter systemd inside the container. Again, this is an interface for interacting with kernel abstractions. We really like this because it allows us to start processes with all the properties that we want, via things like the D-Bus API. We get service definition with units. Units give us basically a clean interface and a clean API that users can write to. They can define their services in a way that we can parse, understand, validate, and test. We can ensure that everybody is doing it in a consistent and similar way. We can start processes that are configured properly. If we want to make sure a process is in the right slice, or that it has the right cgroup limits, we can do that nearly atomically, because we just make a D-Bus API call, and when that call returns, we have a process that we can start looking at and managing, and we can do this inside a container. We want to be able to manage dependencies properly. Service B needs to start before Service A. It needs to start after we initialize the container, do some security setup, start SSH, and things like that. We want orderly and controlled container startup and shutdown. Again, this gives us the ability to make sure that when we start a container, we know exactly what the end state is going to be. We can predictably debug it. We can actually understand what the ordering should be before we execute, and then compare it against the actual results. We want proper signal handling.
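That kind of dependency and slice configuration is expressed declaratively in a unit file. Here is a hypothetical sketch of what a sidecar's unit could look like; the unit names, paths, and slice are invented for illustration and are not Facebook's actual units.

```ini
# Hypothetical sidecar unit; all names here are illustrative.
[Unit]
Description=Log shipper sidecar
# ordering: start only after security setup and SSH are up
After=container-security-setup.service sshd.service
Requires=container-security-setup.service

[Service]
ExecStart=/packages/logshipper/bin/logshipper --spool /run/logshipper
Restart=on-failure
# cgroup placement via a slice instead of hand-rolled cgroup calls
Slice=sidecars.slice

[Install]
WantedBy=multi-user.target
```

The `After=`/`Requires=` lines are what replace the "hope people wrote their bash in the right order" model: ordering lives in declarative config that can be parsed and validated.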
We don't want to have to blast a signal out to everything. We want to be able to just tell PID 1 to shut down, and it will do all the right work. Then there's consistent service definition. What this really means, separate from the unit file itself, is that we have the ability to have things defined in similar ways, so that if we have a bunch of services that have to run as infrastructure services in a container, we have a common pattern we can use, and we don't have a whole bunch of cognitive load to reparse and re-understand everything for each service. This links into transferable knowledge. We use systemd on the host. So if we use systemd on the host and we use systemd in the container, when you get into a container, you don't have to relearn how all of this stuff works. In the BusyBox world, it's a completely different infrastructure. You're on a host, you learn how to use systemd on the host, and then you get into a container and it's just completely different, and you have to write all new tools in order to interact with and monitor that thing. Another big part is testable services. We have a hard time testing services that are composed together with BusyBox, because you don't actually know how the thing is going to behave until you put it inside, try running it, and see what breaks, and then you have to do that across maybe a large number of services or a large number of instances in order to actually start seeing some issues. The nice thing about systemd in containers, with the unit files, is that we get the ability to test those things separate from the container. We can actually test them outside the container, unit test them, make sure the thing works as we expect, and when we put it into the container, it does what we want. And if it doesn't do what we want, it at least does the same thing in both places.
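Because unit files are plain declarative text, the "test and lint outside the container" idea can be sketched very simply: parse the unit and check house rules against it. The rules below are invented examples, and `configparser` is only an approximation of systemd's parser (it doesn't handle repeated keys, for instance), but it shows the shape of the approach.

```python
# Sketch of out-of-container unit file linting. The lint rules are
# invented examples; configparser approximates, but is not identical
# to, systemd's own unit file parser.

import configparser

UNIT = """\
[Unit]
Description=Example sidecar
After=sshd.service

[Service]
ExecStart=/packages/example/bin/example
Restart=on-failure

[Install]
WantedBy=multi-user.target
"""

def lint_unit(text: str) -> list[str]:
    parser = configparser.ConfigParser()
    parser.optionxform = str  # unit file keys are case-sensitive
    parser.read_string(text)
    problems = []
    if not parser.has_option("Service", "ExecStart"):
        problems.append("missing ExecStart")
    if parser.get("Service", "Restart", fallback="no") == "no":
        problems.append("services should declare a Restart policy")
    if not parser.has_section("Install"):
        problems.append("unit is never enabled (no [Install] section)")
    return problems

issues = lint_unit(UNIT)
```

The same check runs in CI, on a developer's machine, or against units already deployed, which is what makes the service definitions testable as source code.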
That leads into what we started doing with systemd services: a thing we call composable services. It's a cute little name that's basically an application of the portable services concept. Each of these services is defined as an individual, self-contained component. They have their binaries, potentially all their dependencies, and a service file that goes along with them. These things are bind-mounted into the container file system at a particular path at runtime with systemd, so that we don't have to have a custom image build with that service inside of it and then deploy that thing separately. We can actually have one common runtime file system, or runtime container, that we can deploy out to a bunch of different services or different jobs, and then they take the services that they want enabled and just plug them in on the side with the config in the spec, and the thing gets turned on. The cool part about that is we can update them independently. We can update our base container runtime infrastructure, or file system, without having to incur the cost of users having to rebuild all of their services as well. And it works with a chroot or not. Portable services currently will create a chroot at the root of the portable service path, so all the dependencies are contained in there. That works really well for us for modeling the way Docker images work, or the way Docker containers work, where all the services are lumped together. We also have cases where we don't have those dependencies, because we statically compile most of our binaries. So the only dependency we have is on our C++ runtime. And so we just need to make sure that thing is in place, and we can run the thing without having to actually run it in a chroot. So there are some use cases here that are pretty interesting for us. We call these sidecars.
This is for log handling. When stuff gets written out to disk, something goes through and gzips it, or ships it off, or puts it somewhere. There are lots of different cases here. We actually have a number of different log-handling cases that we've had requests for, and this infrastructure gives us the ability to satisfy those requests without having to build a bunch of new infrastructure. Compliance verification: we're a publicly traded company, and so we have various regulatory requirements that we have to meet, and this gives us a way to actually do that in a way that is scalable. We've got to clean up temp files, clean up stuff that jobs download, and do remote file system mounts: Gluster, NFS, anything like that, some kind of FUSE file systems. But it also gives us the ability to do workload co-location. Users can actually compose their services together, schedule a container to run on a particular job or on a particular machine, and end up with three or four services running together inside the same process namespace, inside the same mount namespace. The difference from how that worked with BusyBox was that people had to basically write a whole bunch of nasty bash scripts in order to make their stuff run together. They'd spawn processes, put them in the background, and then when the container would die, maybe their process didn't die, or they didn't even know whether the process was running or not, because they had to roll their own process management. So, how does a composable service work? We build a package with a binary and a unit file. The agent provides metadata into the container about the service that should be included. We bind-mount the package into the container at runtime at a particular path, and we have a generator that enables the configured service to start at runtime.
We do this early on in the start process, and we can daemon-reload the thing, so if we change the metadata or need fixes, we can daemon-reload and it will re-enable the service in the right place. So here's a little snapshot of an experimental portable service I put together, just what the file system looks like. There's just a path that's mounted in. We have a binary, a systemd directory that has the service file and the timer, and an os-release file that has some specs and information in there. We have a unit file that can use everything you can do in units. It's wanted by the multi-user target. We're going to support the ability to have different kinds of targets that people can specify, so you have a very fixed startup, a very fixed ordering of when these services can be enabled. Yeah, and then the thing runs when you spin it up, and it just does its thing. You can see that the unit file is loaded into a temporary file system location under /run. What we really like about the way systemd sets this /run stuff up is that it gives us a very specific and explicit place where we know runtime data exists. So we try not to modify anything in /etc, because /etc is about what comes with the build, what comes with the image when it's built, and /run is where we do all the stuff that's runtime-specific. So there are a bunch of benefits here. It gives us this concept of service plugins, so that we can actually have a bunch of different teams writing different services that get pulled in by running containers, where we as the Tupperware team don't necessarily have to manage and maintain all these sidecar services. We can outsource that to other teams that have more domain knowledge, and we give them essentially a clean abstraction, like an SDK, to build against. It gives us predictable behavior. We can understand how the thing is working. We can understand how the startup works.
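Conceptually, the generator step above amounts to "read which services the agent configured, then enable their units by symlinking them into a `.wants` directory", which is the same thing `systemctl enable` does. Here is a runnable sketch of that idea; real systemd generators are executables that receive their output directory as an argument, while this sketch uses a temp directory and invented package paths so it can run anywhere.

```python
# Sketch of the generator concept: enable configured composable
# services by symlinking their units into a .wants directory.
# Real generators get the output directory from systemd via argv;
# paths and service names here are hypothetical.

import os
import tempfile

def enable_services(generator_dir: str, unit_dir: str, services: list[str]) -> list[str]:
    """Symlink each service's unit into multi-user.target.wants,
    the moral equivalent of `systemctl enable` at generator time."""
    wants = os.path.join(generator_dir, "multi-user.target.wants")
    os.makedirs(wants, exist_ok=True)
    links = []
    for name in services:
        unit = f"{name}.service"
        link = os.path.join(wants, unit)
        os.symlink(os.path.join(unit_dir, unit), link)
        links.append(link)
    return links

tmp = tempfile.mkdtemp()
links = enable_services(tmp, "/packages/units", ["logshipper", "sshd"])
```

Because the links live under a runtime directory rather than /etc, a daemon-reload after a metadata change regenerates them cleanly, matching the /etc-for-build, /run-for-runtime split described above.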
We can understand what bind mounts are needed, what components are needed inside that composable service. It gives us a defined contract that we can hold people to, and also unit test and validate. We can lint these things and make sure that they're correct. And most importantly, it gives us the ability for these services to be testable outside of containers, in controlled, hermetic environments. So why is all of this important? We have a pretty big system with a lot of moving parts, and so there are a lot of things changing constantly, and without clear, defined APIs and clear, defined contracts, it's very hard to manage and maintain this thing as it grows and continues to grow. But also, we have lots of people working on this thing. We have a lot of code changes, and we have a lot of people poking at these systems, and by having a clear definition of how you build a service that is self-contained and then composed into the container, it gives us the ability to put the weight on other teams, so they can control their own destiny and ensure that when they write something, it's actually going to work in production. It makes it predictable, so that we have a system that we can reason about, that we can write tests on, that we can make predictions about, whether the system will fail in certain cases. And more importantly, it makes it fully supportable, so that we can actually scale this thing out and add more and more sidecar services, and have different kinds of implementations that we, as the one team running this particular infrastructure, maybe haven't thought of, right? We have a lot of different people working on lots of different stuff. This gives us the ability to actually scale it out, not just technically, but also on the human organization side. Okay, that's it. Thank you. Do you have any questions? You mentioned that you have SSH running in the container, I think? Yes.
How does that work, given that SSH has a boatload of dependencies? Does that require the container to be configured in a certain way? SSH is configured based on how we as the infrastructure team and the security team decide to set it up, so the users actually don't have to configure that particular component. But what about all the dependencies, like PAM and so on? That comes along with it. It's part of the base image build that we deploy out to all the containers, so it's just included by default. It's not a question, more of a comment. I would just like to congratulate Facebook on making something reasonable, because actually a whole part of the world is stuck in this mantra of single process, single container. What it created is a pathology of, let's say, Kafka inside of Docker, when Kafka is not designed to work as PID 1. Actually, it's not designed to work that way, but that's another discussion, and Docker is not designed to control that process. So right now we are stuck with the situation where I have to ship Kafka in Docker to Kubernetes, because C-level management demands it, because somehow Kubernetes right now bumps the stock price, and Facebook has done something reasonable. Thank you. So along those lines, the interesting thing about these concepts, and I don't have time to get into all the possibilities, but regarding the capabilities conversation that we had earlier and the running-everything-as-root problem: by using this approach, where we have a process manager inside the container, in its own process namespace, that can set capabilities on the various services executing inside those containers, we can actually make decisions about how we want to limit capabilities for the workloads in there. We don't have to make the blanket statement that nothing can run as root.
We can run a whole bunch of stuff as root, and we can run one or two things as a different user with a different capability set inside of that workload, and we give that to the users and our service owners, rather than, what is the phrase, forcing a circle through a square? I don't know what it is. Question. We have like one minute. How do you manage secrets inside applications? How we manage secrets? So we have a process by which they get sent through a secure channel into the agent; the agent writes them into a secure place in the container, and then they're basically only in RAM, they're not written to disk anywhere, that kind of thing. I think we're done, but can we take one more, maybe? Or two more. Could you talk a bit about any policy or static analysis that you do on the unit files themselves? Right now, at the current state of things, it's basically that the possibility of doing static analysis and linting exists. There is some analysis and linting done on, I believe, unit files that exist on the host in some cases, but mainly it gives us the ability to do it. We treat it like source code, so it's something that we can actually parse, look at, validate, and start flagging things for. We haven't built a whole lot of infrastructure to do that yet, but again, this is about the possibilities that are available to us using this. We're done. Sorry, ask me after. Cool, thank you guys.