Cool. So my name is Zeal, and this is my teammate Zoltan. We're here today to talk a little bit about how we run our containers at Facebook, specifically about how we use open source tools to build our container images. The title of the talk is a little misleading: it was originally supposed to be given by a different teammate, but he couldn't make it, so we restructured it to focus more on these open source tools. First we'll go through a little background on Tupperware, which is our container platform. Then Zoltan will talk about Tupperware, Btrfs, and systemd, and how we use them within Tupperware and within our image build tool chain. After that I'll talk about how we use Chef to tie these things together, and finally we'll wrap up with an overview of how we build these images.

Hello. Hi, I am Zoltan. Welcome. So let's talk a little bit about Tupperware and what it is. Tupperware is our containerization and scheduling system within Facebook that we use to run a lot of jobs. Essentially the intent of the system is to manage resources at both the cluster and the host level, with fine-grained resource control. We want to run all of our jobs in isolated environments, so we make sure that jobs don't interfere with each other, and also for security purposes. Our resource control goes from the cluster level down to the rack and host level, and on the host we can also control memory, CPU, I/O, and other resource consumption. We also support things like different styles of jobs, restart and update policies, and co-location parameters, so it gives our users a lot of flexibility in how they deploy and run their services.

This is a bird's-eye view of the architecture. Users specify a job configuration, which is a domain-specific language in a text file, something similar to Python. They put in all the details they require and the binaries they want to run, and then use a CLI to send it to our schedulers. The scheduler has a concept of a job that is tied to this configuration file. It separates the configuration into two main pieces: one related to scheduling and placement, and the other related to the actual job we have to execute. The job-related details are sent to the hosts themselves, where the host is responsible for running the binary the user has requested. The scheduler maintains the state of all tasks and hosts, so in case of host failures, reallocations, or rescheduling, we can move tasks around. All the binaries we actually need to run inside our containers are fetched from our distributed package management system.

Tupperware provides both shared and private pools for users. Shared pools are where most people just launch their jobs and we co-locate them with other random users' jobs. Then there are some users that require their own pools, either because they want specific settings or because they just want a dedicated set of resources, so we carve those out for them and they get their own domain there.

You might ask why we don't use some existing industry solution. One of the main reasons is that when we started working on Tupperware, none of these were really ready for our scale. We also want tight integration with Facebook systems so we can do a lot of optimization and efficiency improvements.
Why don't we use VMs? Well, the resource overhead is simply not worth it for us. Containerization provides good enough isolation for most of our jobs, since we control both the binaries and the hosts. It's also much harder to debug issues in a VM environment, so we want to avoid all of that.

How did we build images previously? We would create a file system chroot and run a yum bootstrap inside it, then install some more packages and some custom configuration. We would replace the init system with BusyBox init for legacy reasons, and then we would bundle this into a tarball and distribute it as a package to the fleet. Building was done by a set of shell scripts, and we also had to pin several package versions, which would cause problems from time to time. Customization was hard because of these hacky scripts.

Once this golden image, so to speak, was deployed to the fleet, starting a task meant downloading the chroot tarball, launching the container with a form of LXC, and then performing further customization for the customer. Since we had only one golden image, every customer had to configure their own container environment, which included installing more yum packages and doing more yum configuration. Then, finally, we would launch the main process inside the container. Beyond installing packages, this customization phase also had a pre-run command, which was effectively a shell command we would execute in the container before starting the final command. The packages were pinned to the versions from when the job was first started, which caused some interesting bugs from time to time. We would install via yum on every start, and in fact, if you had multiple containers of the same type running on the same host, we would install into all of them, so that was quite heavy.

The pre-run command was intended for simple stuff like "let's do a chmod" or "let's run some background service", but it was never intended for anything complicated. Of course, our users are very resourceful, so they started working around this and implementing what were effectively bash scripts in this pre-run command, chaining commands with "&&" and writing configuration files. That was really not okay, because it was really, really hard to debug and really, really hard to pin down the actual issue inside such a complex pre-run command.

So the drawbacks of the whole system were that we had this one single golden image that someone had to update with a pile of hacky scripts, and once that was done, it had to be deployed to the entire fleet. There was no customization per tier or anything like that; we didn't apply any customer- or application-specific customization at build time. The pre-run command was awful, and the task setup was repeated every single time a task was started.

So the solution is to use some modern tools to achieve something better. What do we want to achieve? We wanted an image build system where we can build both generic and customer-specific images, which gives us more flexibility and more control, and we wanted reproducible builds. We wanted transparency, so we can actually understand what's happening in these images. And we imagined a layering system where you can build custom-defined layers on top of each other and then ship those to the fleet as required.
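For reference, the legacy flow described earlier looked roughly like this hypothetical sketch; the paths, package names, release version, and BusyBox location are illustrative assumptions, not Facebook's actual scripts:

```sh
# Hypothetical sketch of the legacy golden-image flow; all names are illustrative.
root=/tmp/golden-chroot
mkdir -p "$root"
# Bootstrap a base OS into the chroot with yum, then add extra packages.
yum --installroot="$root" --releasever=7 -y groupinstall core
yum --installroot="$root" -y install openssh-server rsyslog
# Drop in custom configuration and, for legacy reasons, a BusyBox init.
install -m 0755 busybox "$root/sbin/init"
# Bundle the chroot into a tarball and hand it to the package system.
tar -C "$root" -czf golden-image.tar.gz .
```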
So Btrfs: it's a next-generation file system. I would say most of you are familiar with it, but if not, think of it as similar to ZFS. It has a lot of features, but the main ones we use to achieve our goals are copy-on-write, subvolumes, snapshots (which are a special type of subvolume), quotas, and the advanced cgroup I/O control. Read-only and read-write snapshots are also really important: they let us produce immutable base images while still giving the actually running tasks a read-write environment.

Btrfs gives us much lower disk space usage, since we can share a lot of the base data. It obviously improves our disk I/O, since we simply have to write less. The data caching improves our startup time, since once a task is running we mostly hit the cache. We can version layers independently, and we can also give each layer a different update schedule, which is really important when upgrading the base system versus customer-specific stuff. For example, if a customer wants to build their package or their layer once a day, or maybe multiple times a day, we can accommodate that without having to rebuild the whole stack of images.

So how does it work? Imagine you want to launch a task on a fresh host. We pre-ship the base OS onto the host, so it's always already there, always the latest, always up to date with security patches. We also pre-ship the Facebook customization layer: we install SSH certificates, the firewall, some other fancy things, you name it. This is the standard image that everyone has to build on or run off. Then the customer comes and says, please launch my task, which has all these binaries and configurations in it. So we ship a layer using btrfs send and receive, which are effectively binary diffs of the file system, to rebuild their layer on the host, and then we create a read-write snapshot of this read-only stack to launch their task. If the user chooses to run more tasks, we just create yet another read-write snapshot and launch the next task in it. If we happen to co-locate another user's custom job, we can just apply another layer diff and another task snapshot. And if a user has a simple binary that doesn't require any customization and they just want to fire and forget, we can do that too: create a snapshot, put their binary in, and run.

This gives us a lot of flexibility in how we ship and build these images, and it makes the system not only more efficient but also gives us modern tools to work with, tools that are familiar to most engineers. Btrfs does all the layering for us; we don't have to implement anything, it's just done magically in the kernel. The send and receive diffs also save bandwidth across the fleet, since we don't have to ship huge images all the time, only the specific binary diff layer that the user requires.

systemd-nspawn is the tool we use both to build and to run the images. We actually run systemd inside our containers as well, so we are technically open to the pod concept. It's really easy to integrate with our existing tools. It just works: we build once and then ship everywhere. I think the only problem we've had with nspawn so far, and that's because we use it directly, is that it doesn't set up cgroups for us, which we require to control resources.
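A minimal sketch of that layering and shipping flow, assuming made-up subvolume names, paths, and hosts; the real Tupperware layout and pipeline are internal:

```sh
# Build host: a base layer plus a customer layer built on top of it, both
# captured as read-only snapshots so they can be sent as binary diffs.
btrfs subvolume create /vol/base
# ... install the base OS and Facebook customization into /vol/base ...
btrfs subvolume snapshot -r /vol/base /vol/base.ro
btrfs subvolume snapshot /vol/base.ro /vol/customer-app
# ... install the customer's binaries and config into /vol/customer-app ...
btrfs subvolume snapshot -r /vol/customer-app /vol/customer-app.ro

# Ship only the customer layer as a diff of the filesystem; the host already
# has the pre-shipped base layer to use as the parent.
btrfs send -p /vol/base.ro /vol/customer-app.ro | ssh host 'btrfs receive /vol'

# On the host: every task gets its own read-write snapshot of the read-only stack.
btrfs subvolume snapshot /vol/customer-app.ro /vol/task-1234
systemd-nspawn --directory=/vol/task-1234 --boot
```

Because both parent and child snapshots are read-only, btrfs send can emit just the difference between them, which is what keeps the per-layer shipping cost small.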
And Zeal will now talk about how we customize our containers with Chef. So of course we use Chef at Facebook. It might be a little unclear why we're using Chef to build these images, and there are a few reasons. The first is that we have a lot of familiarity with Chef and how it works, and a lot of tooling and experience that make it really easy to work with — not just for the teams building Tupperware and the operating system layer, but also for the application engineers who use these tools to configure their own systems. They already know how Chef works and all the caveats of using it, so using it to build images is something they're already familiar with. We also have a good relationship with the upstream Chef community. We do a lot of work with them to make sure we can continue to use Chef as it changes, and we knew that if we ran into any Chef-specific problems rolling out images across the fleet, we could discuss them with the open source community and either get those issues fixed or fix how we're using Chef, whichever is appropriate.

First, a few notes about Chef at Facebook generally, not specific to images. We use the Chef APIs a little differently than most people. In open source, most Chef APIs use resources and providers to implement whatever you want to manage on the system. Many of our APIs are built that way for one reason or another, but a lot of them aren't; they use attributes instead. If you're not familiar with Chef: you can attach attributes to a Chef node that just represent data, and we've written Chef code that walks through the attributes attached to the node and implements whatever they describe. As an example: on the top is a fictional resource I made up; on the bottom is an open source cookbook we provide — it's on our GitHub page — that will set up a systemd timer for you. Something like the top resource would just go in and lay down two systemd unit files: first a service, and then a timer that kicks off that service using the options you've provided. The disadvantage is that when you remove that timer sometime in the future, you need to remember to delete both of those unit files, or maybe the timer resource implements a delete action; either way, you need to remember to have your Chef code execute that for some period of time afterwards to make sure the job is actually gone. With the attribute API model, you attach something to the node attributes and make sure the fb_timers cookbook's default recipe runs; it enforces that whatever is attached to the node object is on disk, and anything on disk that doesn't match the node attributes gets deleted. So in a Chef recipe that sets this up, you can literally just delete those four lines and the timer will disappear off the system as well. That's the primary advantage we get from doing it this way.

We also have a script for running Chef. It does a lot of nice things, like managing log files for Chef, and it's platform independent: we run it on Windows, OS X, and Linux within Facebook. It's also open source, on GitHub, and it has hooks that allow for extending it.
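Returning to the timer example for a second, here is a rough sketch of the service/timer unit pair such a cookbook lays down; the job name, schedule, and command are invented for illustration:

```sh
# Illustrative service + timer pair; name, schedule, and command are made up.
cat > /etc/systemd/system/scrub-cache.service <<'EOF'
[Unit]
Description=Scrub the local cache

[Service]
Type=oneshot
ExecStart=/usr/local/bin/scrub-cache
EOF

cat > /etc/systemd/system/scrub-cache.timer <<'EOF'
[Unit]
Description=Run scrub-cache every 15 minutes

[Timer]
OnCalendar=*:0/15

[Install]
WantedBy=timers.target
EOF

systemctl daemon-reload
systemctl enable --now scrub-cache.timer
```

Under the attribute-driven model described above, Chef derives exactly this pair of files from the node attributes on each run and removes any managed units that are no longer declared, so cleanup needs no explicit delete action.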
We've used the hooks in that script extensively to be able to use the same script in the different environments where we run Chef, like bare-metal CentOS hosts, Windows VMs, or inside Tupperware images.

As far as how the Chef code is actually used for images: the first thing we do is take a snapshot of the version control repository where we store our Chef code. That Chef code is then versioned and deployed as a package, and since it has a version, you can deploy it alongside your application. So when your application changes and you need to change its config file, you can bundle those two changes together in the same image. This also allows for really reproducible builds: you can just roll back the package version to get the old version of the Chef code. This is something we can do for images, but we don't do it at all for Chef on hosts. For hosts, we only go forward in time and never, ever go backwards. That's a bit of an anti-pattern for images, but we find it works really well for managing more application-centric stuff.

We also enable a feature called attribute injection. Basically, we have a chefctl hook that scans a directory on disk and uses any JSON files present in that directory to define node attributes. chef-client already has the --json-attributes option, which does this, but we do a little extra work to take all of the JSON files and merge them in a consistent way, so you end up with a single JSON file you can pass to --json-attributes. Since we use attribute APIs for many of our Chef APIs, this lets you configure what Chef does just by dropping a JSON file on disk. Other systems that want to tell Chef what to do can do so the same way; they don't have to generate Ruby code, which makes it pretty easy to use. And as a matter of fact, our image build tool chain uses this to configure the run list for the Chef run and tailor what Chef code gets run.

We also have a tool at Facebook called Taste Tester. Taste Tester is a Chef testing framework; we use it extensively at Facebook, and it's also open source. We've integrated it into our image build tool chain, so if you want to, you can invoke an image build in a sort of test mode. It never produces an image, because we don't want people manually mucking with their images — we want everything that produces a production image to be defined in source control. But if you just want to test your Chef code, you can run an image build in this development environment and it will test in the image build context, and whatever changes you make in your version control repository are synced into the image automatically. This allows for really easy development of the images.

One thing we've done to tailor how the image build process works is a systemd unit called task-init.target. The way this works is that Chef runs at image build time and drops off this task-init.target. Then any application-specific Chef code that needs to run can install its own systemd units, whatever the application needs, and any service that has a WantedBy= line pointing at this target gets started during task start-up.
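Going back to attribute injection for a moment, here is a rough illustration under assumed directory paths, attribute names, and merge logic — not the real hook implementation:

```sh
# Sketch of attribute injection; paths and attribute names are illustrative.
# Other systems (e.g. the image build tool chain) drop JSON fragments here:
mkdir -p /etc/chef/attributes.d
cat > /etc/chef/attributes.d/run_list.json <<'EOF'
{ "run_list": ["recipe[fb_init]", "recipe[myapp_image]"] }
EOF
cat > /etc/chef/attributes.d/myapp.json <<'EOF'
{ "myapp": { "port": 8080 } }
EOF
# A hook merges every fragment into one document and hands it to chef-client,
# which already supports node attributes from JSON via --json-attributes:
jq -s 'reduce .[] as $f ({}; . * $f)' /etc/chef/attributes.d/*.json \
    > /etc/chef/merged-attrs.json
chef-client --json-attributes /etc/chef/merged-attrs.json
```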
So during the image build we install this task-init.target and whatever units are needed, and then sometime later the image gets run on a host, and we basically run `systemctl start task-init.target`, which immediately starts all of the unit files that are wanted by task-init.target. This basically replaces the pre-run command: any daemons you need to run in the background, or configuration steps you need to do at run time, you can set up as systemd units at build time, and they will be run at task start-up. And since they're run by systemd, you get a really nice graph of which ones failed and which didn't, and you get standard output and standard error from all of the jobs. It's really easy to use this way, which is night and day compared to the pre-run command we had previously.

So the way we actually build these images now: the configuration for the images basically does package installation via yum. If you need more advanced configuration for your application-specific image, you do that with Chef. You also use systemd for both the task-init units and other related tasks. And since you can enable units that run at task start-up, you can also enable Chef at run time, if you want Chef to catch the system up, since some time may have passed between build time and run time.

We also have a build system for these images. We trigger rebuilds based on changes in our source control repository, so if something changes in the Chef code or in the application code, we can automatically trigger a new build of whatever images depend on it. Images also build on a schedule, so if something hasn't been rebuilt recently, it gets rebuilt automatically to keep everything up to date. But we only ship new images if there are actual changes; we basically do this by checking source control for what has changed and which source control files map to which files in the images. And since we have Btrfs, we can ship binary diffs rather than whole images: for that stack of images you saw earlier, we only need to ship whichever layer changed, not the whole stack. We're also working on an automatic build and test validation framework for this as well.

So, the end result. To build a Tupperware image now, instead of creating just a bare chroot, we create a Btrfs file system and basically run the yum bootstrap inside it — that's for the CentOS layer. At higher levels we don't do the yum bootstrap, because that was already done in the lower level; instead we customize it with Chef. We don't need to do the per-task package installation or custom configs, and we don't install BusyBox as /sbin/init anymore. Instead we just do customization with Chef, and whatever Chef is configured to do, it enforces within the image. We bundle the resulting Btrfs file system into a tarball and distribute it through our package management system.

Task start-up now: basically, you download whatever image the job is configured for, we launch it with systemd-nspawn, and we don't do any setup inside the container anymore. Instead, we start that task-init.target I mentioned earlier, and once it has completed successfully, we launch the main process inside the container — whatever that task is configured to run, a database or a web server or whatever. So the task configuration no longer has a packages list or a pre-run command.
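Putting the new flow together, here is a hedged sketch of roughly what build and start-up look like; the subvolume names, target name, and main command are assumptions based on the description above:

```sh
# Build time, with illustrative names: the CentOS base layer is a yum bootstrap
# into a btrfs subvolume; higher layers are snapshots that only run Chef.
btrfs subvolume create /build/centos-base
yum --installroot=/build/centos-base --releasever=7 -y groupinstall core
btrfs subvolume snapshot /build/centos-base /build/myapp-layer
systemd-nspawn --directory=/build/myapp-layer \
    chef-client --json-attributes /etc/chef/merged-attrs.json   # attrs injected by the build tool
tar -C /build/myapp-layer -cf myapp-layer.tar .                 # handed to package management

# Task start-up: no setup inside the container, just the target and the command.
systemd-nspawn --directory=/vol/task-1234 --boot &   # boots systemd as the container's init
sleep 5                                              # crude wait for boot, for this sketch only
systemctl --machine=task-1234 start task-init.target            # replaces the pre-run command
systemd-run --machine=task-1234 /usr/local/bin/myapp --serve    # the task's configured command
```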
So we don't need to do either of those customizations at run time anymore. The only things the task configuration has now are an image — which image it should be built on top of — and the command to run. And that's pretty much it.

We saw a lot of performance improvements when we started rolling this out. The Btrfs file system in particular gives us huge I/O and disk usage improvements, and we were also able to reduce downtime during upgrades significantly. Since you no longer need to do runtime customization during start-up, you can just lay down a Btrfs file system somewhere on disk and start the task immediately.

Some of the lessons we learned doing this: systemd-nspawn is extremely easy to use and very robust to changes. We did have some issues while using it, but that was mostly because we misread the man pages. Btrfs layering is really simple and easy to use, and both systemd and Btrfs integrated very well with our internal tools. We also learned that shell commands don't make good configuration management for anything more complex than, say, a single simple command; once you start trying to chain multiple commands together and make them reproducible and testable, it becomes a nightmare very quickly. Also, in our container system, BusyBox doesn't really make a very good /sbin/init; it's just not built for the sort of scale we're running at.

So, questions? Yeah? At the back. Can you raise your hand? Okay, sorry.

In the beginning, I saw that you were using systemd-nspawn and systemd for Tupperware. There's a lot in common with, for example, the rkt container runtime. Wasn't that considered as an option?

I don't know about rkt specifically. systemd-nspawn was kind of an obvious choice because it was already installed on all of our machines, and we had already started using it in a few other cases, which we didn't talk about, so we already had familiarity with systemd-nspawn from the get-go.

Hi, thanks for the talk. What version of Btrfs are you using in production? Are you tracking mainline?

Yeah, I can take that. We are actually using almost the latest. One of the reasons is that there are quite a few Btrfs bugs in earlier kernel versions; I think for us the recommended version is 4.6.15 or above. We work with quite a few kernel developers, both within the company and outside, and we push all the fixes upstream. So if you use any recent mainline kernel, all the Btrfs fixes should be there for you.

Okay, great. And another quick question: you were talking about using btrfs send and receive for exchanging the diffs. Are you pushing those into an object store like S3, or how are you getting them onto the production machines?

When we build these images, we also generate all the layers as diffs with btrfs send, package them up, and just piggyback on our package management system. When we receive a request to run a task with a given image, we follow the tree of layers to the root, fetch each of the layers, and reconstruct it. Once it's on the host, we only fetch those layers that are actually different and required for the new task.

Very cool. Thank you.

Cool. Any more questions? Okay. Thanks.