So my name is Zeal, and I'm here to talk about how Facebook uses Chef Solo. I'm a production engineer at Facebook. My teammate, Davide, gave a really good talk about a bunch of the work that we do on the operating systems team, but the focus of this talk is mostly Chef. Most of this talk is going to be about why and how we use Chef Solo. For that to be palatable, first we need to talk a bit about how prod Chef works, and then finally we're going to talk about some of the results of this effort to make Chef Solo better.

Just as a quick background, Chef is a configuration management system. There have been a few talks about Chef already, so I'm not going to go into too much detail, but Chef code is organized into cookbooks and roles. A cookbook is something you use to install a particular thing, like Apache, and a role is something you can use to group cookbooks together. The run list is the entry point for a Chef run; you would use it to say, for example, "configure a LAMP environment on this machine." The default way people use Chef is an HTTP client-server architecture, where you have a Chef server serving cookbooks and roles over HTTP, and clients request those in order to perform Chef runs.

Chef also ships with a tool called Chef Solo, which is used to run Chef in a serverless mode. Rather than running Chef in the client-server architecture, you can have the cookbooks and roles present locally on disk and have the Chef client use those rather than talking to the HTTP server. In the Chef documentation this is called local mode; you'll see it referred to as either local mode or Chef Solo. If you're reading through the documentation you'll mostly find "local mode," but for the purposes of this talk I'm just going to refer to it as Chef Solo.

So, the way prod Chef works at Facebook: we have a pretty mature prod Chef setup that is based around a dual run list.
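To make "local mode" a bit more concrete: pointing Chef at code on disk is just a small config file. Here is a minimal sketch of what such a configuration might look like; the paths are made up for illustration and are not Facebook's:

```ruby
# solo.rb -- a minimal, illustrative Chef local-mode configuration.
# These paths are invented for the example; point them wherever your
# cookbooks and roles actually live on disk.
cookbook_path ['/opt/chef-repo/cookbooks']
role_path     '/opt/chef-repo/roles'
```

You would then run something like `chef-client --local-mode -j runlist.json`, and no Chef server is involved at all.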
Every machine at Facebook is part of a pretty homogeneous environment; they all run CentOS 7, as my teammate Davide mentioned earlier. And we allow teams to customize how their Chef run works. The way this works is that we have a run list that's split into two halves. There's the base role, which does all the base operating system stuff and sets up systemd, Chef itself, things like cron and yum, and the other things the operating system needs in order to run. And then teams can provide their own Chef code via a tier role that runs whatever they want; this could be something that sets up HHVM, which is our web server, or MySQL, or whatever.

We have Chef servers distributed throughout our data centers, and a DNS service discovery mechanism that allows any machine in our fleet to look up the nearby Chef server and run Chef against it. And we have a bunch of tooling around this; chefctl and taste-tester are particularly important for this talk. chefctl is a script that will run Chef, babysit the Chef process to make sure it runs successfully, and then make sure that the output of that Chef run, both the exit code and the logs, goes someplace useful. taste-tester is a tool for testing Chef changes: it will spin up a development Chef server, push your changes to it, and then let you configure a production host to use that development server rather than one of the production servers. Both of these are open source, which I'll talk about later.

So overall, we have a really mature production Chef setup. We spent years fine-tuning this and making it work really well for the homogeneous CentOS 7 fleet that we have. But none of this uses Chef Solo, right? So where does Chef Solo enter this story?

The reason we started using Chef Solo was actually Instagram. When Instagram was acquired, it was a pretty large deployment on AWS. They had their own version of Chef, and their own version of their Chef code.
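The dual run list described above can be sketched as a pair of Chef role files. All the role and cookbook names below are invented for illustration; they aren't Facebook's actual names:

```ruby
# roles/base.rb -- the OS team's half: everything the OS needs to run.
name 'base'
run_list(
  'recipe[fb_systemd]',
  'recipe[fb_cron]',
  'recipe[fb_yum]'
)

# roles/web_tier.rb -- the team-owned half, layered on top of base.
# (In a real Chef repo each role lives in its own file.)
name 'web_tier'
run_list(
  'role[base]',
  'recipe[hhvm]'
)
```

The point of the split is that the OS team owns the first half everywhere, while each team only ever touches its own tier role.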
It didn't overlap at all with our own internal tooling, so we basically had these two different versions of Chef in production that we needed to make work somehow. In addition, they were given instructions to move into Facebook containers, which would be Tupperware, and they were required to do this pretty quickly. That led to them cutting some corners in a way that allowed them to make the deadline, but they still wanted to be able to use a lot of the tooling that Instagram was already comfortable with. So they decided to use Chef Solo as a stopgap between what we were doing in production and what they were doing in AWS. That's where Chef Solo enters this story.

The initial implementation of the Chef Solo toolchain for Instagram used the same run list everywhere. That dual-stage run list I mentioned earlier, they just totally threw it out. At the time, Instagram had fewer than half a dozen different workflows, so for them it made sense to ship all the Chef code everywhere and run all of it, and have the Chef code toggle itself on or off depending on where it was running.

To get all the cookbooks onto disk in every environment where they needed to run, they used a package. We have a mechanism for rolling out tarballs over torrent, and they decided to use that as a quick way to get this up and running. They forked chefctl so that it would download this tarball before it ran Chef, and then use chef-solo rather than chef-client to do a Chef Solo run instead of a normal Chef run. And then, as I mentioned, the Chef code itself would inspect the environment, determine whether or not it needed to execute a given piece of code, and just return if it didn't.
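That "toggle itself on or off" pattern is basically an early return guarded by a node attribute. Here's a self-contained sketch of the idea; the attribute names are invented, and the `node` hash is a stand-in for the object Chef provides inside a real recipe:

```ruby
# Simulate the pattern outside of Chef: a recipe checks a node attribute
# and bails out early when it isn't meant to run in this environment.
def instagram_recipe(node)
  # If this environment hasn't opted in, do nothing.
  return :skipped unless node.dig('fb_instagram', 'enable')

  # ... the actual resources (packages, templates, services) go here ...
  :converged
end

enabled_node  = { 'fb_instagram' => { 'enable' => true } }
disabled_node = { 'fb_instagram' => { 'enable' => false } }
```

The cost of this pattern is exactly what the talk describes: every environment has to carry, parse, and partially evaluate everyone else's Chef code on every run.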
And to test Chef changes, you would build a package locally, using a tool they had built for that, and then SCP it onto a production environment and run Chef using that package rather than the one it would normally download. So this is pretty different from how prod Chef works.

To make matters worse, we had another team that started doing pretty much the exact same transition right after Instagram finished. They did pretty much everything I had on the last slide: they forked chefctl, or rather used the fork that Instagram had, and started running Chef Solo from a package. In addition, our build system team started using Chef Solo to cut down on the startup times of their containers. At the time, their containers took a really long time to start up because the build system needed to install a bunch of build tools inside them, so they switched to a long-lived container model where they would run Chef Solo to keep those tools up to date rather than restarting the container every time. And another team started talking to us on the operating systems team about using Windows VMs. That was a wake-up call for us, because none of our Chef code had any sort of support for Windows. Everything we were doing was heavily invested in CentOS 7, so this was pretty far outside our comfort zone.

Just a quick recap: about a year had passed since Instagram started their migration. There were now three teams using Chef Solo, all of them using their own Chef code. At the time, there were about three different ways of managing yum configs. So if we wanted to change some option on the yum servers, which we maintain on the operating systems team and they consume, we would need to go find where each team had configured their yum confs and change that. They were also using their own flavors of the tools: two of the teams were using the solo fork of chefctl,
prod Chef was using regular chefctl, and there was another team running it in yet a different way. One thing that was particularly painful for us was that whenever we wanted to make a Chef change, we would need to test it in three or four different ways, because each team had its own testing workflow. This was a huge cost for us on the operating systems team, because we need to be able to make and test Chef changes really quickly, and we just couldn't. So we decided to invest in the Chef Solo toolchain.

The first thing we asked was: what can we reuse from the production workflow and contribute to Chef Solo to make it better? chefctl is the most obvious choice; they had forked it to begin with, so there were already some common bits. The core bits, just running Chef, making sure it runs successfully, and logging someplace useful, are really good; Chef Solo and prod Chef both need that functionality. But Chef Solo also needed to be able to do other things. In particular, it needed to be able to configure what options you run chef-client with, most notably the local mode option to toggle Chef Solo, and it needed to be able to download a package before the Chef run. We couldn't do that within the Chef run, because then you get a chicken-and-egg problem where Chef needs to install itself in order to run. So we rebuilt chefctl. When we started this it was a bash script; we rewrote it in Ruby with a plugin model, so that we could drop in a plugin file that Chef Solo would use to configure this extra functionality.

We also really wanted to use taste-tester. As I mentioned, testing costs were really significant for us. taste-tester already has a mode where you can run one Chef server and use it to test multiple different machines. We didn't really see a reason why we would need to change that to test Chef Solo.
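The plugin model can be sketched in a few lines of plain Ruby. This is not chefctl's actual API, just an illustration of the shape: the runner looks for well-known hook methods on a plugin object and calls them at fixed points around the Chef run. The hook name `pre_run` and the class names here are invented:

```ruby
# A toy version of a pluggable Chef wrapper. The real chefctl is open
# source; this only illustrates the "drop in a plugin file" idea.
class ChefRunner
  attr_reader :log

  def initialize(plugin = nil)
    @plugin = plugin
    @log = []
  end

  def run
    # Give the plugin a chance to act before Chef starts, e.g. to
    # download and unpack the cookbook tarball.
    @plugin.pre_run(@log) if @plugin && @plugin.respond_to?(:pre_run)
    @log << 'chef-client --local-mode'  # stand-in for the real exec
    @log
  end
end

class SoloPackagePlugin
  def pre_run(log)
    log << 'download and unpack cookbook package'
  end
end
```

With no plugin present, the runner behaves like plain prod chefctl; dropping in the plugin file is what turns it into the solo variant.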
You still just use the one Chef server, and whether the host under test is running prod Chef or Chef Solo shouldn't matter. We also wanted some way to configure a default run list for all the environments that were running Chef Solo. This was something we had seen was really beneficial in production, because we could control what the base OS was doing, and we wanted to be able to do that in all the places where we use Chef Solo. We also knew that we couldn't use the Chef servers, because at the time the Windows VMs did not have access to them; they were running in a more isolated environment, because this was our first exploration of Windows and we didn't really trust Windows all that much.

So in terms of distributing the code, we would use a package pretty much the same way Instagram had been doing it, with one modification: rather than shipping all of the Chef code for all environments in the package, we would ship just the Chef code that one environment needs and try not to include anything extra. This means we don't have any dependencies on the Chef servers, and it's really easy to ship this package into isolated environments like the Windows VMs. And once we had rewritten chefctl with the plugin model, we wrote a chefctl plugin that would download the package before the Chef run, unpack it, and make it ready for the run.

This presents a problem, though: in order to build that package, you have to know what the run list is, and you need to know that in advance of a Chef run ever happening. So we provided a tool that allows teams to request that a package be built. They give us a run list and a target platform, and we include the run list they give us, whatever that is, plus a default role that depends on their platform. So the Windows VMs get their own default run list, and the CentOS 7 containers get a different one.
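That composition step, a platform-dependent default role plus whatever the team requested, is simple to sketch. The platform keys and role names below are invented for illustration:

```ruby
# Compose the final run list baked into a package: a default role
# chosen by target platform, followed by the team-supplied run list.
DEFAULT_ROLES = {
  'windows' => 'role[default_windows]',
  'centos7' => 'role[default_centos7]',
}.freeze

def final_run_list(platform, team_run_list)
  # fetch raises on unknown platforms, so a bad request fails loudly
  # at package-build time rather than at Chef-run time.
  [DEFAULT_ROLES.fetch(platform)] + team_run_list
end
```

The key property is that the default role is decided at package-build time, so the OS team keeps control of the base behavior even in environments that never talk to a Chef server.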
So, the outcome of all this work: after about a year, we were able to onboard all three of the teams that were using Chef Solo onto this platform. We also wound up able to share taste-tester and chefctl, two of the most important tools from prod Chef, so prod Chef and Chef Solo were using the exact same code to test and run their Chef code, with some minor modifications. As we developed this toolchain, several new use cases came up that were able to use it without us really having to modify it. For example, employee laptops are now starting to use this Chef Solo model to run Chef. They had been doing something closer to how prod Chef works in the past, but this fit really well for them, because it's much easier to ship a package onto a laptop than it is to expose TCP through firewalls. We also use this to manage the phones in our data centers, by running Chef Solo on the Linux machines that are connected to the phones and then using the USB connection to twiddle the bits on the phones.

This really reduced the maintenance burden for us on the OS team. When people came to us to ask for best practices, or for help developing their own Chef code, it was much easier for us to help them now that we were all using a common platform. Before, when people were using their own tools, it was really difficult to give advice because we didn't know what they were doing. And it became much easier to test Chef changes, which made it way easier for us to maintain pieces of common infrastructure, like systemd and yum, and much easier for others to contribute to those pieces, because one team wouldn't need to learn some alternative test workflow just to test their changes for another team.

So that's all. Any questions? I don't know if we have time for questions. Yeah, maybe we are officially into the coffee break now, so we can have questions. OK. Questions or coffee, I guess?
I heard in a previous talk that you also use Chef on rack switches. How does that work? So, Chef on rack switches uses the prod Chef workflow. A number of our cookbooks have to take into account that we can't restart the switches easily, so a bunch of our internal Chef code has special cases that say: don't do things that would require a restart of the network or a restart of the host. But they run basically prod Chef.

Is prod Chef going to live a long time, or is the goal to eventually move to a homogeneous system? No. One of the reasons we did this was that, now that we can share the chefctl and taste-tester toolchain between the two, there's no reason to merge them further. Prod Chef is really well-oiled as it is, and one of the goals of this project was to be able to support Chef Solo without impacting the prod Chef workflow. Thank you.

OK, one more. One more, yes. You were mentioning this tool that you use to test cookbooks; how does that workflow actually work when you start from your laptop and want to modify a cookbook? For taste-tester? Generally, most developers that work on Chef use a development server that's in one of our Facebook racks. That development server runs the tool called taste-tester, which is on GitHub, by the way; I'm sorry I didn't have a link to it in the slide. That tool will spin up a Chef Zero server locally on their development server, and then it will SSH into a production server that they choose and reconfigure the Chef config on that host to point at their dev server rather than a production server.

Is there a deadman switch, after a month or after a year? Oh, yeah, yeah. On every host in the fleet we run a five-minute cron job that checks whether the host is in a tested state or not, and if it's gone past some expiration time, it resets the host back to production. So we don't get hosts that are left alone in a test state. So thank you, see you.
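The "revert to production after an expiration time" check from that last answer could be sketched like this. The time-to-live value and function names are invented; the real untesting logic lives in the open-source tooling:

```ruby
# Decide whether a host that was put into test mode should be reverted
# to production. `tested_at` is when the host was pointed at a dev
# server; `ttl` is how long, in seconds, a test is allowed to last.
def revert_to_prod?(tested_at, ttl, now = Time.now)
  now - tested_at > ttl
end

# A five-minute cron job would then do something like:
#   revert_chef_config! if revert_to_prod?(read_test_timestamp, 3600)
# (both helper names here are hypothetical)
```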