So, hi everyone, I'm Michelle and this is Joe. We are from Facebook's client platform engineering team, a.k.a. CPE, not the community platform. Naming is hard, right? There are not enough TLAs to go around. So this talk is going to be about how we manage our Fedora desktop fleet at Facebook, and why we manage a client fleet at all. This is the agenda: first, why we want to manage our client devices; then why it's actually harder to manage client devices than servers; then we'll give an overview of what our fleet looks like and how we manage it. And then, and this is the most interesting part, how we want to collaborate with the Fedora community on the things we both care about. So, why? User experience: Facebook is a big company, we onboard a lot of people every week, and we want the experience to be good. You come in, you get a new laptop, and everything should be set up for you automatically; or after a few years, when you replace your device, anything you need to work and be productive should be there for you. Developer access: Facebook has a lot of internal tools that we build ourselves, and we want to make it easy for the teams that write these tools to package them, deploy them, and have them work out of the box for the teams that need to use them. Security: there are security standards we have to maintain, like setting up full-disk encryption, making sure that if someone loses a device we can wipe it, and making sure that security updates are applied automatically. And auditability: we want configuration changes to be tracked somewhere, so we know what's happening and what's being applied on any given device.
And why is it hard? It's hard because the client is not a server; there's a person sitting in front of the keyboard. So there are things you could do that would break someone's flow: you don't want to, say, run something CPU-intensive and make someone's compile job much slower. Desktops are not connected all the time, and we cannot manage a desktop while it's offline. It's also hard to guarantee that you can remote into a machine to troubleshoot an issue. And those are just the general challenges; we have Facebook-specific challenges as well. We are really big on IPv6, partly because we are such a huge company, and some things don't work so well on IPv6 because almost nobody else runs an IPv6-only environment. We'll get to the details later. And then, and this is similar to what Matt talked about earlier today, Fedora moves both too fast and too slow at the same time for some people. We have some teams with unique needs that always say, oh, we want to stay on the older release a bit longer, and they keep getting nagged by GNOME Software: hey, there's a new Fedora available, do you want to upgrade? And the people who support those teams complain to us: can we turn off this update notification, because when they upgrade, our tooling doesn't work. These are teams with either third-party tools or funky hardware, and by funky I normally mean NVIDIA. And then training. Before we chose to standardize on Fedora for our Linux users, this was even more of a serious issue, because imagine you are a help desk technician: you are hired and told, hey, most people here use Macs, so we hire, say, those Apple geniuses. And then someone comes in and says, my Arch Linux cannot connect to Wi-Fi.
So it's getting better, but it's also getting worse, because now when you tell them, hey, Fedora is a supported platform, people come in with the expectation that it's supported, so you need to know everything to make it work. We are not a majority-Linux company at the desktop; our servers are mostly Linux, but not our desktops. Most people use macOS or Windows. So it's an interesting experience: if we get Fedora right, we will hopefully see more uptake. In the past, some people wanted to run Linux on their desktops, and they came to our team and said, can you support it? We didn't set a direction, so we let people pick whatever they were most familiar with. Our Linux users were mostly on Ubuntu, and then there's a vocal community that uses Arch. This is not an effective way to support a platform. We found that, in our experience, people will contribute the minimum amount of fixes they need to get what they need working, and after that they don't touch the code anymore. So it's a bit of a nightmare. Last year, we decided to standardize and picked Fedora, for multiple reasons. One of them is that we already run CentOS on our server fleet, so we have all the expertise to deal with RPMs. And because we use CentOS, there's already a lot of internal infrastructure to support it, like how to build RPMs and how to deploy them. For instance, a lot of our internal tools, like our slightly modified version of Mercurial, are shipped as RPMs, and the same package can be used on both our servers and our desktops. So this is the view of our client fleet over the past year or so. As you can see, we finally became a Fedora majority; the desktop fleet became mostly Fedora earlier this year. And this is the adoption curve for the different Fedora versions.
So it's normally quite easy to persuade people on ThinkPad laptops to switch to the latest Fedora version as soon as it's released. The people who lag behind, as you can see, Fedora 28 dropped precipitously pretty much around the time it reached end of life, but not before, those are the people on workstations with NVIDIA. We have to nag them and nag them and say, hey, come on, we have to move away from this. And yeah, we still need to get better at forcing people off unsupported releases, because as you can see, Fedora 27 still had active usage months after it went end of life. When I reached out to one of those users, he said, oh, I didn't know it was end of life. So that's pretty much what our fleet looks like. I'm going to hand over to Joe for how we actually manage all the software. Hello, my name is Joe Chalko. I work with Michelle on the client platform engineering team. I primarily work on macOS, but they let me play with the cool kids in the Linux pod sometimes too. And as Michelle mentioned, some of the difficulties in managing client platforms can be pretty unique. For instance, imagine any one of your servers could go to sleep in San Francisco, wake up in Budapest, and then expect to connect and update all of its configurations. And each machine has a unique snowflake sitting at the keyboard, who may be part of a group of similar snowflakes. So many of the configurations we make, we build to be configurable down to the machine or user level. And to do this, we use Chef for our configuration management. At Facebook, we use a specific API model with Chef, which deviates a little bit from the core Chef tools, in that we can set a base config for the fleet and then allow end users themselves to update certain portions of that configuration for their own machines. Chef runs in a declarative way, and the run list in Chef is, by design, ordered.
So we take advantage of that run list order to make changes as Chef proceeds through its run. Chef actually does a couple of passes through all of the cookbooks: it loops through everything to compile all of the settings into a node object, and then it loops back through to execute the resources. So we can take advantage of that as the compile phase runs, and make changes to the configuration based on groupings, username, serial number, network, any criteria we might want to use. And lazy evaluation is another way we can delay configurations that are applied to resources until the end of the Chef run. The API model, like I said, is a Facebook-specific way of using Chef. We manage the platform, but by default, all of the settings are either nil or false. Then, as the Chef run proceeds, we start filling in those configs. Like I said, we can set one value for the entire fleet, and then, as the compile process moves along, maybe a specific group updates one of those settings, so that by the time the resources run, they apply what is specific for that machine. We also promote user choice, in that most of the API settings we have in Chef can be changed by individual developers themselves. And everything goes through a source control and change control process with peer review, so anybody who wants to make a change to their system can get somebody else to sign off on it, and perhaps get a security review as well. And because everything is in source control, it's all auditable, so we know that you, as an individual, made a change to your specific machine, and we can track back, look at the notes, and see why that was done. The example here is the screen saver, and it's a good example because it's cross-platform.
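As a rough sketch of that compile-then-converge API model, here is the idea in plain Ruby. This is not Facebook's actual Chef code, and the attribute names are made up for illustration:

```ruby
# Sketch of the API model: everything starts nil/false, the fleet-wide
# base config fills values in, and later cookbooks in the run list can
# override them per group or per user before any resource executes.

# By default, all settings are nil or false.
node = { 'screensaver' => { 'enable' => false, 'idle_minutes' => nil } }

# Compile phase, step 1: fleet-wide base config.
node['screensaver']['enable'] = true
node['screensaver']['idle_minutes'] = 10

# Compile phase, step 2: a group or individual user overrides part of
# the config later in the run list (hypothetical override source).
user_overrides = { 'idle_minutes' => 5 }
node['screensaver'].merge!(user_overrides)

# Converge phase: resources read only the final merged values, so the
# last writer during the compile phase wins.
node['screensaver'] # => {"enable"=>true, "idle_minutes"=>5}
```

The point of the two phases is that a resource never sees a half-built config: overrides are all resolved during compile, and converge applies the end result.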
Chef is platform-agnostic, so we can run the same resource on macOS, Windows, and Linux, and the underlying resource will take care of the platform-specific code to make those changes. So we can set a default value of true and a max idle time of five minutes, which is mean, probably. Do we really do that? It's awful. Then when the resource runs, it knows what platform it's on, so it makes the platform-specific API calls to apply that screen saver change in this example. And again, we can set that default to five minutes, but somebody may say, no, I need it to be 10 minutes, and they can go in and make that change. And in the Chef run here, you see halfway down, it's changing; it's actually going from 10 minutes to five minutes there, because it detects that the initial base setting we set for the corporate fleet was 10 minutes in this example, but this user decided they wanted five minutes. Another example is password policy, and again, it's a cross-platform recipe and cookbook within Chef that can run on all of our platforms, and individual users could potentially make changes to their own policy; in this instance, we would want it to go through some sort of security review as well. And again, applying the policy at the resource level is platform-specific, whereas defining the settings is platform-independent. So one of the things we have a serious problem with is running package operations. It's not so bad on Fedora; Ubuntu is way worse, because by default, the first time you log into a desktop, it starts running this update application that tries to refresh the repos and then tells you how many updates there are. But basically, by default, Chef encourages you to use its package resource to say, hey, I want this RPM to be installed. And imagine if you have 20 cookbooks and each cookbook has five recipes, and each of them says, I want to install something.
You have 100 package operations going on, and each of them can be blocked because, oh, someone is holding the yum lock, help. So we have a two-pronged approach. We found that, in the past, a lot of tool authors had over-optimistic assumptions about how useful their tool is. They'll say, hey, I'm going to install this tool on 20,000 Macs, or on all, say, 500 Fedora machines, and then we find that it's not actually being used that much. So we can move those to be installed on demand: we say, hey, we are going to stop installing this for you, and when the user actually invokes the tool, we say, oh, hang on a minute, let me install that for you. The other thing we do, instead of every recipe running its own package installation, is encourage people to use our API and tell us what they want installed. We have a batch job, configured with a systemd timer, that comes along every hour and says, oh, these are the packages that need to be installed; let me run a single transaction and make sure they're all installed. So here comes the exciting part: what we actually want and need help from the community on, and also want to contribute back on. I tried using the Fedora and Facebook colors for this slide; it's kind of hard because they are really, really similar colors. On the left are some of the Fedora initiatives, and on the right is what we care about, and as you can see, there's a lot of overlap. We run Workstation on our client fleet. We are looking at Silverblue to see whether it might be a good fit; it depends on what we can do with containers. One of the roadblocks is that there are a lot of things we manage at the system level, like certificates and monitoring tools, that might be a bit difficult to replicate. We want to get better at QA, both on our end and in helping Fedora with their CI initiatives.
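The batch approach described above, where recipes register what they need through an API and a single periodic job installs everything in one transaction, could be sketched like this in plain Ruby. The function names are hypothetical, and the dnf command is only built as a string here, not executed:

```ruby
# Sketch: collect package requests from many recipes, then build one
# dnf transaction instead of 100 separate package operations.
require 'set'

$wanted_packages = Set.new

# Recipes call this instead of declaring their own package resource.
def want_package(name)
  $wanted_packages << name
end

# Many cookbooks register what they need...
want_package('mercurial')
want_package('htop')
want_package('mercurial') # duplicate requests collapse in the Set

# ...and an hourly batch job (e.g. driven by a systemd timer) turns
# the whole set into a single transaction.
def install_command
  "dnf install -y #{$wanted_packages.sort.join(' ')}"
end

install_command # => "dnf install -y htop mercurial"
```

Because only one process issues the transaction, nothing else is fighting over the package manager lock, and a conflict fails (or is skipped) once, in one place, rather than in dozens of recipes.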
One of the problems we used to have in the past: we continuously test network installations of Fedora every week. In the past, we ran the installs with updates enabled, and what we found is that every Fedora release, there would be a post-release update that accidentally broke network installation. Sometimes we discovered it; sometimes we found that someone had already reported it. So we ended up switching to doing the installation without updates; at least that way we know that once it works, it will keep working, and then we apply security updates when we start configuring the machine. But that's a bit less than ideal. GNOME Software: there are a lot of feature requests we would like to make and maybe help implement. Server and CoreOS, we are probably not looking at those right now. And then there are the Facebook-specific issues: IPv6, which we can probably contribute fixes for, and how to deal with NVIDIA hardware, which is sort of in scope but out of scope, because I guess Fedora cannot really solve it. So, some tools we care about. NetworkManager: we have problems with it in some of our newer offices where we only deploy IPv6, and we find that NetworkManager doesn't actually finish setting up a connection, both Wi-Fi connections and VPN. If it can't get a DHCPv4 lease, it just gets confused and says you don't have a connection. We opened a bug a few months ago, but I guess there are not enough people working on it. GNOME Software: we have some issues with package management in general anyway. One is that, as I said earlier, GNOME Software will tell you there's a new release available, and we couldn't find a way to turn that off.
So for our user base that needs to stay on Fedora 29, the only solution we could find is to disable GNOME Software altogether, except then other issues happen: if you try to run a tool that's not installed, by default the shell will ask GNOME Software to install it, and that just fails, saying, oh, I don't know what provides this D-Bus service. Ideally, that would be tunable, and we could also keep GNOME Software managing Flatpaks and firmware updates, but not RPMs. Version locking: we use dnf-automatic to apply security updates, and we found that by default it doesn't honor the version locks we set with the versionlock plugin. So we have to configure Chef so that when someone says they want to lock a version, it's also excluded from dnf-automatic and doesn't get updated. And again, GNOME Software, because it has its own RPM back end, so we have PackageKit and DNF, and PackageKit doesn't understand version locking the same way DNF does. And GNOME Keyring: I was told by our security team that GNOME Keyring doesn't speak TLS 1.3, and a feature request has been open for a while. On the process side, we want to do better QA, both internally on the Facebook side and by participating in Fedora's QA process, both for package updates and for distribution upgrades. And this is kind of a reach, but ideally, if we could help push vendors like Lenovo to certify Fedora or CentOS out of the box for their machines, that would really, really help. Right now it seems that any vendor that says they support Linux means they have some special build. NVIDIA, I believe there are some Fedora developers helping NVIDIA improve their tooling, right? So we need to talk. So yeah, that's our presentation, and we'll just open up for questions. I'll repeat each question and then answer it.
The first question was whether we can dedicate resources to help Fedora on the desktop, and the answer is sort of, maybe. To paint a picture: we have a team of 12 people, and we manage three desktop OSes and two mobile operating systems. So realistically we have two or three people working full-time on Linux, and by full-time, I mean we also manage other things as well. But yeah, this is one of the things that's blocking us. Facebook internally sort of works as a community anyway, and sometimes people do things to scratch their own itch that are not really part of their team's main work, so we might be able to find some volunteers who also find this important and help fix things. But otherwise, yeah, we do have some manpower dedicated to this, but I would say it's probably on the order of a few hours a week, rather than someone working full-time on a tool. The next question was how we handle package operations if there are conflicts. We find that the Fedora default works for us: by default it will just fail, right? It will say, oh, I cannot satisfy this operation, and at least it doesn't leave the system in an inconsistent state. Oh, yeah, so apparently what DNF does is skip packages that cannot be installed but do everything else. So yeah, actually that works for us too, since we are trying to shift to a place where we run one batch job that installs everything we care about; it's preferable if as much as possible actually gets installed. So, do you mean whether we allow people to run non-English locales? Yeah, we have no policy on that, so I'm pretty sure users outside of the US set their machines to their own locales.
For our Linux users, I cannot give specific numbers, but most of them will be in the US, the UK, and Ireland, so it's probably going to be mostly English anyway. I would say high hundreds, and we hope to get into the thousands sometime soon. Oh, sorry, yes, the question was how big our user population is. The next question is whether we collaborate with our server team, and the answer is yes. Actually, of the two repositories we open sourced, the first one is managed by the server team and the second one is the one we manage, and we try to reuse as much of their tooling and cookbooks as possible. The next question is whether our users have root and whether they can run other desktops, and the answer is yes to both. We have some system monitoring tools running, but we also don't want to disallow people from doing whatever they want. So we have some people on KDE, and we have some people on i3; it's quite popular. One snag is that we make the assumption that people have GNOME installed, so the further people stray from that, the more they might have to do things for themselves. Yeah, everyone basically has root access. One of the weird breakages we had was when someone installed the Spotify snap on their desktop. I'm not sure what happened there, but somehow they ended up with a partition mounted over /etc, shadowing some of our files; that was fun. Yeah, we try to encourage Flatpak as much as possible, because it doesn't have that issue. So, what are some of the biggest issues we've had with Fedora on the desktop? One of them is people who really prefer CentOS or something like Ubuntu LTS, and who complain about having to upgrade at least once a year. And NVIDIA hardware support is another big issue. I find that it's mostly the same people in both situations.
So they find that when they upgrade from Fedora 29 to 30, for some reason Plymouth is broken for them, and it doesn't display the graphical splash screen for unlocking your hard disk, and then they say, oh, we cannot use Fedora 30, because we want that shiny thing. The problem is that to replicate their setup, we basically need to take one of their workstations and install the exact same configuration, and that's a lot of unpleasantness to do with binary packages. Part of the reason they cannot upgrade as soon as a Fedora release comes out is also that the NVIDIA drivers might not be ready. In the ballpark, I would say probably about one third; it's basically all our desktops. Oh, sorry, the question was what percentage of our fleet actually has NVIDIA hardware, and the answer is all the desktops. The next question is, do we work with NVIDIA to improve Fedora support? It's an ongoing process. We are trying to get them to at least certify CentOS on their server models. And yes, it's really desktops: actual workstations sitting under someone's desk. Oh, sorry, the question was, when I say desktop, do I mean an actual desktop or just a client machine? The next question is, what's the process for fixing an issue if a bad package or a bad configuration goes out? The answer is, we have monitoring, so we can see, hey, of all our Chef runs, what are the top errors at the moment? And most of the time, we can just fix it by making a commit, and eventually the next Chef run fixes it. Sometimes there are cases where something really bad goes wrong, like Chef stops working, and we have an out-of-band remediation that tries to fix Chef and run it manually. But in the worst case, well, that's why we have help desks: we can tell the user, hey, go to the help desk and they will help you.
The question is whether we have user support. It's probably similar to Ask Fedora: we have official support, so for platforms we actually support, we promise we will get to someone and answer their questions. We try to make sure they go to the help desk first, because that way it scales better. For people who use Ubuntu or Arch Linux, those are technically not supported, so we encourage the community to help each other. Good question. Sorry, yeah, the question was whether we support multi-user environments, and the answer is no. It's a bit simpler that way; I would say the average user probably has one point something machines. Some of our user population, and I keep asking for this as well, say, hey, we have expensive workstations, can we share them, please? But our security team configures things in a way that makes this hard to do. We want to make sure that if something goes wrong, we can trace it back to one person, and if there are multiple people on a machine, it's hard for any one person to be accountable. Good question. The question is whether, since NVIDIA doesn't really have a good track record of working with the Linux community, we are considering AMD. The answer is yes and no. Some of our teams, unfortunately, are wedded to CUDA, so it's going to be really hard to persuade them to switch. But on the desktop side, we are actually really excited by Lenovo bringing out ThinkPad T-series machines with AMD GPUs, so we might just start supporting those and see whether it works or not. The next question was from someone on this side. That's a good question, and we should have put it in the slides, actually. The question was, how do we do phased rollouts, and how do we make sure that if something breaks early, we don't continue the rollout? The answer is yes, we do, and it's up to the person doing the rollout.
So basically, you can configure when you want a package to go out and who actually gets it first. For our own updates, we normally dogfood them first; if it's a Linux update, we have an early adopter group, and we push it to them and get their feedback. After that, it's basically just a sharded rollout: we compute a hash from your machine ID, which gives you a shard from 0 to 99, and then we say, let's do 5% of the population, let's do 15%, and so on. The next question is whether we force updates, and how. We don't, which is why, as you can see from our adoption curve, it's kind of lagging. We nag people to update, not so much when a new version of Fedora comes out, because, hey, if you want to stay on 29, GNOME Software is already nagging you anyway. We nag people starting one month before a Fedora release goes end of life, and we start nagging more and more aggressively about two weeks before end of life. Yeah, if anyone has experience forcing people to update, we really want to hear from them. The next question is how people upgrade, and the answer is that we support both upgrade paths: some people upgrade from GNOME Software, some people upgrade with DNF. If people want to reimage with Fedora, because we have to bootstrap all our tooling, they need to be on a specially whitelisted network VLAN, so people try not to do that, because then they need to go to the help desk just to do it. So, I don't want to keep people away from lunch. Any other questions? All right, thank you.
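The shard-based rollout described in that answer, hashing a stable machine ID into a 0 to 99 shard and widening the percentage over time, can be sketched in a few lines of Ruby. The exact hash Facebook's tooling uses is not stated in the talk; MD5 here is only an illustrative assumption:

```ruby
# Minimal sketch of hash-based rollout sharding.
require 'digest'

# Map a stable machine ID onto a shard in 0..99. The shard never
# changes for a given machine, since it depends only on the ID.
def shard_for(machine_id)
  Digest::MD5.hexdigest(machine_id).to_i(16) % 100
end

# A machine is in the rollout once its shard falls below the current
# rollout percentage (5, then 15, then 100, and so on).
def in_rollout?(machine_id, percent)
  shard_for(machine_id) < percent
end

# Widening from 5% to 15% keeps every earlier machine included,
# because shard < 5 implies shard < 15; machines only ever join.
in_rollout?('machine-1234', 100) # => true at full rollout
```

The useful property is that rollout stages are monotonic: stopping at 5% when the monitoring shows breakage means the other 95% of machines were never touched.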