Welcome everyone, I'm Michel from Facebook's client platform engineering team. I go by salimma on the Fedora Account System and Michel Salim on IRC. We have Davide here as well from our operating system team, so if you have questions about how we manage our server fleet, or any questions about CentOS, Davide will be here to answer those. I'm not sure what's going on with the slides here; they were working half an hour ago. Anyway, in this talk we describe Facebook's Fedora client fleet and how it is well positioned for cross-functional collaboration, both with external upstreams like Fedora and with the internal teams at Facebook that contribute to upstream Linux. The agenda starts with the introduction, which we are in now. I'll describe what I mean when I say that we treat Fedora and our server team as upstreams. I'll go through a concrete example of how we are revamping the way we provision client systems to make it easier to contribute upstream, some of the changes in Fedora 33 that we are dogfooding right now on top of Fedora 32, and some upcoming projects that we want to work on with upstream.

I've been a Linux user since 1998. I wanted to start a few years earlier, but I didn't have my own computer. I've been contributing to Fedora since 2005, mostly doing package maintenance, but it's only in the past two years that I actually get paid to work on Fedora at Facebook, which is really, really cool. It's my second team here. On my first team at Facebook, I actually managed this. So yeah, scary thing: mobile phones in data centers. These are used in our CI system so we can test mobile apps and find bugs and performance regressions. They used to be a lot harder to maintain than they are now, because phones were not designed to be fully automated; they assume there's a person tapping on them, and in this case there isn't. So we plug them into our automated recovery system.
But in the case of phones, a lot of the outages basically involve paging an operator to come and fix things. And now I manage... well, no, I don't manage my cats, because they cannot be managed. Some of you might have seen Merlin, the fluffy one, in the social hours. He's banished right now because he likes to roll around on my keyboard too much, and that might not be good while I'm presenting. So yeah, I'm on the client platform engineering team with Jim, who is in the chat, and a bunch of other people, although only about three of us actually have expertise in Linux. As Jim said, most of our fleet is macOS and then Windows. I might be a masochist, you know, picking mobile phones and then Linux, but hey, someone has to do it. It's fun.

Here's another view of our desktop Linux fleet: we have on the order of 1,000 laptops and desktops. A few years ago we switched over from mostly running Ubuntu to mostly running Fedora. The reason for that is that our production fleet in the data centers runs CentOS, and picking a more similar distribution makes it easier to share things like how we build internal packages and how we actually manage the systems, by reusing the same cookbooks. The fleet is mostly Lenovo: ThinkPads for laptops and ThinkStations for desktops. We are also looking at using CentOS for desktop use, and there are some reasons for that. Some of the teams, especially the ones on desktops, prefer the stability of CentOS, in particular having a kernel that gets backported changes instead of a new kernel every two months or so like Fedora, especially if they have performance-sensitive workloads. There's also the case that some of these teams use CUDA, and therefore they need the binary NVIDIA driver, which works on Fedora but is not really supported; if it breaks, NVIDIA will not do anything to fix it.
So it might be better to put them on CentOS anyway. Now, what do I mean when I say we consider Fedora and the server fleet as upstreams? For Fedora, we mostly use it as-is and try not to customize too many things. That means when we have issues we can report them upstream, we can work to actually fix them, and we can work on upcoming changes. We provision using kickstarts; we'll go over this in more detail in a bit. The reason we particularly need kickstart is that on Linux we currently have a requirement for full disk encryption, and with LUKS you basically have to encrypt at install time; you cannot just bolt encryption onto an existing disk. And it's a super bad experience if, after someone went to the trouble of installing Linux, we tell them, "Hey, sorry, your machine is out of compliance. You have to reprovision." Dogfooding just means internally using a change that isn't actually released yet, the idea being that if you find bugs, you get broken by the bug instead of your users.

The production fleet is mostly CentOS 7 right now, migrating to CentOS Stream, the idea being that we can catch errors and contribute fixes before they make it into the stable EL release. It's very slightly modified, and most of the modifications are shared upstream; we have a lot of changes publicly in the rpm-backports repo. Of main interest, the kernel and systemd are much more up to date than on stock CentOS 7. Facebook has a kernel team that actively works upstream on new kernel features, and we also contribute to systemd and basically track upstream systemd changes on the server fleet. They will be mostly on CentOS 8 Stream by Q1 next year. One thing that we want to borrow for the desktop fleet, which is already live on the server fleet, is the critical resource control work they do on top of cgroup v2.
There's a link there to fbtax2, which is the internal code name for it. It's in the process of being upstreamed into systemd, so if you've heard of systemd-oomd, that's based on what's currently live internally at Facebook. And yeah, the production fleet is managed with Chef, the same way we manage the desktop fleet. We don't share all the cookbooks, but we are trying to converge on using the same cookbooks everywhere. (Yeah, Jim, I wouldn't advise actually eating dog food.)

Collaboration. So there's a lot of room for collaboration with Fedora on testing upcoming changes, and on reporting and helping fix issues that we find in the upstream components we also use on the production fleet. Some of the changes, like resource control, go to the server fleet first, and it would be nice to validate them on the desktop fleet before they make it into, say, Fedora or CentOS Stream. Another consideration is that CentOS, like enterprise Linux in general, only has a subset of packages from Fedora: the packages that Red Hat is committed to maintaining. If you want anything else, you need EPEL, Extra Packages for Enterprise Linux, and we do use some EPEL packages on the server fleet and on the developer VMs. The way EPEL works is that whenever there's a new major EL release, EPEL packages don't get automatically branched, because there's no guarantee that anyone actually wants a given package on the new release. So there's room for collaboration here: since we actually need some of these packages for our workflows, we should probably at least comment on them to make sure they actually get branched and maintained. Going back to collaboration, there's one other really cool thing that I discovered this morning in the CI talk.
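To give a rough flavor of what this cgroup v2 resource control looks like in practice (fbtax2 itself is internal; the slice name and all the numbers below are made up for illustration), systemd exposes the cgroup v2 controllers through unit settings like these:

```ini
# /etc/systemd/system/workload.slice -- hypothetical slice name,
# purely illustrative values; these are the kinds of cgroup v2
# knobs systemd exposes.
[Unit]
Description=Example slice for resource-controlled workloads

[Slice]
# Protect up to 4G of this slice's working set from memory reclaim.
MemoryLow=4G
# Relative CPU share versus sibling slices (default is 100).
CPUWeight=200
# Relative IO share; requires the cgroup v2 io controller.
IOWeight=200
```

Services can then be placed under it with `Slice=workload.slice` in their unit files, so protections apply to the whole group rather than per-process.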
Apparently it's now possible to run your own CI server and contribute the results to the Fedora CI system, which would be really cool, because we do have some non-standard setups, and if we can automatically report regressions, that will help prevent our users from being broken when an update gets pushed. Some organizational changes we are thinking of making: we already have some people at Facebook who are Fedora contributors, but they're not really organized at the moment. It'd be nice to have an on-call rotation in charge of packaging, so we can say, "Hey, we need this package in EPEL but it's not there right now; could someone help either get it maintained, or get it branched and built by the maintainer?" We also need to involve our users more in testing, especially the ones that have non-standard hardware. For example, when the 5.7 kernel came out, our MSI users got broken because their Wi-Fi PCI device ID was not recognized anymore after the driver got refactored. That was fun. If they had been participating in the test day, they would have noticed, and it would probably have been fixed upstream, or Fedora wouldn't have released the 5.7 update. I think it's finally fixed in 5.7.9. And there are some changes that we are working on upstream with other Facebook teams, like Btrfs. I'll try to go a bit quicker, since otherwise we won't get to the demo.

One example of how we are refactoring the way we work to more closely track upstream is provisioning. Traditionally we provisioned with PXE network booting. We later switched to iPXE, because trying to configure all our offices to make sure PXE works in all of them is a pain, especially as we move to IPv6-only offices. The downside of using iPXE is that we cannot Secure Boot, because the iPXE image is not signed. It also assumes you have access to the internal network.
It also makes it hard to actually test kickstart changes, because the kickstart files live on the server. And whenever something breaks, it's really hard to tell upstream, "Hey, it breaks, and we can give you logs, but we cannot really share our kickstart because it has internal stuff in it." And then COVID hits, everyone's working from home, and how do you provision at home, right? So this is the system we are moving to. It's modular. We use ksflatten and ksvalidator from pykickstart, so you can easily add snippets. If we want to report something upstream, we can just take out the snippet that contains our internal config and everything is fine. We then inject a script that our users run, which prompts them to authenticate so they can join the internal network. We use lorax to inject our kickstart into the standard netinstall ISO, so we get a signed bootloader and can keep Secure Boot enabled. It's brilliant. And then we can just boot that ISO in VMs if we want to do testing.

So, demo time. Let's hope this actually works. I'm going to speed it up, since provisioning sadly takes about half an hour on a gigabit network. As with a normal kickstart, it's a bit slow, but most things are automated. We found some bugs that I'll go over later. This is at 30 times normal speed, so I'll just talk over it a bit. As I mentioned, we have some users who actually require NVIDIA as part of their workflows. The post-install section of this kickstart is going to set up RPM Fusion, so by the time the user gets their machine and boots it for the first time, NVIDIA is already enabled. I'll try to pause when we get there. Oh, Neil, interesting. Yeah, I don't want to mess around with the repo section too much. Basically, right now this thing auto-detects, and it will not actually enable RPM Fusion unless you have an NVIDIA graphics card.
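To give a feel for the modular layout (the file and snippet names here are invented for illustration, not our internal ones), the top-level kickstart is basically a list of `%include`s that `ksflatten` resolves into a single shareable file:

```text
# main.ks -- hypothetical top-level kickstart
%include snippets/base.ks
%include snippets/partitioning.ks
%include snippets/internal.ks    # the snippet we strip before sharing upstream

# snippets/partitioning.ks -- LUKS has to happen at install time,
# which is the whole reason we kickstart:
zerombr
clearpart --all --initlabel
autopart --type=btrfs --encrypted
```

The pipeline is then roughly `ksflatten -c main.ks -o flat.ks` followed by `ksvalidator flat.ks`, and something along the lines of lorax's `mkksiso flat.ks netinstall.iso custom.iso` to bake the flattened kickstart into the signed netinstall image (check your lorax version for the exact `mkksiso` invocation).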
Or rather, I made it so that if you're on a VM, it will also install NVIDIA, just so it's easier to test changes. Speaking of changes: the kickstart you just saw actually installs Btrfs instead of ext4 like before, because we are dogfooding changes that are going to go into Fedora 33. It uses LUKS because, as Joseph was saying, we don't have native Btrfs encryption yet. Before all these changes, we used the default Workstation layout: LVM, with separate root and home partitions. We ran into the same issue that caused Fedora to want to switch to Btrfs, namely that root tends to run out of space. So, just before moving to Btrfs, we switched to saying, screw it, let's use a single unified root-and-home partition, which means you cannot reimage without backing up your home directory. We still use LUKS until we have a different encryption option, so we are not really in a hurry to tell people, "Hey, please wipe your ext4 installation and use this shiny Btrfs"; you can if you want. But in the near future we'll probably ask them to reimage anyway, if we can move away from LUKS. The reason we don't really like LUKS that much, I'll go over in the next slide. We also dogfood swap-on-zram, and that way we can get away from having a separate swap partition. We will probably add a swap file if we find a need for it.

So, some pain points. If you use the default Btrfs layout in Fedora and enable encryption, you end up with separate root and swap partitions, and they start out using the same encryption key; but the problem is that if you ever change one of the keys, you now have to type in two passwords to actually unlock your disks. Between those two keys and the user account, you suddenly have at least three passwords, before taking into account any other password you might have to use.
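For reference, the Fedora 33 swap-on-zram change is driven by zram-generator, configured with a small ini file; the values below are illustrative, and the exact key names depend on the zram-generator version you have:

```ini
# /etc/systemd/zram-generator.conf -- illustrative values
[zram0]
# Size the compressed swap device at half of RAM...
zram-fraction = 0.5
# ...capped at 4 GiB (value is in MiB).
max-zram-size = 4096
```

systemd then sets up `/dev/zram0` as swap at boot, with no swap partition on disk, which is what lets us drop the second LUKS volume.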
Another pain point in automating this is that Anaconda doesn't have a way right now to say, "I want to install on the first non-removable drive, and please don't touch anything else." I had to implement a slightly kludgy workaround, which ksflatten doesn't like, so basically I had to ksflatten the kickstart and then inject this other script behind it. Some bugs that I'm going to report upstream: if you want to avoid hard-coding the initial LUKS passphrase, that works in the text-mode installer but not in graphical mode. And if you want to install over Wi-Fi, that works in graphical mode but not in text mode. I'll get to the question about using a swap file in a moment: right now we use swap-on-zram in this new kickstart. Most of our laptops have reasonably beefy hardware, so in our initial testing we don't really need a swap file; we'll manage one with Chef if we find we need it.

So basically, I'm not going to dogfood the nano-as-default-editor change; we'll just get it when Fedora 33 comes. Considering that that discussion was way spicier than the discussion about adopting Btrfs, we don't really want to go there. I'm not sure which other changes might be relevant to our use case, but ping me if you think your change would make sense to get wider testing. And yeah, I mentioned earlier that we are probably going to work with some other internal teams at Facebook to test systemd-oomd before it enters upstream systemd. It's a really good time to start working on resource control on the desktop fleet. We already ship it early on the server side, and with Btrfs becoming the default file system, we don't suffer from the ext4 priority-inversion problem, meaning we can actually constrain some operations without a low-priority task's disk access blocking a higher-priority one.
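The drive-selection workaround I mentioned can be sketched like this (a simplified stand-in for our internal `%pre` script, not the real thing; the sysfs root is parameterized so the logic can be exercised outside an installer):

```shell
# Pick the first non-removable, non-virtual block device, so Anaconda
# can be told to only touch that drive.
pick_install_drive() {
  sysblock="${1:-/sys/block}"
  for dev in "$sysblock"/*; do
    name=$(basename "$dev")
    # Skip virtual and optical devices (some report removable=0 anyway).
    case "$name" in
      loop*|ram*|zram*|sr*|fd*) continue ;;
    esac
    [ -e "$dev/removable" ] || continue
    if [ "$(cat "$dev/removable")" = "0" ]; then
      echo "$name"
      return 0
    fi
  done
  return 1
}

# In the real %pre this would emit an include file consumed by the
# main kickstart, e.g.:
#   echo "ignoredisk --only-use=$(pick_install_drive)" > /tmp/drive.ks
# with a matching "%include /tmp/drive.ks" in the flattened kickstart.
```

This is exactly the part ksflatten can't resolve statically, since the include file only exists at install time, hence the flatten-then-inject dance.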
One really cool thing that we might want to test really soon is fapolicyd. It's a daemon that someone at Red Hat is working on. We deploy something similar on our macOS fleet, called Santa. The idea is that you can restrict execution of binaries to only those that are considered safe: for instance, binaries that come from RPMs, or binaries that are in the trust database. There are some features that I think it doesn't have right now, so we cannot say "only trust certain RPM signing keys, but don't trust RPMs the user installs themselves." And we currently don't have a way to manage this fleet-wide yet, but it's coming.

Another cool Red Hat project that we want to try is Fleet Commander. The idea there is that you can deploy configuration profiles, instead of using Chef to actually lay them down, which might make more sense for settings that are a bit trickier to manage with Chef. One snag is that the upstream implementation ties it heavily to either AD or FreeIPA, which we don't want to use, because we want it to be easy to manage our client fleet while the machines are roaming about, not inside the internal network. We already use, and help maintain, MicroMDM, which is a device management framework normally used for macOS and iOS. So if we go with this, we might write a backend for the Fleet Commander client that makes it talk to MicroMDM.

So yeah, in conclusion, we have two minutes. The way we manage our desktop Linux fleet increasingly involves cross-functional collaboration with other teams, whether in Fedora or at Facebook, and we hope to collaborate more with community members, or with other companies that have similar needs in managing their fleets. When you get the slides, there are some resources here: the first two are talks that Davide gave about CentOS.
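To make the fapolicyd model concrete (the syntax here is approximate and hand-written for illustration; check the fapolicyd rules documentation for the exact grammar), an "execute only trusted binaries" policy looks roughly like:

```text
# Allow executing anything in the trust database (RPM-installed
# files plus an admin-maintained file list)...
allow perm=execute all : trust=1
# ...and audit-deny execution of everything else.
deny_audit perm=execute all : all
```

This is the same allow-by-trust model Santa gives us on macOS, which is why fapolicyd looks like a natural Linux counterpart for us.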
The next one is a talk I gave last year at Flock, and then there are our management cookbooks and the kickstarts used for the demo. So yeah, sorry for only leaving one minute for questions. We can go over a bit on whichever topics people want to get to. And you can scan this QR code to get to the presentation, or use the link to the slides pinned in the session chat.