Hello everyone. My name is Davide. I'm a production engineer at Meta on the Linux Userspace team. This talk is about what we're doing with ELN and how we're using it to test upgrades on the production fleet at Meta. The agenda for today will start with a quick introduction about what ELN is and how it's made. We'll talk about what we're specifically doing with it at Meta, what we got out of it, and how you can get involved and maybe get some value out of it as well. So, without further ado, ELN stands, kind of, for Enterprise Linux Next. There is an explanation of the puns in the name on the website that I will not try to repeat. It is a continuous rebuild of Rawhide using the CentOS macros and toolchain. So, the idea is that every day we take Fedora Rawhide as it is today, we rebuild it using the same macros that would be used if we were cutting a new CentOS Stream release today, and we put the compose out. So, you can take it and use it, and you get a preview of what CentOS Stream would look like if we were starting today. It is part of the development process for Stream and for RHEL, and it is used for this ongoing process. That's the link to the documentation. I'll add the slides on the website later, so you don't need to worry about the links. So, here's how the sausage is made, more or less. Development work happens in Rawhide all the time. Rawhide is continuously rebuilt into ELN. ELN is used to test how these changes will look in the future CentOS Stream release. At some point next month-ish, I believe, the branching for Stream 10 will occur, and at that point we will have a new set of branches and a new set of repos and everything, and that will be Stream 10. Then Stream 10 will be stabilized, and eventually we'll have RHEL 10, and so on. By the way, if you're wondering about the colors, this slide comes from an older presentation. I made those boxes orange because those are all things you can actually contribute to and work on if you would like.
You can't contribute to RHEL unless you actually work at Red Hat, which I do not. So, how this actually works is that the composes are made using something called ODCS, the On Demand Compose Service, which is a thing that is able to make distribution composes and put them online so people can consume them. The package set itself is defined in Content Resolver, which was an offshoot of the Fedora minimization project and is still developed. If you go on tiny.distro.builders, you can see the actual package set and play with it and slice and dice it. That's a picture I took yesterday that gives you an idea. You can also see the ebbs and flows in this. As we get closer to the branch point, you can notice this is getting lower, because I guess folks are realizing they don't really want to maintain stuff long term for 10 years, so they're chopping it out, which seems reasonable to me. Now, this is just for ELN itself, which is what will end up in CentOS Stream and RHEL. Most people, I would say, do not run just that. They need additional packages that they generally get from EPEL. To test those, we came up with something called ELN Extras. The idea behind ELN Extras is an additional set of packages that are composed together with ELN, but that are not going to be part of ELN itself. It's still kind of in flux how we're going to do this, but the general idea is that we will use this set of packages to bootstrap EPEL 10, so we can get a head start there and try to make life a bit easier, both for users and for packagers. This is also defined in Content Resolver. You can look it up there. You can actually contribute your own definitions here, so if you maintain packages in EPEL and would like to get started or have them ready for EPEL 10, you can put up a PR against that GitHub project. There's a little YAML file you have to fill in, and your packages will end up getting continuously rebuilt in ELN.
Now, of course, this means you sign up for maintaining them, so if they break, you will need to fix them, but you'd need to fix them eventually anyway. Again, that's the package set. This work is coordinated by the ELN SIG, which is a Fedora SIG. We hang out on Matrix and on IRC. While ELN is primarily a thing that happens on Fedora and Red Hat infrastructure and uses a lot of internal components, work on the project is by no means restricted to folks that work at Red Hat. You're welcome to join and hang out there. There are regular meetings on Fridays where it's discussed. These meetings are also good to hang out in if you're curious about what's going to happen in Stream 10 down the road. I'll talk about this later, but this has been very useful, for example, to get an idea of features that might be coming down the road that you'd want to know about if you actually run this in production. And here's a link with details about the SIG itself. So much for the intro. Let's talk about what we're doing with this at Meta. Before I do that, a quick primer on the infra at Meta. You might have heard Meta has a lot of machines. We have millions of servers. These are all physical machines running in data centers across the world. They all run CentOS Stream right now. We've always run CentOS. We started with CentOS Linux. Well, I started in 2012, and we were running CentOS Linux 5 back then. We went through several major release transitions: from 5 to 6, 6 to 7, Linux 7 to Stream 8, and we just about finished Stream 8 to Stream 9. I was very happy to discover last month that we finally got rid of the last CentOS Linux 7 host, so we don't have those in production anymore. So that's nice. So I could put 2022 on that line. I expect we'll probably have a long tail of 8 stuff, but we should be mostly done with it by the end of the year. Right now, we're at 86% on 9. There are still some containers on 8.
Actually, 8 is the majority on containers, but containers are comparatively easier, I would say. And for context, whenever we do these migrations, we reprovision the whole fleet. We don't do in-place updates for production systems. So this means wiping the machine and reinstalling, which is fairly expensive, as you might imagine. Now, our customers don't love these upgrades, because they're kind of expensive: you're wiping their machines, and they have to spend time qualifying the distribution. Usually, the way this has worked in the past is that the new CentOS Stream release drops, we start working on it, we put it out, we play with it, we find some major issue that delays everything by six months, because that's how it works. It's absolutely normal. When you have a large deployment, it's pretty common to end up having to deal with these things. With 9 in particular, the SHA-1 deprecation ended up being a fairly sizable annoyance, because we had a lot of packages that predated it that had to be rebuilt. We also had a lot of packages coming from external vendors with, shall we say, questionable packaging practices that required a lot of work. So in general, we had been looking at ways to make this process better and more streamlined. When we did 9, I actually started working on it from the very first beta that came out, and that helped quite a bit, because we could get a head start, basically since branching time. But you still get surprises, because while if you follow Fedora you kind of know what will probably end up in there, since you know where it's branching from, it's hard to tell until it drops. The way we do this better is if we could start even earlier, and that's where ELN comes in for us. The idea with ELN is using it to streamline the bring-up of major OS releases.
Turn this from a thing that we have to do at qualification time, when a new major release drops, into a thing that we do all the time. So we can continuously validate what's going to come down the pipe and find issues as they come up. When something comes up, we can either figure out internally if it's something we fucked up, and fix it, or if it's something that we have to discuss with the community, because maybe there's a better solution that can help everyone. It also allows us to identify policy changes early on: things that will come in the distribution that might impact us, things that have been deprecated, new things that are coming up. And because this is a continuous effort, and because we started long before, it's a much more pleasant experience for customers, because instead of "oh, all my packages don't install anymore, and I have a month to fix it", now it's "all my packages don't install anymore, but this isn't a production system; maybe I have a year to fix it". It's much nicer. People are a lot happier when you tell them they have a year to fix it, and they can sort it out at their own leisure. And longer term, we'd like to have an actual CI platform where we continuously roll out ELN on production systems, for some value of production, so that every customer with different workflows can get a sense of what's coming down the pipe and do this validation at their own pace. Now, to do this, we started actually bringing up ELN in our infrastructure, and bringing up ELN is about the same effort as bringing up a new CentOS Stream release. So this was broken down into getting the repos, hooking them up into our update pipeline, adding them to our config management system, which is Chef, actually building provisioning media and provisioning machines, and then deploying it and qualifying it. And pretty much every step here involved fixing bugs, mostly on our end.
Starting with the repos: the repos come from ODCS, as I mentioned. There's a daily-ish snapshot, or weekly-ish, depending on when the compose shows up there. We didn't particularly want to scrape that with wget -r, because that seemed kind of bad. So we worked with the infra folks at Fedora to get an rsync endpoint exposed from there, so we could mirror it using rsync. And now we have this exposed on our public mirror, so if you want to access it and you don't want to hammer the Fedora servers, you can use the mirror on facebook.net. I believe we sync that daily-ish, so it should be reasonably up to date. But we don't consume that directly internally. We actually snapshot it via another process, which I'll talk about in a minute, called rolling OS updates. Now, this is only for the base repos, but of course there's also EPEL. For EPEL, we use the ELN Extras that I mentioned before. We started populating two workloads in ELN Extras: one for packages that we maintain as part of the CentOS Hyperscale SIG, which is where we do most of our upstream work within the CentOS project, and another workload for packages that are really specific to Meta, which we felt would be easier to keep track of all in one place. You can look this up; I put the screenshot here so you can see roughly the volume. The giant jump was when I started testing development systems, which have a lot of random packages on them, so I ended up adding a few dozen there. That's what this looks like, and these are all packages that we'll then be maintaining in EPEL 10, or at least assisting in maintaining, because we're not the primary maintainers on all of these, obviously. Now, once we have the repos, we have to hook them up into our update pipeline. I won't talk about rolling OS updates in detail here.
I talked about this in other talks, and it's not terribly interesting, because it's fairly internal to Meta, but the idea is that for CentOS Stream, we snapshot the production repos of Stream every two weeks, and we roll them out on the fleet across two weeks. We do this via Chef: effectively, Chef will update the dnf.conf on the machine and run a dnf upgrade, and it just works, more or less. When it doesn't work, people get tickets and fix it. In the case of ELN, this process is much easier, to some extent, because we just made a release train called ELN, which is continuously updated. We snapshotted it every day initially, because that made it easy to iterate very quickly, and later every week. We don't really care about in-place updates for ELN, because what we really care about testing is provisioning and the initial bring-up. We don't particularly care about testing updates from yesterday's ELN to today's ELN, so we just turned off the tests for that and automatically promote all the snapshots, and this basically works. The only caveat here is that ELN composes can be broken, because all composes are published regardless of state, so we need to filter for only the composes that are tagged as finished. Otherwise, you end up trying to roll out something that is missing three quarters of the distribution. Also, ELN Extras workloads can fail. They can fail, for example, if you add a package that doesn't build. When a workload fails, the entire workload is excised from the compose, which is obviously bad. Right now, we just keep an eye on it, but I need to actually write monitoring, so we do the same thing and validate them before pulling them in. Okay, now we have the repos. Next, we need to do the config management side. The config management side is fairly straightforward, with one caveat. ELN is odd in that it really is Fedora. It identifies itself as Fedora: if you cat /etc/os-release, the ID really is fedora.
The only way you can tell it's ELN is that it has VARIANT_ID set to eln. However, everything else about it is CentOS, and we really want to validate it as if it were CentOS. What we ended up doing was something horrible, which was adding logic to detect ELN based on the variant ID and then monkey-patching things so that it actually looked like CentOS. In our Chef code, we have methods that ask "is this CentOS?", and those return true on ELN. Do not do this. This is a terrible idea. We did it because it worked for what we were trying to do, but it's not something I would recommend in general. After that, it was just the usual loop of testing it on a machine and iterating: finding there are all these packages missing and adding them to ELN Extras, or that some logic hardcodes CentOS 9 but we can just flip the gate and make it handle 10, things like that. Our open source cookbooks are there. There isn't much that's ELN-specific in there, but I put the link in case you're interested. If you use Puppet or Ansible or anything else, I expect it would be roughly similar. Frankly, I expected this to be a major pain, and it turned out to be very straightforward. Now we have config management; next we want to do provisioning. Provisioning at Meta is complicated. There's a circular dependency where you need Chef to have provisioning working, and vice versa, because the provisioning images are built with Chef, but you need machines to be able to test that Chef is working. To sidestep the issue, we did another horrible thing, which was converting existing CentOS Stream 9 hosts in place to ELN with dnf distro-sync. To my surprise, this actually works. Not only does it work, but it produces systems that still boot. Initially, I expected to do this and then throw the machines away, but they still booted afterwards, with minor, minor caveats. And I would not do this in production.
It is a terrible idea, but it worked very well for development, because I was able to take development systems, convert them in place, iterate quickly on the Chef stuff, and then later, when the Chef stuff was sorted out, get the provisioning images going, and then we could provision machines from scratch. The provisioning system uses an internal thing that is not particularly useful outside of Meta, but basically there are just images that get dropped onto machines, as you might expect. The images are continuously delivered and tested: the provisioning system will build the image, try to install a bunch of machines with it, and then tag it as good or not good. Right now, we're at the point where we're effectively at feature parity with 9. We were actually able to refactor quite a bit of this logic to make it saner and better support future distributions. After that, it's just a matter of testing. We have a small set of machines we call kernel test that we use for kernel and hardware enablement work, which is very handy for these things, because they don't run anything: you just have to get the base OS working. So we got that going, and that took care of most of the basic Chef stuff. Then we switched to doing the development servers. The development servers are great because nobody cares if your own development server is broken. There are no services running on it except your editor. But they have a ton of packages on them, because they have to support all possible use cases, from building the Android operating system to doing God knows what. So these were very useful for finding packages that were missing that we needed to add to ELN Extras and didn't quite know about. Also, we were able to work with the folks that maintain these systems to discuss: hey, this thing you're doing here, maybe you should rethink it, because it will probably break in the future. So this is kind of where we're at now.
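As an aside, the VARIANT_ID detection described above is easy to sketch. This is a hypothetical illustration in Python, not our actual Chef logic (which is Ruby), and the "treat ELN as CentOS" helper mirrors the monkey-patching hack I just warned you against:

```python
# Sketch: tell ELN apart from regular Fedora via /etc/os-release.
# ELN reports ID=fedora; the only reliable tell is VARIANT_ID=eln.

def parse_os_release(text: str) -> dict:
    """Parse os-release key=value lines, stripping optional quotes."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key] = value.strip().strip('"')
    return info


def is_eln(os_release: dict) -> bool:
    """ELN looks exactly like Fedora except for VARIANT_ID."""
    return (os_release.get("ID") == "fedora"
            and os_release.get("VARIANT_ID") == "eln")


def effective_platform(os_release: dict) -> str:
    """Pretend ELN is CentOS for validation purposes.

    This reproduces the 'horrible' trick from the talk; as noted
    there, do not actually do this in your config management.
    """
    if is_eln(os_release):
        return "centos"
    return os_release.get("ID", "unknown")
```

In practice you would feed it the contents of /etc/os-release; on an ELN host `is_eln` comes back true even though every other field says Fedora.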
Right now, I'm working on expanding this to the container platform systems, which is the one I was actually interested in targeting from the start, because that's where most of production actually runs. Those will give us a good measure of whether this will actually work long term. And the container system has extensive unit tests that should be able to tell us very quickly when things break. So let's talk about what we got out of this. It took about four months to go from nothing to having provisioning working. This isn't four months of heads-down work; it's four months of spending time on this while also doing other work. I suspect if you were speedrunning it, you could probably do it in half the time. We now have some dozens of systems running on ELN. I can't say exact numbers for reasons, but it's not a huge amount. I expect this will expand to probably low thousands at some point. I don't think we'll ever have a large deployment of this, because it doesn't really make sense, and because these machines will effectively not be doing useful production work. We have started conversations about wider deployments, because at this stage what I would like to see is different product owners at Meta starting to try this out and evaluate it, so, say, the database folks can use it to help qualification for the databases in the future, and so on and so forth. We've also begun preparing for Stream 10, because the work we were able to do with ELN made us aware of changes that might be coming in 10 that we should start thinking about. We're also able to start testing things like DNF5: because ELN is built from Fedora, it has DNF5, so we can actually install it, play with it, test it, and start integrating it. Changes that have been proposed and discussed as part of ELN can also be evaluated and discussed internally. There are discussions ongoing, if you look at the meeting minutes for previous ELN meetings.
There were discussions about, for example, dropping 32-bit packages. That's something I was very happy to know about ahead of time, because I can start talking to the people that use them. Then, of course, we found bugs and we fixed bugs. I will give you a few examples. The very first thing we found: when I brought a machine up, immediately after converting it, I ran rpm -qa, and I had a shit ton of errors. They were all these weird "invalid OpenPGP signature" errors. You might know that RPM 4.18 ships with Sequoia, which swapped the entire OpenPGP implementation with one that actually works. Notably, it validates packages and makes sure the signatures are actually compliant. It turns out our internal signing service was using a Go library, and that library was generating invalid signatures by design. Literally, if you read through the comments, it says: we know this is broken, we don't care. That wasn't ideal. Luckily, we found out that the Proton Mail folks had a fork of this library that was actually compliant. We swapped it in. That fixed our signing service, but, of course, we had all of the existing packages that now had to be fixed. We had to wait for a few mass rebuild cycles. We still have a long tail of these, because not everything is auto-rebuilt, but it's at the point where it's manageable, at least. If you have your own signing implementation, I highly recommend making sure this is not happening. Also, maybe don't use Go. Now, this isn't all the fun we had with Sequoia. After we had the packages fixed, we installed a system from scratch, and we saw our key wasn't importing. Now, you might remember I mentioned that in Stream 9 the policy disallows SHA-1. It turns out RPM never used to validate the key at all. When you imported the signing key on 9, it imported just fine, even if it was signed with SHA-1. Probably even if it was signed with MD5, I suspect. This is a problem. It turns out the key was not actually valid.
We were using a very old... well, it wasn't an old key. It was a key that was kept updated, but it kept being signed with SHA-1, for reasons. We fixed that, and then it was fine. There's actually a very handy tool that I got packaged into EPEL for this, which was already in Fedora and is part of Sequoia, called the keyring linter. You can feed it a key, and it will tell you in human-readable form what is wrong with it. It also has a fix option: if you have the private key, and it's an actual private key, not in an HSM or something, it can just fix it and spit out a fixed key. That can be handy if it's your own key. If you have an HSM, you will need to work with your security folks on how to fix this. You may want to talk to your vendors about this too. I audited ours: I would say at least 50% of the repos we were consuming were invalid, and most of these folks, when I talked to them, had no idea what I was talking about, and it took quite a bit of effort to make people aware that this was actually an issue and they should do something about it. Also, if you're re-signing your key, make sure you don't actually invalidate it, because otherwise upgrades will be very painful for users, which actually happened with at least one vendor, from what I can tell. Okay, enough about RPM. The other fun thing was util-linux. ELN ships with the very latest util-linux 2.39, which has a lot of fun features. It also had a couple of regressions that we managed to hit. One was easy: they changed the option parsing for nsenter. nsenter has some bespoke logic for option parsing, and that changed slightly: you have to pass --option=value instead of just --option value. Okay, we fixed it. That wasn't a big deal, I would say, and it was also easy to work around. The more fun one was the new mount API. There's a blog post that explains the mount API.
We discovered this when I started provisioning machines and they never came up. I would go on the console and see a slew of errors coming from systemd. And then, when I finally managed to find where the console password was, I would get in and see that the root filesystem was read-only. If you passed rw on boot, it would work fine, but for the life of me, I couldn't get it to remount, and if you tried to remount manually, you'd just get an error back. Turns out, there is a bug in util-linux where, if you're running an old kernel, it doesn't quite detect that the new mount API is broken, so it should fall back to the old one. Funnily enough, I was talking to Adam the other day, and this was actually independently discovered in Rawhide, thanks to OpenQA. But we don't run OpenQA for ELN, so on ELN this slipped through. This is fixed in util-linux, by the way. I think I put up a PR to backport it in Fedora. I don't know if that got merged, but I'm sure it will get sorted out quickly there anyway. But this is a good example of something that we were able to catch quickly and get sorted out in a way that is hopefully beneficial for everyone. There are also a couple of other things I wanted to mention. By the nature of how ELN is built, it is 98% like CentOS, but not exactly like CentOS. So, first of all, sometimes people just plain fuck up, and it happens. In this case, we found that auditd was installed using the Fedora rule set and not the RHEL rule set, because there just wasn't any logic in the package for this. I guess when it was branched for 9, the maintainer probably branched it and then just updated the CentOS side of it and never backported the changes. So that was easy: put up a PR to fix the conditional. We only noticed it because in Chef we disable auditd, and the way you disable it is different depending on whether it's the Fedora or the CentOS rule set. The more interesting one was systemd.
So, you might know that systemd has this thing called presets, which defines what services get enabled on first boot, or when you install a package. It says, when you're installing this package, whether the service should be enabled or disabled by default. The presets from Fedora and the presets from CentOS are very different, because the use cases for these distributions are different. ELN uses the Fedora presets. So you might want to keep an eye on this if you deploy it. The way we found this out is that Fedora explicitly disables systemd-networkd, which we want to explicitly enable instead, and we never actually did that: because there was no preset for it, it would get enabled by default in our case. So that's something that caught us by surprise. We've discussed this with the SIG, and there's not quite clarity right now on how this is going to be handled on the CentOS Stream side after branching. I expect these will probably keep deviating, because there's not really a good way to manage them. But that's something else you should probably be aware of. All right. So what can you do with this? Well, as I said, CentOS Stream 10 will be branching soon. There is a thread on the devel list that I would encourage you to read that talks about the plan: how the branching is going to happen, when it's going to happen, how this will work out. You also have about two weeks to get a system-wide change into Fedora, if you would like. And getting a system-wide change into Fedora now means a higher chance of it maybe ending up in CentOS Stream, and definitely a higher chance of having to do less paperwork than later, because once it branches, it's a lot harder to get changes into Stream, for obvious reasons. But also, if you maintain anything that runs on Stream or on RHEL and you would like to get a head start, this is a good time to start playing with ELN and trying it out.
It's also a good time because once Stream 10 is fully branched, ELN will transition to 11, and we'll be able to start doing this all over again. As I said, you can join the SIG; the SIG hangs out on Matrix and on IRC, and the channels are bridged. If you maintain packages in EPEL, you can contribute them to ELN Extras. That's a link to the ELN website in general, which has information and details about the project. The documentation itself is on GitHub, as is most of the code that controls the build itself. And the Fedora minimization work and Content Resolver are what actually control the set of packages that you can find there. And I would be happy to answer any questions. Yes. Yes, that would be very helpful. The question was: we found this out because we don't run OpenQA on ELN, and whether that would be helpful. And yes, it would be very helpful. So, I think in general, testing composes would be useful. Which set of tests is something we would need to sit down and figure out. I don't know if doing the same tests that we do for Rawhide is useful; I expect a lot would fail just because it's a very different system. So, maybe looking at what kind of tests are done as part of qualifying Stream and ELN on the other side of the house, and seeing if it makes sense to port those to OpenQA, maybe. That could be interesting to look at. Any other questions? Yes. Oh, yes. I meant to add this, but I didn't know where to slot it in. Yes, there is support: there are Copr chroots for ELN. There is also support for Packit to build against ELN. So, what we actually do for our open source projects is use Packit to get continuous builds and signal, so that when folks internally land changes, they won't break the open source builds. And we build these in Copr for all the releases, including ELN, for packages branched to EPEL. That works really well. So, it depends. There are ISOs. How do you install ELN, was the question.
So, there are ISOs that you could theoretically use. I have never tried them. What I would do is the dnf --installroot process that you'd use normally to build images, or you can use your favorite image builder. Daan, who is here, landed support for building ELN images in mkosi the other week, so that's an option. I believe it should work with any other image builder too. But yeah, in general, I think this will depend a lot on what you want to do with it. Any other questions? So, that was the question I actually had, and that's where the discussion started. The question is whether we could fix the preset issue by providing an ELN release package with the right presets. The problem is that we don't know what the right presets will be, because you can't just blanket-use the ones from 9, since it's different enough. And my understanding is that the way these presets have developed as part of Stream is that they start from the Fedora ones and then over time get morphed into the new ones. You could just start with the ones from 9 and then adjust them over time. I would be in favor of having something, because it would make my life easier. I think it's definitely something we can revisit; this would be a good topic to discuss in one of the next SIG meetings. Yeah. The advantage that we have, of course, is that every once in a while we can look at what changed in the configs. But perhaps you could have somebody, or a bot, that every time there's a preset change in the Fedora package, opens an issue on the ELN package, so that people actually take a look at the change and decide what to do. I think that's a great idea. Yeah. And to repeat for the stream: the proposal was that we have a process where, whenever there are changes to the package in Rawhide, an issue is filed to see if the corresponding change is meaningful for ELN and for the future Stream as well.
I think that's a great idea, and it's probably something we should look at. Yes, Neal says that's probably meaningful only if the package is forked. I don't think so, because you might have packages where you want a service to be enabled in Fedora but not in Stream, because it's not supported, because it's not part of the default set. You had a question back there. So, I haven't seen breakage from this just yet. I expect I will at some point. I knew about this change because it came up in the last meeting, and, to repeat for the stream, there's a change in ELN where it will use x86-64-v3 as the baseline for x86_64, and this will cause breakage because any hardware older than a certain CPU generation, which is probably Haswell, I believe, don't quote me on that, will stop working. So we haven't seen breakage from this in production yet, because the systems I was testing on were all fairly new. I did start looking at inventory; I don't have hard data on this, but I am fairly sure we would see some breakage from it, and that's something we will need to deal with. Yeah, I think there's a long enough window, though, between now and when Stream 10 and RHEL 10 are released, that I expect it should be less painful than it looks now, hopefully. Yes, that should be the case. Well, I think it should change sooner rather than later. Certainly it will change in ELN; I hope it will also change elsewhere. It's not so easy, but I'm pretty sure it will change in ELN. And one interesting change is that until QEMU 7.2 there was no v3 support in TCG, so there was no way to do this with the TCG emulator, but now it's implemented. So right now, if you don't use KVM, or you cannot use KVM, you can actually run ELN in VMs with TCG.
Yep, and to summarize for the stream: one area that's impacted by this is VMs, because QEMU right now by default uses a CPU model that's older than v3, but that is getting addressed and should hopefully get backported as well. Yeah, I mean, we should have tests for a lot of things, but yes, I don't disagree there. I haven't done much testing on the virtualization stack, so I don't have signal on my end on this. I was going to talk to the virtualization folks on my side to see if they could hook this up into whatever they do. Sure, will do. Thank you. Anything else? Going once? Going twice? All right, thank you very much.
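As a closing aside on the x86-64-v3 baseline discussed in the Q&A: a rough way to check whether a given host meets it is to compare the flags in /proc/cpuinfo against the v3 feature list from the x86-64 psABI. This is an illustrative sketch, not an official tool; the flag-name mapping (LZCNT shows up as "abm", and OSXSAVE roughly as "xsave" in /proc/cpuinfo) is my reading of kernel conventions, so double-check against the psABI before relying on it:

```python
# Sketch: does this CPU meet the x86-64-v3 microarchitecture level?
# Flag names as they appear in /proc/cpuinfo; the psABI spells these
# AVX, AVX2, BMI1, BMI2, F16C, FMA, LZCNT, MOVBE, OSXSAVE.
X86_64_V3_FLAGS = {
    "avx", "avx2", "bmi1", "bmi2", "f16c", "fma",
    "abm",    # LZCNT is reported under the 'abm' flag
    "movbe",
    "xsave",  # approximation for OSXSAVE
}


def flags_from_cpuinfo(text: str) -> set:
    """Extract the flag set from /proc/cpuinfo-style text."""
    for line in text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()


def missing_v3_flags(cpu_flags: set) -> set:
    """Return the v3 flags this CPU lacks; empty set means v3-capable."""
    return X86_64_V3_FLAGS - cpu_flags
```

On a fleet, running something like this against inventory data would give a first estimate of how many hosts the v3 baseline change would strand.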