Okay, my clock says that it's time to start, so maybe I can start now. Thank you for coming, especially after lunch. Or maybe you came just to sleep, that's also a possibility. I will try to expand on what Brian explained in the first session just before lunch, about how we are accepting Fedora CI jobs in the CentOS ecosystem. So I will try to give you a little bit of information about what happens behind the scenes, but also about how the CentOS infra runs, what we already share as tooling with Fedora, and what's coming.

The first thing is the mandatory "who am I" slide, if people don't know me yet. My name is Fabian Arrotin, I'm a Belgian guy. I work on the CentOS project, I think I joined officially in 2007, and I'm in the CentOS infra team, also the QA team, and I'm a CentOS board member.

So we'll cover some of those things: I will give you a little bit of history about CentOS, because it explains where we are coming from, especially on the infra side. That will give you another view, and then we can dive into how we are slowly changing things in the process, what's different, and what's coming with the Fedora infra these days and in the future.

It started a long time ago. Who remembers when Red Hat decided to split between Red Hat Enterprise Linux and Fedora? Raise your hand. Three guys only, okay. So it started a long time ago, and that's how it started on our side: we embraced that. Myself, I was happy to test it, it was Fedora Core 0.99 if I remember well, the one that was supposed to become Red Hat Linux 10, and we were just happy with that. Except that for our own systems, at the infra level, we decided it would be really interesting to rebuild something, just for academic purposes, and only that. Nobody would have expected in those days that the CentOS project would be so widely used by other people all over the world.

So it started from a project called cAos Linux, which doesn't exist anymore these days. The goal of the cAos project itself was just to build a new distribution based on existing technology, like RPM mostly. A sub-project of cAos Linux was interested in rebuilding the Red Hat Enterprise Linux sources, just for academic purposes, to see how it was possible to rebuild a distribution on itself. And it was just for fun, believe me, it was just for fun. So it started basically from people having one machine in the basement or in the garage doing nothing, and spare time, and that was used to recompile packages, mostly.

So we started from zero infra, nothing at all. But then we got some publicity, advertising, I don't know how to call it, from someone who did a comparison of the rebuilds of Red Hat Enterprise Linux at that time. There was Tao Linux, if some people remember, Scientific Linux was starting to become a thing, we were there, and the first one was White Box Linux, which I myself was also using at the time. So we started with that, and suddenly some people were like: that's great, we are starting to use your packages and your distribution, so what do you need? Well, at that moment we had nothing, so: just one public machine, with a public presence like a website or forums with help. Here we go. Do you want to use the same machine as a first mirror?
Of course we wanted that, nobody refused a machine with a 100 megabit connection in those days. So we started slowly, and we got one, two, three, four more machines. Officially it really kicked off in 2007, when we had the first official presence for the project at FOSDEM 2007. And, believe it or not, we started discussing with the Fedora people, because we were sharing the same distro devroom at that FOSDEM event.

So we started really from scratch, but we needed to automate as much as possible, really, because we all had different day jobs and we didn't want to spend too much time on the CentOS infra itself. That's where we started to invest time into a thing called Puppet, which we used since version 0.something; I myself remember having migrated from 0.2x through all the following versions, up to 3.8. Before Git there was also another thing called Subversion, which we used a lot in those days to put everything under version control.

And before the cloud was a thing, we already had the pets-versus-cattle syndrome, because for all the machines that were donated to us, we had no guarantee that the machine would still be there the next day or two weeks later. So we had to consider the machines as cattle: great, the machine is there; the machine disappears, we don't care. So we had to automate as much as possible from day one.

Why do machines disappear? Well, there are plenty of reasons. The first one, which is logical, is a hardware failure. We had a lot of machines in those days which were just single machines with a single drive: SATA drive dying in a machine, machine dies on us. I would consider that the normal way for a machine to disappear; the monitoring system pops up and says our machine is not there, we have a contact, and okay, sometimes we get a replacement.

What is really more interesting is the next case, where we got contacted by a company who decided to say: hey, we want to sponsor you with a machine or two. But the machine keeps running, and you don't notice that the company was acquired by another company. And sometimes that second company was even acquired by a third one, and the third one says: oh, we don't care about open source, we don't have any special program for open source, and that machine is abusing a little bit too much of our bandwidth, because it's successful. So sometimes, if we are lucky, they contact us back to say they want to stop helping us, but most of the time the machine just disappears without any notification at all. Or the company goes completely bankrupt, and the only way to know is that you use Google, you find some press release about the fact that the company went bankrupt, and you just discover it. I don't think I even have a bullet point for that one.

Another interesting case is that sometimes we just proactively contact them, to see if they are still alive. I do that on a regular basis, just reach out, and sometimes they answer: yeah, we're still happy, and if you want a new node, a better replacement, that's possible. And sometimes I get an answer like: what? No, we don't run any machine for you. Yes, you do, here's the IP. Yes, it's in our subnet, but no, we don't. And then I'm scared, because I'm just wondering what kind of inventory they use. I had that issue two weeks ago.
I had a machine with a dying disk, I tried to contact them to ask: is the machine still running? No, no, no. So yeah, that happens to us a lot. That's one of the big differences, maybe, with the Fedora infra: we just run on community-donated machines. The C in CentOS means Community, not in the sense that we had a lot of developers contributing back in those days, but mostly companies sponsoring machines and bandwidth, because we needed it.

So we started with a kind of trust relationship with those donors, because you don't want to run something crucial or sensitive on a machine when someone just shows up out of nowhere and says: oh, I just want to give you a machine. So we just start by using those for something that is not crucial, like a mirror machine, because on the mirror itself everything is GPG-signed and checksummed anyway. And the first thing we do, because we don't want to spend too much time auditing the machine itself, is just to reinstall it from scratch, automatically. That's how we start with those donated machines. And we have a kind of process for that: as I said, we try to reach out to see if their support mechanism works fine or not, and based on that experience you know if that machine can be used for something else, or if it just stays as it is, as a member of the cattle. Depending on the response time and the experience we have with those guys, we slowly move more critical roles onto them. And it works, because CentOS is still running from donated machines, even these days.

So the infra was growing, but we had more or less the same issues as in Fedora. Except that, maybe surprisingly, in those days we had even more external mirrors than what Fedora has: at one moment we had more than 600 external mirrors fetching everything from us, from the packages and updates to the ISO images. So it's quite successful, I would say, and on our side, even with the number of machines we got from the community, it's not possible to sustain the number of requests. So for example, just to send the packages to those 600 external mirrors, we have what we call the msync role, mirror syncing using ACLs, to let only those external mirrors fetch from us. So sorry if you are not able to fetch directly from us, but we have to protect the bandwidth, because some machines among those 600 are connected at very slow speeds. I have some machines in mind in some parts of the world, like in Malaysia for example, where we have machines connected at 10 megabit per second to us, at the international level. Wait a minute, we are in 2018, 10 megabit per second? Yes, but keep in mind that bandwidth costs a lot in those areas, and their international connectivity is really limited, while locally, in the country, that's 100 megabit or gigabit connectivity. So even if it takes time for us to seed that machine, suddenly it's really fast for the other people and the other mirrors in the same country to get the content. And it saves bandwidth on our side as well, so it's a win-win situation.

We also quickly had to implement something that exists on the Fedora side too: GeoIP. Of course for yum, for the mirrorlists, so you are directed to validated machines in your own country, or a nearby country if there is no mirror in your own country. But we do that also at the DNS level, for records like mirror.centos.org: you are redirected to the machine that is closest to you.
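As an illustration of that country-based selection, here is a minimal sketch in Python of how a mirrorlist lookup can work; the geoip2 library, the database path and the mirror lists here are illustrative assumptions, not the actual CentOS mirrorlist code.

    import geoip2.database  # pip install geoip2; needs a GeoLite2 DB file
    import geoip2.errors

    # Hypothetical per-country mirror lists; the real data lives in the
    # mirrorlist service and the DNS zones, not in a dict like this.
    MIRRORS = {
        "BE": ["http://mirror.example.be/centos/"],
        "MY": ["http://mirror.example.my/centos/"],
    }
    FALLBACK = ["http://mirror.centos.org/"]

    def mirrors_for(client_ip, db="/var/lib/GeoIP/GeoLite2-Country.mmdb"):
        """Return candidate mirrors for a client, own country first."""
        with geoip2.database.Reader(db) as reader:
            try:
                country = reader.country(client_ip).country.iso_code
            except geoip2.errors.AddressNotFoundError:
                country = None
        return MIRRORS.get(country, FALLBACK)

    print(mirrors_for("193.190.2.2"))  # a Belgian IP, so Belgian mirror first

The DNS-level variant is the same idea applied in the nameserver: answer the query for mirror.centos.org with the address of the closest validated machine.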
So, what's different from the Fedora infra? Well, some of the applications that we run for the users are still different. As I said, mirror management at the moment is still different: it started from custom scripts, Perl initially, slowly converted to Python these days, versus MirrorManager. There was some discussion about switching to MirrorManager or not; in the past that was not possible at all, because we had no authentication system, which is required for that. Now that we have FAS, I will cover that later, a different instance of FAS, our own version, that would eventually be possible, and we had a discussion with Adrian, by the way, about eventually moving to that.

The bug tracker will probably just remain different. We had to select one of the open source solutions in the past, and Bugzilla was not on the shortlist back in the days. So we are still upgrading from Mantis bug tracker version 2 to the latest version, we are still autonomous there. It can be problematic from time to time, because we often just have to link to an upstream bug in bugzilla.redhat.com: you still have to file a bug on our side, and then we have to link it to the external bug report. So it can be tricky from time to time, but there is no solution for that right now.

For the wiki, if you have used ours for quite some time, you probably know that what was used was MoinMoin, which Fedora migrated to MediaWiki; we are still running MoinMoin these days. On that aspect we are investigating two things: either continue migrating from one MoinMoin version to the next, or switch to MediaWiki, for a simple reason: we know that it works for Fedora with OpenID authentication, and that can work on our side too. So that's one of the open points.

A big difference is probably the message bus. For Fedora there is a huge, huge message bus, fedmsg, which is really used for almost everything. It's really verbose, maybe a little bit too verbose sometimes. On our side we had no requirement to have that kind of bus in place. For the infra part we just built a small one, based on MQTT, which I like a lot and use for other projects, because it's lightweight, it supports TLS out of the box, and it supports ACLs out of the box. So we have that in place for some parts of the infra right now, and I will explain one specific case later.

Infra monitoring is also done differently: Fedora uses Nagios, and on our side we decided to use Zabbix a long time ago, mostly because the CentOS infra team was really small.
It's still small: basically, at the moment, it's just Brian and me. But when we started to investigate monitoring, Zabbix had some really cool features built in that we wanted to have, and that Nagios was lacking in those days. For example, out of the box it has a central DB to log everything, all the metrics that you collect. So we have all the data there, because we have migrated from Zabbix version to Zabbix version for more than 10 years, and we had no problem with that.

Another interesting concept is the fact that you have proxies. Remember that all the CentOS machines are spread all around the world: you can delegate the checking that everything works fine, and the metrics collection, to proxies. So, for example, you don't want a node in the USA to poll all the machines in Australia or Asia just to check that everything is fine; you delegate those tasks to a kind of proxy, which reports everything back to the central server.

It also has an API, which we use a lot to automate plenty of things. There are tools available written in Python, like zabbix-cli, and if you are using Ansible there are now modules for that, for example to configure the templates tied to a machine.

One example. I mentioned Puppet, which at the moment we are still running; we migrated from Puppet version to Puppet version, and we decided to use Foreman. I'm pretty sure that everybody knows Foreman; we use Foreman as a Puppet external node classifier, so everything from Puppet is in the Foreman database. Who in the room has already played with Foreman and Puppet, or is still using Puppet? One guy, two guys, but I know you are lying, I know.

One of the main points for me with Puppet is that it's great for desired state: you declare the desired state for the machine and you apply that locally. But what about the fact that the monitoring has to know everything about the other nodes? Puppet by default had no solution for that back in the days. The only way to do it was to export everything back into another DB, called PuppetDB, and then the monitoring server role just consumes everything that it fetches from PuppetDB, so it knows all the facts from all the machines and verifies that everything is still applied. If you multiply that by the number of nodes, you sometimes have Puppet catalog runs that take more than 30 minutes just to verify that nothing changed and everything is still good. As that was a little bit resource-consuming, and as I found it a little bit strange to have everything in Foreman, applied to the machine, exported back into another DB, and then applied by the monitoring server, I decided to use a kind of shortcut with Foreman hooks.

Foreman hooks are really interesting because they are triggered: when you add a node, when you modify a node, when you delete a node, it triggers something in the back. And that's where I was mentioning MQTT. At the moment, when for example we add a machine, or we just add some Puppet modules and Puppet roles on a machine, a hook will automatically notify the Zabbix server through MQTT, on a specific topic. On the other side, the Zabbix side simply subscribes to that topic and sees: oh, that node is now a mirror machine, or a DNS server, or whatever, and it will automatically apply the matching templates and checks directly, so it doesn't need to consume a lot of resources.
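To make that flow concrete, here is a minimal sketch of both sides in Python, assuming paho-mqtt and Zabbix's JSON-RPC API; the broker name, topic layout, credentials and template IDs are all illustrative, not the actual CentOS hook.

    # Publisher side: a Foreman hook script. Foreman invokes hooks on node
    # events; here we assume the roles are extracted and passed in.
    import json
    import paho.mqtt.publish as publish  # pip install paho-mqtt

    def notify(event, fqdn, roles):
        payload = json.dumps({"event": event, "host": fqdn, "roles": roles})
        # TLS and per-topic ACLs are enforced by the broker configuration.
        publish.single(
            topic="infra/nodes/" + fqdn,
            payload=payload,
            hostname="mqtt.example.org",
            port=8883,
            tls={"ca_certs": "/etc/pki/tls/certs/ca-bundle.crt"},
        )

    # Consumer side: a small daemon next to the Zabbix server, subscribing
    # to the topic and linking the matching template through the API.
    import requests
    import paho.mqtt.client as mqtt

    ZABBIX = "https://zabbix.example.org/api_jsonrpc.php"
    ROLE_TEMPLATES = {"mirror": "10101", "dns": "10102"}  # role -> templateid

    def zabbix_call(method, params, auth=None):
        body = {"jsonrpc": "2.0", "method": method, "params": params,
                "id": 1, "auth": auth}
        return requests.post(ZABBIX, json=body).json()["result"]

    def on_message(client, userdata, msg):
        data = json.loads(msg.payload)
        token = zabbix_call("user.login",
                            {"user": "api", "password": "secret"})
        hosts = zabbix_call("host.get",
                            {"filter": {"host": [data["host"]]}}, token)
        for role in data["roles"]:
            if hosts and role in ROLE_TEMPLATES:
                # Linking a template makes all its checks apply at once.
                zabbix_call("template.massadd",
                            {"templates": [{"templateid": ROLE_TEMPLATES[role]}],
                             "hosts": [{"hostid": hosts[0]["hostid"]}]},
                            token)

    client = mqtt.Client()
    client.on_message = on_message
    client.tls_set(ca_certs="/etc/pki/tls/certs/ca-bundle.crt")
    client.connect("mqtt.example.org", 8883)
    client.subscribe("infra/nodes/#")
    client.loop_forever()

In reality the two halves live in different places, a script in the Foreman hooks directory and a daemon near Zabbix, but the principle is exactly this: events in, templates applied, no full PuppetDB scan needed.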
So that's one of the reasons why we decided to go that way.

Now, what's common with the Fedora infra tooling these days? Well, it's not a secret that we use Koji, Brian mentioned that. It's not the Fedora instance, it's our own instance, CBS, the Community Build System, at cbs.centos.org. But if you are contributing to Fedora and CentOS at the same time, you feel at home, because you know the tool already: it's there, you know how to use it, and nothing changes.

Same thing when we finally decided to use the same authentication system that Fedora was using: FAS. Except, once again, that it's not federated at the moment, so it's still its own FAS database, which is accounts.centos.org. And we use that heavily, because we try to plug more and more things into the FAS system that we have. Because not everything can talk to FAS directly, we need something in between, exactly the same requirement as for Fedora, so nothing secret here: Ipsilon. We also use Ipsilon to provide a way to get to FAS through OpenID or OpenID Connect, and that permits us to have multiple applications using our authentication backend.

For the other common tools: if we consider that both projects have more or less the same requirements, let's say 80 to 90 percent of the problems to solve are the same, we are slowly migrating to the same tooling, the same toolchain, for the infra. One big change that is coming, well, big, it was a long time coming but it's finally happening, is migrating from Puppet to Ansible, for various reasons: we already had a lot of Ansible playbooks that we were using in the CentOS infra for deployment, for ad-hoc tasks, orchestration, etc., which Puppet was not able to do natively, and MCollective was something I was not keen on using. We already use Ansible a lot for the CI environment: everything in that environment, I will cover that later, is deployed through Ansible, end to end. So we'll migrate to Ansible soon, and that means, if you are interested in contributing to the infra or just having a look at it, we'll have some git repositories where we'll slowly publish everything, all the roles that we are converting from Puppet to Ansible.

Then the big change that is coming, which was announced if you were in the other room yesterday, is the git merge thing. I have to say thank you to Jim for having thrown me under the bus, so now I have to mention it. It's migrating to Pagure, or "Pagure" if I want to pronounce it correctly, but Pierre is not in the room, so... That change is coming. We'll have to do a lot of messaging around it, because it will change the way people are building through Koji, but at the same time it will just make more sense if you are already contributing to Fedora, because suddenly: oh, I know it, I know what to expect from it. And it's not migrating to Pagure only: what was announced yesterday is that slowly we'll also have some replication between the Fedora instance and our instance. So when you push to a branch that you have access to on the Fedora side, it will be replicated and visible on git.centos.org, in the Fedora branch.
And the reverse is true: if something is pushed on our side, it will be pushed to the other side and visible for everybody. One small remark that was not made yesterday: that's just for the git repositories. The lookaside cache content will not be synced, at least in the beginning, because it has a different directory structure. So that's something we can consider later. Another tool which is coming is documentation: we'll reuse what was announced recently for Fedora, for docs.fedoraproject.org. So we'll use the same tool, the same toolchain. No need to reinvent the wheel each time; it's better to collaborate and reuse, and that's what we'll do.

Now, some remarks about the CentOS CI environment. Brian gave a talk about the process, how a package is worked on and tested in the CI environment; that targets mostly apps.ci, which is the OpenShift setup. But we have more than that. We had some happy donors and sponsors, and Red Hat is the biggest one these days, since we joined. So thanks to them, we have some bare-metal nodes available for testing: outside of the machines that Brian was mentioning in the OpenShift setup, we have at the moment 256 machines in the ready pool, just waiting to be used for reinstall and test, I will cover that just after. We also have an RDO-based OpenStack cloud setup in place, so if you don't need a bare-metal machine but can run your things in a virtual machine, you can just use, or abuse, that cloud environment; I will cover that later as well. And yeah, OpenShift, but Brian mentioned it already.

We try to eat our own dog food: all the components that we are using there are also built through CBS and tested in CI, and then we use them, as a kind of matryoshka thing: you test what you produce on top of what you already built. And it works pretty well so far.

For the bare-metal nodes, as I said, it's just Ansible deployment, nothing fancy: really simple tasks that talk to the hardware provisioning machine and reinstall the machine, with kickstarts that are just basic templated variations depending on what we need. At the moment we cover CentOS 6 and 7, and it now also covers PPC64LE, so "pickle", as Peter would say. Nothing really fancy, but it works really well.

On top of that we have our own application, written and maintained by Brian, called Duffy, which is a kind of abstraction layer for the CI projects. When they want to get one, two, three nodes, a multi-node setup for one job is possible; at the moment, for bare metal, you can request up to, I think the limit is, six nodes per call. So you want six nodes in one shot, and one second later you get six machines available for you automatically, with your SSH key provisioned. There is more detail on the wiki page if you want, and there is also a link to the source on GitHub if you want to look at it. So, quite simple.
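For the consumer side, a node request to Duffy boils down to a few HTTP calls. Here is a minimal sketch, assuming the key-based REST endpoints that were documented on the CI wiki; treat the hostname, paths and parameters as illustrative, and the API key is obviously fake.

    import requests

    DUFFY = "http://duffy.ci.example.org:8080"  # Duffy endpoint (illustrative)
    KEY = "0000-1111-2222-3333"                 # per-project API key (fake)

    # Ask for six CentOS 7 x86_64 bare-metal nodes in one shot.
    session = requests.get(DUFFY + "/Node/get",
                           params={"key": KEY, "ver": "7",
                                   "arch": "x86_64", "count": 6}).json()
    hosts, ssid = session["hosts"], session["ssid"]
    print("got nodes:", hosts)

    # ... run the test payload over SSH on those hosts ...

    # Return the whole session; the nodes get wiped and kickstart-reinstalled
    # before they go back into the ready pool.
    requests.get(DUFFY + "/Node/done", params={"key": KEY, "ssid": ssid})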
Then the RDO-based OpenStack setup. I would say that it's simple as well, but usually when you say that, people tell you: are you crazy? You use "OpenStack" and "easy" in the same sentence? Something must be wrong with you. And probably yes, but we decided to keep it really simple. The current status is that we had a previous one based on the Newton release, and now it's running Pike. The deployment of the controller and all the compute nodes in that setup is also completely Ansible-driven: the machine deployment at the bare-metal level is done through Ansible, and then everything else is done through Ansible as well. So adding machines into the OpenStack cloud setup, everything is automated.

One big difference is that we don't consume the cloud the usual way people would consume a cloud. In the CI environment, the projects that want to run tests are not tenants, because, as I said, we use Duffy as a kind of abstraction layer to let people consume resources. That's the same thing for the cloud instances: you ask for a cloud instance through Duffy, Duffy itself is a tenant and can consume a lot of resources from that cloud, transparently, and when you no longer need the machine, the machine is just kicked out. That's the difference with bare metal, where the machine is instead automatically reinstalled with a kickstart. By the way, I just had a look this morning at the numbers for the reinstalls in the bare-metal setup: we had more than 570,000 physical reinstallations in ci.centos.org so far. I think that's quite impressive.

Even for the cloud, we tried to use the same principles, especially because we are lazy guys and we have to maintain it. When you look at a traditional OpenStack setup, there are a lot, a lot of components, because OpenStack is nothing more than a lot of services talking to each other's APIs. But you don't necessarily need all the components, and we wanted something really simple to manage. We just wanted to go with a flat network, for a specific reason: we don't control the switches anyway. So we kept the Neutron config simple, just using layer 2 and bridge mode, so that the virtual machines are in the same network as the bare-metal machines, for testing everything in the same pool. And as I said, we tried to be as minimal as possible, meaning that we just needed Keystone, of course, because that's the cornerstone of OpenStack, Glance for the image store, Neutron for the network, and Nova for the compute nodes, the hypervisor role, nothing more. We don't even need Horizon, the web UI, because it's just APIs: it's just the Duffy API talking to the OpenStack API, nothing more.
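From OpenStack's point of view, Duffy is then just another API client inside a single tenant. Here is a minimal sketch with openstacksdk of what booting and throwing away an ephemeral test VM can look like; the cloud name, image, flavor, network and key name are illustrative assumptions, not Duffy's actual code.

    import openstack  # pip install openstacksdk

    # Credentials come from a clouds.yaml entry; Duffy authenticates as a
    # single tenant and owns all the ephemeral test VMs.
    conn = openstack.connect(cloud="centos-ci")

    # Boot a throwaway VM on the flat, bridged network, so it lands in the
    # same L2 segment as the bare-metal test machines.
    server = conn.create_server(
        name="ci-ephemeral-node",
        image="CentOS-7-x86_64-GenericCloud",
        flavor="m1.medium",
        network="flat-ci-net",
        key_name="duffy",
        wait=True,
    )
    print("VM ready:", server.name, server.status)

    # When the job is done, the VM is simply deleted, not reinstalled;
    # that's the difference with the bare-metal pool.
    conn.delete_server(server.id, wait=True)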
So it's not a secret that we started collaborating with the Fedora infra, because we are facing the same issues, and it's happening more and more. And I'd like to say thank you to some people, like Patrick, because Patrick was really helpful for some of the implementations, like for FAS, for Ipsilon, and recently for PXE boot. So thank you, Patrick. Same goes for Pierre, who is not here today, but who was very helpful for the Pagure integration; some features you will see coming in Pagure are there because we requested them, we thought they would be interesting. And Smooge, who is not here either. But anyway, it's even easier to just say thank you to the whole Fedora infra team, that's easier for me. And more collaboration later. I think that's it on my side, so if you have questions, we have time and a microphone. No questions? That's easier for me.

Oh, one question. Can you use the microphone, so that I don't have to repeat your question? It's supposed to be working at the back. It should be powered on... and now it works. Collaboration effect, thanks.

Question: How do you decide on the kind of account management you use? For example, there are alternatives like FreeIPA; how do you make such decisions? And about the switch from Puppet to Ansible: are you still using Foreman, or are you using something like Tower?

So, two questions. The first one is about authentication. Initially, back in the days, we decided to use something centralized, because it was a problem that the forums were using their own authentication backend, and then there was another system in the wiki, another in the bug tracker. So we needed central authentication. We had a look at what was available, and we also had a look at what Fedora was using, because one of the requirements was a community portal: community self-registration, so people could register themselves for an account and then be promoted. If you have a look at IPA, IPA is really aimed at the enterprise; it was lacking that kind of self-registration portal, nobody can self-register in IPA except if you write something on top, and in those days it was more just for Kerberos authentication. And the way we deployed Koji initially for the build system, we were using X.509, so TLS certificate authentication, which FAS was doing. So FAS was meeting the requirements. But if you are contributing to Fedora, you know that things are slowly changing on the Fedora side: there is already Kerberos now behind the scenes, because FAS is talking to an IPA backend. So that's something we will probably just need to spend time on, to catch up and be at the same level as the Fedora guys, and decide to finally migrate to an IPA backend, for multi-master replication, etc. So that was the first question. Does that answer your question? Okay.

The second one was about migrating from Puppet and Foreman to Ansible. That question is, I would say, maybe still open, in the sense that, while I liked Foreman a lot, for Puppet it made sense, because Puppet was its primary target: deployment, and the Puppet dashboard. But the way you can integrate Ansible with Foreman these days is really limited: you can't run ad-hoc tasks, you can just tag roles, nothing more, not even playbooks using roles or specific orchestration. So it doesn't fit, it doesn't work for us. We had a look at Tower, well, not Tower, AWX, and at the moment I was surprised that it's lacking some of the features we want. On the authentication level it should be possible to use OAuth tokens now, so going through Ipsilon would work. I'm also discussing with the Fedora guys, because they are facing the same question; they are already using Ansible, but just plain Ansible, natively. On the other side, I admit, well, it's not a secret that we use Jenkins a lot for CI, right? In a previous job I was also already orchestrating things, using, abusing, Jenkins on the operations side, launching Ansible playbooks and delegating those tasks, those roles, to for example people from release management. So I revisited that, doing a comparison of what AWX can do versus Jenkins. It may sound like a crazy idea, but for us, even just using Jenkins as a kind of cron executor on top of Ansible playbooks makes sense.
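To make that "Jenkins as cron executor" pattern concrete before the explanation that follows, here is a minimal sketch of the kind of wrapper a Jenkins job could run, assuming Ansible's bundled junit callback plugin; the playbook and inventory paths are made up.

    import os
    import subprocess

    def run_playbook(playbook, inventory, report_dir="reports"):
        """Run a playbook from a Jenkins job, emitting JUnit XML."""
        env = dict(
            os.environ,
            # Enable Ansible's bundled junit callback plugin...
            ANSIBLE_CALLBACK_WHITELIST="junit",
            # ...and tell it where to drop the XML files.
            JUNIT_OUTPUT_DIR=report_dir,
        )
        os.makedirs(report_dir, exist_ok=True)
        # The Jenkins slave for this region runs the playbook locally,
        # close to the managed machines; only the result travels back.
        return subprocess.run(
            ["ansible-playbook", "-i", inventory, playbook],
            env=env,
        ).returncode

    if __name__ == "__main__":
        raise SystemExit(run_playbook("playbooks/mirror-update.yml",
                                      "inventory/apac"))

Jenkins' JUnit publisher can then pick up reports/*.xml and draw pass/fail graphs automatically, which is the "nice graphs for free" effect mentioned just after.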
I can explain why; I have a slide just for that, because I give a whole talk just about that, but I don't think we have enough minutes left, so in two minutes: we have machines everywhere, so latency is an issue, especially for Ansible over SSH. You have probably heard of the Mitogen thing, which is not in Ansible core, but is developed by a guy from the UK to speed things up; you should have a look, it's really promising. But still, if you have one AWX machine, in the US for example, targeting machines in Australia, or China, or Malaysia, or Brazil, it takes a lot, a lot of time just to run the set of playbooks. While in Jenkins, what you can do is have one master and multiple slaves, right? That's the traditional way for CI and building packages; try to apply the same principle to operations. You have one Jenkins master that knows everything about how to call a specific Ansible playbook, and you can divide that by region: you can have one slave per region, or even per country, depending. And that slave is the executor close to the machines, launching the playbook and reporting the results back to the Jenkins master in one location. Jenkins also supports JUnit, which Ansible can produce, it has a JUnit callback, so you even get nice graphs in Jenkins automatically. It's not meant to be used on top of Ansible, but it works fine. At the moment, in AWX, there is no such concept: there is a multi-node setup, that's true, but more like HA, using the same Postgres DB. And no, I don't think that I want, in Europe, to access a Postgres over a VPN just to be able to locally launch my playbook. Does that answer the question? Great. We can discuss implementation details afterwards if you want.

Other questions? Yes, the microphone at the back, please.

Question: Actually, I have two questions. One is about the Ansible repo that you guys use, with all the playbooks: is that open source, is that public?

Not yet, because, as I told you, officially we were a Puppet shop, and I think I probably just forced Ansible in a little bit too much, so it was semi-official. But now that we have decided to officially migrate from Puppet to Ansible, we'll review everything and publish everything, probably on GitHub as a first step, but after that on Pagure, when we'll have migrated to Pagure. So yes, that's the goal.

Question: Okay, my second question is: can you compare, if a contributor from the community wants to start volunteering on the CentOS infra, the process for doing that versus the process that we use in Fedora for onboarding new people?

The answer may surprise you, but that will be really easy, in the sense that, since the question is about comparison, there is nothing to compare: you have a process, we don't. That's a problem, and that's a problem that we really have to solve; that's a reason why I wanted to attend your talk, but I wasn't able to, because I was busy with Patrick doing something. And sure, that's a reason why I would like to collaborate more, because we can learn a lot from you, and probably you can learn some things from us, because it makes no sense for each of us to search on our own side to solve exactly the same problems. We can learn from each other. That's why we said: collaboration, more and more, right?
That's exactly the point where we want to go: we want to publicly show everything that is used, so that people can just send pull requests, for example against code in the infra, without even a need to touch the machines, or just get inspired, or whatever. That's basically how we want to do things. But yeah, that's a process that we really have to write down, and have some people spend time on, because it's true that we have nothing there. That's probably one of the legacy issues of the CentOS ecosystem: it was still based on the idea that it was just three guys doing that somewhere, not communicating a lot, because they had no time to communicate a lot, or even time to onboard other people. That's one thing that we should solve, now that we are trying to, with the CI system and also at the infra level. Thank you. No more questions? Okay, thank you then.