Hi, everybody. Welcome to this new Jenkins infrastructure meeting. Today, on the agenda, we mainly have a few minor outages to cover that happened over the weekend and over the past few days.

The first one that I want to mention: over the weekend, the job used to grant permission to publish new plugins was broken, and that job runs on trusted.ci, which means that the new workflow to release plugins was broken for the people who tried to use it. Considering that we only have five plugins using that workflow, it was not a major issue, but still annoying for those people who tried to release a plugin. I haven't really looked at what the root cause was, so maybe Gareth can help on this? But at least it reminds us that trusted.ci is a very protected Jenkins instance where only a few people have access, which means that when we have an issue on that instance, it takes us a long time to discover it. So one way to detect problems there would be to configure a notification, like sending an email when a job is failing, or notifying on IRC. We definitely have to investigate some options to be notified about failing jobs on that instance.

Just to remind people, trusted.ci is a Jenkins instance running in a private location where we run really few jobs, but those are quite important. Typically, we build the Jenkins Docker images and we generate the jenkins.io website. We also have a job, the repository permissions updater, which handles permissions and grants, for instance, plugin maintainers the right to publish artifacts on our Artifactory service. That was the job that was failing over the weekend.

Everything has been resolved. For some reason that I still have to investigate, Damian had to restart the instance, and the machine lost its public IP. The public IP is used to establish the connection with our LDAP service: the LDAP service whitelists some IPs, and because the trusted.ci IP changed, we weren't able to use it to authenticate. So the outage on that specific machine had some downstream issues, but everything is back to normal.

The good point is that it gave us an opportunity to test the status service, status.jenkins.io. Daniel opened a PR on jenkins-infra/status to report the issue. What we identified is that only one person, I mean really few people, had the ability to merge PRs there; I was not one of them. I have now fixed that by allowing more teams, typically people from the core team and people who are on call right now, so more people will be able to merge PRs there in the future. Otherwise, the process to report the issue was pretty easy to use and pretty easy to review. We identified another small issue, which is that we don't mention the time zone in the ticket. So I have to update the documentation to include the time zone in the date, because it was not really obvious how long the issue lasted; I think it took several hours, almost three hours, to resolve it.

So, Olivier, on the time zone topic: is it okay if we express it in whatever time zone we're in? Would I be allowed to say mountain time, or is it best that I always express it in UTC? No, I think the tool can handle the time zone. I saw a PR regarding that; I have to double-check if it's already included in the version of the tool that we are using. But I think the fallback would be to use UTC if we cannot have dynamic time zones. Otherwise, I'll try to configure it; maybe it's already there, but I have to double-check. Any questions regarding this outage?
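(As an aside on the time zone point: a minimal sketch of the UTC fallback discussed above, assuming Python's standard zoneinfo module and a made-up incident time noted in US mountain time; the status tool itself may handle the conversion differently.)

    from datetime import datetime
    from zoneinfo import ZoneInfo

    # Hypothetical incident time noted in US mountain time by the reporter.
    local_time = datetime(2021, 3, 20, 14, 30, tzinfo=ZoneInfo("America/Denver"))

    # Convert to UTC so the status page entry is unambiguous for every reader.
    utc_time = local_time.astimezone(ZoneInfo("UTC"))
    print(utc_time.isoformat())  # 2021-03-20T20:30:00+00:00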
While we are talking about issues on trusted.ci, I just spotted another issue right before the meeting, which is that the plugin site is not updated anymore. The job that generates the plugin site data has been failing since midnight, I mean, sorry, since 16:00 UTC, that will be easier for everybody. So the job has been failing since 16:00, and the error is pretty obvious: it says that we are trying to fetch data that does not exist on the API, so there is a GraphQL error. I have to double-check with Gavin to see what's wrong there. What surprised me is that in the past we had a monitoring check that detects how old the plugin site data is, and this time the check did not trigger. So I have to double-check with Gavin whether the API endpoint that we are monitoring changed, which is possible because we switched from Elasticsearch to Algolia.

Well, that monitoring check we're using seems aligned with what Gareth lobbied for earlier, that we should be user-centered. If the plugin site is not updating, that's a user perception issue: users won't see things that arrive. So that makes it, to me at least, a very valuable check. Gareth, help me on that: is that the kind of thing you were thinking of, or is that still not user-facing enough?

Yeah, I think that would make sense. I mean, the lack of a plugin being able to be downloaded is the problematic thing, isn't it? That's something that somebody would experience. Just going back to the outage of the weekend, whilst we could monitor the failure of the job, what we're actually interested in is not necessarily the failure of the job, but the failure of being able to release. I'm not sure how we would find this out, but it's that particular GitHub Action triggering the release that is failing because it can't upload the artifact; that's what the user would experience as the problem. Because the job could fail, but the credentials could still be valid for, I don't know, six hours, twelve hours, I'm not sure how often they're rotated. So there would be no impact until the point at which they expire.

So an option would be to monitor how long the job has been failing: if no jobs succeed within an hour, then the severity of the issue increases. But we should not trigger if only one job fails, because that could happen for any reason. The thing is, right now we don't have any notifications configured on our Jenkins instances: we don't send emails, we don't notify IRC. I tried to use the Datadog plugin in the past, but that was not really successful, so we removed it for now. But I think I'll end up configuring IRC notifications. The reason why I like IRC is that, if we can keep the volume down to a pretty small amount of notifications, it's useful to have that information on IRC because we always have one person available who could look at it. We've been using IRC notifications for the Puppet master for a while, and this has been really useful. That said, we usually don't get many, because we don't have a lot of notifications, maybe one or two; we just have to check that we don't spam the channel.

And do those notifications go to #jenkins-infra, or where? For Puppet changes, it goes to #jenkins-infra. Sometimes we see something like "Puppet failed on that machine", and it's pretty obvious that for some reason the Puppet apply did not work on that machine. Most of the time it's just a full disk, or trying to install a package that doesn't exist; there are different reasons why Puppet would fail. But because it does not happen regularly, we usually fix the issue within 15 minutes of the notification: look at it, fix it, and that's it.
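(On the idea of only escalating after an hour without a successful build: a minimal sketch against the standard Jenkins JSON API; the job URL and the one-hour threshold are placeholders, and how the alert is actually delivered, email or IRC, is left open.)

    import json
    import time
    import urllib.request

    # Hypothetical job URL; trusted.ci itself is private, so this is a placeholder.
    JOB_URL = "https://trusted.ci.example.org/job/plugin-site-data/"
    MAX_AGE = 3600  # only escalate after one hour without a successful build

    # The Jenkins JSON API exposes the last successful build and its timestamp (ms).
    with urllib.request.urlopen(JOB_URL + "lastSuccessfulBuild/api/json") as resp:
        last_success = json.load(resp)

    age = time.time() - last_success["timestamp"] / 1000.0
    if age > MAX_AGE:
        # A single failed run is ignored; only a sustained failure window alerts.
        print(f"No successful build for {age / 3600:.1f}h - escalate and notify")
    else:
        print("Last success is recent enough, nothing to do")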
So another thing we should definitely do is start migrating some of the jobs running on trusted.ci over to infra.ci; some could easily be moved. Those I have in mind, for instance, are the job that generates the Javadoc and the job that generates the jenkins.io website. Those could easily be moved to infra.ci. It also means that we've got a proper backup of the jobs as well, because it's all in Git. Oh, okay, infra.ci is already managed as code? Yes. Okay. It's just that the image at the moment isn't, I mean, it downloads the plugins each time. But after the chat with Olivier this morning, that will soon be rectified. Yeah, that was a minor permission issue that I had to fix this morning.

So, as Gareth mentioned, he has been working on a process to automatically build the Jenkins Docker image for the Jenkins infra project. The idea is that each time a new version is released, a new weekly or a new stable version, we fetch that information and build a new image. But also, each time a new version of a plugin is available, we update the Docker image containing that plugin version. So the idea is to directly ship a Docker image with everything packaged.

In terms of stability: in the current situation, so without the work that has been done by Gareth, we use configuration as code to install everything and we use a Helm chart, and in the Helm chart configuration we list the plugins that we need. Each time the Docker image starts, it tries to reinstall every plugin. So typically what happens is, if for some reason we have to restart the Jenkins instance, and it has already happened in the past, and then we cannot install plugins because there is an issue with the mirrors, it can take us like 15 minutes to start the service even if we didn't change the plugins, just because by default we remove the plugins from the disk and try to reinstall them. It slows down the startup time, and that's what we want to change. Basically, what we were missing is git tags, or at least a way to clearly identify which Docker image contains which version of Jenkins and which plugins, so we can roll back in case of issues. This is something that should be solved pretty quickly. And I think that's mainly it; there are no major changes coming.
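(On identifying which image contains which Jenkins and plugin versions so we can roll back: a minimal sketch of one possible tagging scheme, assuming a plugins.txt file listing pinned plugin versions; the file name and tag format here are illustrative, not necessarily what Gareth implemented.)

    import hashlib
    from pathlib import Path

    def image_tag(jenkins_version: str, plugins_file: str = "plugins.txt") -> str:
        """Build a tag that pins both the Jenkins core version and the plugin set.

        The plugin list (e.g. "configuration-as-code:1.55" per line) is hashed so
        that any change to a plugin version produces a new, identifiable tag that
        can be rolled back to later.
        """
        plugins = Path(plugins_file).read_bytes()
        digest = hashlib.sha256(plugins).hexdigest()[:12]
        return f"{jenkins_version}-plugins-{digest}"

    # Example: a pinned plugin list would normally live in the image build context.
    Path("plugins.txt").write_text("configuration-as-code:1.55\nkubernetes:1.29.2\n")
    print(image_tag("2.289.1"))  # e.g. 2.289.1-plugins-<12-char digest>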
There is a discussion that I would like to start, so let me share it here. The idea is that I'm looking for ways to manage the permissions of contributors, of people who contributed to the jenkins-infra organization and who don't contribute anymore. The idea is to remove their write permissions, to be sure that if they don't contribute, they cannot merge a PR. But at the same time, I would like to keep them in the jenkins-infra organization so we can still notify them if we expect some PR reviews or whatever.

So I've been thinking of creating a team that would be named "alumni", and by default, every person who doesn't contribute to the project anymore, I would just put them in that specific team. We would still be able to notify them if we need some reviews, and their PRs would still be considered, but at the same time we would need someone more active to merge a PR. The people that I have in mind are people like Tyler, Kazuki, Marky Jackson, people who have officially stepped down. We have a lot of people in the jenkins-infra organization who haven't contributed to the project for a really long time. I don't know if you have an opinion on this.

Yeah, the concept of an alumni group sounds really good. So you say it would allow them to read but not merge, so they can still see what's happening and give their insights? Yeah, the idea is to give them read permission on public repositories and private repositories. The one thing I'm maybe a bit concerned about is the notifications, because if they are read-only on every git repository, by default they will receive a notification each time someone opens a PR or whatever. But then it would be the responsibility of the person to opt out.

Why would they get notifications? And why do they need read access on a public repo? It's a public repo. So it's a public repo, but having them in a team would allow us to notify them: you could just mention the team name and they would receive a notification specifically. But you wouldn't use the team, though? No, you would not need to use the team. But, for instance, when you open a PR and you want to assign that PR to someone or request a review, that person needs to be in the organization, so they need to be in a team somewhere. Yeah, they need read access for you to make them a reviewer, I think, unless they've contributed to the repo before or changed that file recently or something. But they shouldn't get any notifications just by having read access. Okay, so that was my only fear, but if they don't get notifications, then that's perfect. The thing is, we have quite a lot of people that would be in that scenario. And they would still have the small Jenkins infra logo on their profile as well. But yeah, that's it. I have to send a notification to the Jenkins mailing list later today.

Otherwise, is there anything else you want to bring up here? Otherwise, I propose we finish the meeting early. Going once, going twice.
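(A minimal sketch of what the alumni setup could look like through the GitHub REST API, assuming a token with admin:org scope; the member and repository names are placeholders, and in practice jenkins-infra would more likely drive this through its existing configuration tooling.)

    import requests

    ORG = "jenkins-infra"
    TOKEN = "ghp_..."  # placeholder: a token with admin:org scope
    HEADERS = {"Authorization": f"token {TOKEN}",
               "Accept": "application/vnd.github+json"}
    API = "https://api.github.com"

    # 1. Create the alumni team ("closed" means visible to all org members).
    requests.post(f"{API}/orgs/{ORG}/teams", headers=HEADERS,
                  json={"name": "alumni", "privacy": "closed",
                        "description": "Former active contributors, kept for pings and reviews"})

    # 2. Move a former contributor into it (placeholder username).
    requests.put(f"{API}/orgs/{ORG}/teams/alumni/memberships/some-former-contributor",
                 headers=HEADERS, json={"role": "member"})

    # 3. Give the team read-only ("pull") access to a repository, so its members
    #    can be requested as reviewers but cannot merge.
    requests.put(f"{API}/orgs/{ORG}/teams/alumni/repos/{ORG}/some-repository",
                 headers=HEADERS, json={"permission": "pull"})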
Yeah, oh, go ahead, Tim. I was just going to ask: has anyone done the sync for the plugins affected while trusted.ci was down? Has someone synced the plugins that were released while trusted.ci was down over the weekend? Sorry, are you talking about trusted.ci and the release permissions? So when trusted.ci was down, whatever plugins were released from Friday to Monday weren't published to the update site, well, to get.jenkins.io. Okay. Yeah, I haven't checked that yet.

Yeah, the update center by default only syncs the last three hours of releases, and it relies on the fact that it runs every three minutes. There are flags that you can pass to the update center to increase that number. Or you can sync everything, but syncing everything takes hours. But if you just add a filter for, like, five days, it's going to be much quicker.

Okay, Tim, I'm not sure I'm getting it. The update center sync is hosted on trusted.ci and was down for the weekend? No. So basically, plugin maintainers have released plugins. The update center is built every three minutes; it outputs a JSON file with the recent releases, the last three hours of releases as of when it runs, and then it runs a script on the package host. Yeah. So basically that script could be run manually, right? Yeah, it can be run manually, but it's easiest to just let the update center build the list for you. Or you can manually check Artifactory if you want, but that's not the easiest. Because for me, I have the feeling that it would be easier to just run the script manually on the machine, so we don't change the update center job. It's up to you; it's just a flag that you pass to it for how many hours you want it to output. Is it also the script that uploads artifacts to get.jenkins.io? Yeah. Okay, right, I see which script you're thinking of. So we have to be sure that it takes into account the plugins released since Friday. Yeah. Okay, I'll look at it. I think it would be easier to just run the script manually. I'll sync with Daniel as well.

Yeah, Daniel fixed it on Thursday; there was an outage last week as well. I think the certificate got under 30 days, and basically the same thing happened last week. Daniel did it last time, but then trusted.ci went down again on Friday and it needs doing again.

That's a good point, and you remind me of something I forgot to mention. When the update center job runs, it tests that the update center certificate is valid, I mean, valid for at least one month. We have to rotate the update center certificate, and we are planning to do that next week, on the 29th of March. But because the current certificate is expiring in less than one month, the job was failing. So Daniel just temporarily removed that condition about failing if the certificate expires within a month. He normally just changes it from 30 days to 14 days or something; I'll check how long he set it to, because it might trigger again. I told him that I would rotate the certificate next Monday, so it should be okay. Yeah, it should be okay. The certificate is pretty easy to rotate: I just generate it, upload it, and update the configuration on trusted.ci. I've done that multiple times and each time it takes me 15 minutes or so. But yeah, I'll work on that next Monday. Okay, right. Thanks, Tim, for that reminder.
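(A minimal sketch of the kind of expiry check being discussed, assuming the certificate is available as a local file and that the openssl CLI is installed; the real update center job performs its own check, and the exact threshold is whatever Daniel set it to.)

    import subprocess
    from datetime import datetime, timezone

    CERT_PATH = "update-center.cert"  # placeholder path to the certificate file
    THRESHOLD_DAYS = 30               # the job fails below this; lowered to 14 as a stopgap

    # "openssl x509 -enddate" prints a line like: notAfter=Apr 11 12:00:00 2021 GMT
    out = subprocess.run(
        ["openssl", "x509", "-in", CERT_PATH, "-noout", "-enddate"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

    not_after = datetime.strptime(out.split("=", 1)[1], "%b %d %H:%M:%S %Y %Z")
    remaining = (not_after.replace(tzinfo=timezone.utc) - datetime.now(timezone.utc)).days

    if remaining < THRESHOLD_DAYS:
        print(f"Certificate expires in {remaining} days - rotate it before the job starts failing")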
Any other last-minute topic? So, the weekly release: I haven't checked to see if the weekly release has completed yet. I restarted it and it's running, or it was running hours ago. It's now published, at least in part. So I'll run the weekly checklist later today to be sure that the Docker image is available, et cetera. We had a small hiccup earlier today with a timer issue on the Jenkins agents, so we had to re-trigger the build, but everything went fine the second time. The good thing is we don't often have issues with the release environment; it's not like that's a common issue, but we have to keep an eye on it. Well, thanks everybody for your time.

Yep, I just want to say, I don't know when it expires, but he set it to trigger when there are 14 days left, so the job will stop failing for now. Okay, and when does it expire? Olivier, if I recall correctly, the certificate expires April 11, so that's right on the cusp of that 14 days. Yeah, so we either have to increase the threshold, or we rotate it earlier; that's another option, we rotate the certificate this week. Yeah, it's going to hit that threshold the day before, looking at my calendar, unless something gets changed. So I mean, the job is going to fail the day before. It's easy enough to ask Daniel, or we could submit a pull request to give it 12 days instead of 14, right? Then we don't have to change anything, and we don't have to have you working on a weekend, Olivier, to do that certificate rotation. I'd rather stay with the 29th as the rotation day, if it's okay with you; otherwise we can do the certificate rotation on Friday. Yeah, and I'm more prone to Monday; just personally, we announced it for Monday. I just assumed Monday. Okay, then let's ask Daniel to change the date. I just posted that. Okay, thanks. Okay, thanks everybody for your time, and see you on IRC. Have a great day. Bye-bye.
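(For the date reasoning above: a small sketch, assuming the year is 2021, that the certificate expires on April 11, and that the check fails once fewer than 14 days remain.)

    from datetime import date, timedelta

    expiry = date(2021, 4, 11)          # assumed expiry date mentioned in the meeting
    threshold = timedelta(days=14)      # the check fails once fewer than 14 days remain
    rotation = date(2021, 3, 29)        # planned rotation day, Monday the 29th

    first_failure = expiry - threshold  # 2021-03-28: the day before the planned rotation
    print(first_failure, first_failure < rotation)  # 2021-03-28 True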