Hi, everybody. Welcome to this new Jenkins infrastructure meeting. Today we are six, and that's awesome. The first thing I want to announce before we start: we have a new player in the game. Hervé just joined the community team at CloudBees to work with us on the Jenkins project, and he will start helping us on the infrastructure. But don't overrun him with stuff, just one thing at a time; at the moment he's still learning and following us.

On to the topics, and we have quite a few on the agenda. The first one: I misunderstood the results of the Jira maintenance. It was not enough to fix the root cause, so Unicode still does not work in Jira at this stage. I have a follow-up to do there, and I need to contact Anton to see what the next steps would be. Just for the joke, I have to remind you that GitHub supports emoji in issues, just saying. I don't think that alone would be the killer feature that makes us switch from Jira to GitHub issues, but anyway, that's something we have to figure out next week.

The next topic is about Jenkins core security updates. We got a security release on Wednesday, for the weekly and for the stable line, and it was not really smooth, for several reasons. Daniel was not available to help us on Wednesday, so we had to do the security release with Wadeck, whose working environment was not up to date. At the same time, we discovered several issues that were not related to Wadeck at all.

The first interesting one was with packages.jenkins.io: we struggled to allow Wadeck to connect to packages.jenkins.io, and Mark could not connect either. Basically, all the accounts that had been added to packages.jenkins.io manually after we stopped the Puppet agent were broken. We found the issue and fixed it.

The second issue was that the update center job, which is a requirement for any release whatever the type, was broken because the Jenkins agent could not find the Java path. That issue was related to some work done by Damien, and Damien was able to identify the root cause there as well. We know that we rely on a Jenkins instance named trusted.ci which is not defined as code, which means that when we improve one machine, that improvement does not reach trusted.ci.

Sorry to cut in, but that's no longer true: trusted.ci is now managed as code. The actual reason for the breakage is that I pushed a change aimed at fixing issues that had appeared a few days earlier, and that change turned out worse than the thing it tried to fix. I should not have pushed a change on the day of a release; I'm the culprit there. But we do have configuration as code and synchronization now, so if we push something that doesn't work, it won't work anywhere.

Thanks. We still have improvements to make there, though, because many of the jobs are still maintained manually, so we are just at the beginning of the story of configuring trusted.ci as code. I was not able to follow the rest of the release, but it appears it took longer than expected. So no, that security release was not exactly a smooth ride. Wadeck Follonier has been gathering a retrospective document.

I added an item to it related to the weekly release checklist, which doesn't exist yet, and that exercise surfaced some gaps. We had a formatting error in the changelog, and we didn't publish the GitHub release of the weekly, although we did publish the GitHub release of the LTS. Minor gaps like that, and a reminder that we need to think about it; I can put those notes in.

Do we actually follow release steps? I mean, when we do a weekly release, we don't have a lot of steps to do. We need to publish the GitHub release? Yeah, that's the one thing most often missed, and I need to discuss it with Tim Jacomb, because there may be an automatic way to make that happen so that we don't even have that step.

Okay, thanks. There were also a lot of missing authorizations. Wadeck did not have all the authorizations on all the repositories. I took part of it, and I spent a few hours just waiting in case he had an issue. For instance, he did not have the right to merge on jenkins.io, or to push directly to the master branch of jenkins.io. Everything is written in his retrospective, but those are also authorization issues. That means we can definitely improve by having a checklist of things to verify beforehand, especially whether you have the authorization here or there.

There is one other element which was also very annoying: we don't have a reliable way to disable the weekly release. I know that Daniel and Wadeck disabled the weekly release multiple times, and each time we made a change to the instance, even a minor one, disabling the job was reverted. Even on Tuesday, several hours before the security release, we had to disable the weekly release because it had been re-enabled again. The problem is that we configure everything as code: everything is defined in a Git repository, and the instance always tries to go back to that definition. That means that if we want to disable the weekly release, we have to do it in the code definition itself, and we need a way to do that easily, because each time we disable the job through the UI, it's re-enabled as soon as someone makes a modification. Which is how configuration as code is supposed to work; it just reminds us we've got another step. Yeah, exactly.
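As an aside, the durable fix is exactly that extra step: flip the disabled flag in the job definition itself and let the seed job reconcile. Here is a minimal Job DSL sketch of the idea, assuming the weekly release job is seeded from a Git repository; the job name and repository URL below are made up for illustration, not the real ones.

```groovy
// Hypothetical Job DSL seed definition for the weekly release job.
// Toggling the job in the UI is reverted at the next reconciliation,
// so the only change that sticks is committing disabled(true) here.
pipelineJob('core/weekly-release') {          // job name is illustrative
    disabled(true)                            // set back to false to resume
    definition {
        cpsScm {
            scm {
                git {
                    remote {
                        // illustrative repository URL
                        url('https://github.com/example-org/release-jobs.git')
                    }
                    branch('master')
                }
            }
            scriptPath('Jenkinsfile')
        }
    }
}
```

Re-enabling is then a one-line revert, which also leaves an audit trail in Git of who paused the release and when.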
Any last comment on that topic? No? Okay, right. The next one I want to bring up is mainly for Damien, who has been working a lot on the EKS cluster. Damien, do you want to take it?

So we were hitting a limit on the number of available IPs, because the subnets are shared between the physical machines, the workers, the virtual machines, the pods, and other virtual IPs; every pod consumes an IP of its own on top of the nodes'. This morning I had to do an operation to increase the subnet size to reach the capacity we were expecting, which means a maximum of 150 containers at the same time, which translates to 50 virtual machines up for the cluster. Automatic scaling works as expected, and the operation is finished, although, and I will come back to that, it took down the cluster for an hour and a half.

And now we hit a new limit: the rate limit on Docker Hub. Even if we authenticate with the free plan that we have, we will still hit it, because the rate limit is per IP, and since we have a private network of machines, all requests come from the same public IP from Docker's point of view.

Can we not use multiple ranges of IPs for egress connections? We could, but that would cost almost as much as a full enterprise Docker subscription. I would prefer going in the direction of pushing the Docker images both to Docker Hub and somewhere else, GHCR or wherever, because the cost per gigabyte is almost nothing for a Docker registry. That way we would have each image in two different locations.
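As a side note on measuring the problem: Docker Hub exposes the remaining per-IP quota through response headers on a dedicated rate-limit preview image, and a HEAD request against it does not count as a pull. A small standalone Groovy sketch using an anonymous token; the endpoint and header names are the ones documented by Docker, and the output format may evolve:

```groovy
import groovy.json.JsonSlurper

// Request an anonymous pull token scoped to Docker's documented
// rate-limit preview repository.
def tokenUrl = 'https://auth.docker.io/token' +
        '?service=registry.docker.io' +
        '&scope=repository:ratelimitpreview/test:pull'
def token = new JsonSlurper().parse(new URL(tokenUrl)).token

// A HEAD request on the manifest returns the quota headers without
// consuming a pull.
def conn = new URL('https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest').openConnection()
conn.requestMethod = 'HEAD'
conn.setRequestProperty('Authorization', "Bearer ${token}")
conn.connect()

println "ratelimit-limit:         ${conn.getHeaderField('ratelimit-limit')}"
println "ratelimit-remaining:     ${conn.getHeaderField('ratelimit-remaining')}"
println "docker-ratelimit-source: ${conn.getHeaderField('docker-ratelimit-source')}"
```

Run from a pod inside the cluster, docker-ratelimit-source should show the single NAT'd public IP that all the agents share, which is exactly the bucket being exhausted.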
Longer term, the real question is whether we actually need that much auto-scaling capacity, because it costs too much. We are currently working on the AWS cost, because most of the 3,000 bucks per month that we stopped paying on Azure has simply moved to Amazon. So the question is more: do we need that much? We know that we can auto-scale. The next step will be to add more sponsored capacity: we have DigitalOcean and Scaleway waiting for us, even if those are two-node clusters, and we have the two OSUOSL machines that could be revamped as a Kubernetes cluster. I would not invest more time in scaling the AWS cluster, because we don't want to rely on a single provider; we can have a bunch of tiny clusters instead of one big one that scales, especially since we have static machines and sponsorships.

The second thing is that most of the build peaks these days come from specific builds like the BOM. In the case of the BOM, all the pull requests are rebuilt once per week, which means that every weekend we have a peak of 600 builds waiting, and most of the time it's not even needed. So another solution is to simply disable the weekly rebuild of the pull requests and only keep the master branch on a schedule, because a pull request build can always be re-triggered on demand; see the sketch further below.

So these are the next steps: use different sources for the Docker images to avoid rate limiting; add more clusters so the containers come from different clouds, so that if one cluster breaks, like mine did this morning, we can still handle the container workload; and finally, disable the builds that are not needed. I propose that we work on these three tasks in the upcoming weeks and then do a retrospective in one month to see whether we managed to decrease the cost and improve the quality of service.

Considering that someone can simply re-run the checks from a PR, isn't it better to stop rebuilding PRs all the time? If someone wants to work on an old PR, they can just ask to re-run the checks. And I'm sure there is a five-year-old issue from Jesse Glick about disabling rebuilds for pull requests. It might be about disabling builds for pull requests entirely, which would be a bit too extreme in my view, but at least ensuring that the pipeline library and the Jenkinsfile of every build are kept up to date could be a great help in reducing the pressure on CI. And I think we're justified purely from a cost perspective, right? We've got to bring costs down, and that's an obvious cost we need to reduce. So it sounds like we have an agreement that we'll pursue that objective.
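To make the scheduling idea concrete, here is the common declarative-pipeline pattern; this is not the actual BOM Jenkinsfile, just a sketch assuming the main branch is named master. In a multibranch project the triggers block is evaluated per branch, so giving pull request branches an empty cron spec removes their weekly rebuild while leaving on-push builds and manual re-runs untouched.

```groovy
pipeline {
    agent any
    triggers {
        // Register the weekly cron only for master; every other branch,
        // including PR branches, gets an empty spec and thus no schedule.
        cron(env.BRANCH_NAME == 'master' ? '@weekly' : '')
    }
    stages {
        stage('build') {
            steps {
                echo "Building ${env.BRANCH_NAME}"
            }
        }
    }
}
```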
Now, I did have contact with Gradle Enterprise, and they have a test optimization technique that I'm going to explore with them briefly next week or the following week, just to see if it could reduce the time we spend running tests on things that are unchanged from one commit to the next. I don't know if it will work; Jesse Glick was skeptical and thought it may not actually help us.

Okay, but it's worthwhile to investigate anyway, because it could be another improvement on top of the others. If we already stop rebuilding PRs that haven't been updated, that alone will reduce a lot; it's still good to have both. Please note that the Gradle Enterprise stuff comes from the new Gradle server, which is the same idea as their brand-new Maven caching server, as I understood it. Most of the time that kind of thing is nice for the developer experience on your own machine, but on CI it doesn't change much, because we have big beefy machines and a nice network compared to developers. Most of the time it's local caching with a daemon running on your developer machine that Gradle or Maven connects to in order to speed up the build, which doesn't work for us out of the box because we have ephemeral agents.

Right, and they described that in their DevOps World presentation and said they had something for it, but until we've investigated, I don't know if it will help us at all. They do have stuff for remote build caches, and the daemon can run on CI as well, but it doesn't really help with ephemeral agents. We stopped using the daemon on CI at work a long time ago, just because it kept crashing and causing problems. It's probably better now, and they do say you can leave it on, but even one crash in every 50 builds was enough for us to turn it off. A lot of work has certainly been done on it since, and remote build caches and such are supposed to make it easier. But is this just for Gradle builds? No, they claimed it also works for Maven builds; I wouldn't be interested if it were just for Gradle, because there's just not enough Gradle in our infrastructure to care. But again, it has to be evaluated. Tim, you've already got Gradle Enterprise running at your employer? No, just regular Gradle, but we use Gradle for everything. Got it. So I may beg for your help after I've done some initial experimenting, just to understand it. They seemed very interested in working with us and getting involved with the Jenkins project.

Okay, thanks. The next topic is the recent Let's Encrypt root certificate change. Nothing new there in itself; they finally retired the old root certificate. We discovered that it affected us anyway, not because of the certificates we were using, which were already signed with the new root, but because our configuration was enforcing the wrong root certificate. That hit us last weekend, and we fixed it Monday morning. We still have one action left on the Fastly account, because apparently one of the certificates we use in Fastly is still signed with the old root certificate; I received a notification by email, so that's something I have to look at. Otherwise, that's the only thing that affected us, even though we use Let's Encrypt almost everywhere in our infrastructure, and everything went well.
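If we ever need to double-check which chain an endpoint serves after such a change, something like this Groovy sketch prints the certificate chain a server presents; the host below is just an example. The issuer of the last certificate printed tells you which root the chain hangs off, and a handshake failure here is itself a signal that the served chain is wrong.

```groovy
import javax.net.ssl.SSLSocketFactory

// Print the certificate chain a server presents during the TLS
// handshake, to verify which root it ultimately chains to.
def host = 'pkg.jenkins.io'   // example host, adjust as needed
def socket = SSLSocketFactory.default.createSocket(host, 443)
socket.startHandshake()       // throws if the chain is not trusted locally
socket.session.peerCertificates.eachWithIndex { cert, i ->
    println "#${i} subject: ${cert.subjectX500Principal}"
    println "#${i} issuer:  ${cert.issuerX500Principal}"
}
socket.close()
```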
Another funny story from yesterday: the next topic is the Rackspace account, where we officially removed the machine. Just for the story, we switched from Rackspace to Oracle Cloud back in August, because we could use the new Arm machines provided by Oracle, and because we don't pay for network bandwidth on Oracle, we were able to reduce the cost from $700 to $25 per month. We stopped, sorry, we stopped using the Rackspace machine at that time; one week ago we shut the machine down, and since we did not discover any issues, we decided to delete it. Then we discovered that we had to follow a security procedure to delete that machine: we had to call a phone number, prove our identity, and we were then redirected to someone else. All that to say that the machine is now officially gone, and KK should stop being billed on that account, which is nice for him. Any questions? Do we have anything left on Rackspace? No, nothing at all.

So that's Rackspace. It was a nice story, because Rackspace sponsored the Jenkins project for a very long time; that machine started, I think, in 2014 with Ubuntu 12. The sponsorship ended last year in March, and because it was a very old machine, it was pretty expensive for us, around $700 per month. Now we don't have anything on Rackspace anymore.

I thought they renewed the sponsorship or something. Did they not end up doing that? They had stopped sponsoring, then renewed us for a period of 12 months, and that renewal ended last March. Yeah, there was a period where they had sponsored us for years and years and then dropped the sponsorship. After, I think, six or eight months, Olivier asked again and they granted another sponsorship, but that one ended and they then declined to renew any further. What was weird is that they did not notify us that they would not renew; I was just collecting the various costs and discovered that we had been paying invoices on the Rackspace account for months, because I review all our accounts every two or three months to see where we stand on cost. So yeah, no notification. But anyway, that's one account less for the project.

Any other topic you want to bring up? Mark, sounds like you have one. So, the Azure dangling VMs; that was, I assume, a cost-management thing, and I wonder if systematically checking for high costs is something we want to consider. In that case it was actually low cost, less than 20 bucks, but those machines were just sitting there, with function names that made it clear they were not used anymore. We do the same on AWS based on cost: I'm currently preparing something for the next weekly about breaking down the AWS cost by the kinds of instances behind ci.jenkins.io, whether it comes from the EKS workload, from the VMs, or from something else. By the end of the month we should see improvements in that area.

Great, thank you. Thanks very much. Last call... no last topic? We're six or seven minutes before the end of the meeting, so that's awesome. Thanks for your time and have a great weekend. Goodbye. Bye.