Okay, so it's recording. Perfect. Everything is up on my screen; moving to the team meeting notes. For information, I try to publish the meeting notes on community.jenkins.io, usually on the Tuesday, just before starting the draft for the current meeting, otherwise I will forget. Let me share the link both on IRC and here in the conversation. So in Zoom, here is the link to the collaborative notes, and on IRC as well. Is everyone there? Can everyone hear me and read my screen clearly? Yes? Thanks.

So let's get started, and everyone, welcome to the Jenkins Infrastructure Weekly Team Meeting. Today we have Mark, Hervé, Stéphane and Damien. Let's get started with announcements. The announcement I had in my draft notes is that today's weekly release went really successfully: you can grab version 2.337 of Jenkins. The Docker image is available, and we should have our infra CI systems upgraded during the upcoming hours. Are there other announcements or elements related to that release worth discussing?

So the Docker images must have become available within the last 45 minutes or so? I think so as well; I haven't checked in the last 45 minutes. When I attempted to run them about 45 minutes ago, they weren't there. Okay, so I'll keep polling, thanks. No other announcements? Okay. If the image is not available, it's because I might have concluded too quickly that it was, based on some pull requests. I don't see it yet, but I will continue running the release checklist; most of the other items in the checklist are successful. The changelog is visible, for instance. So there's just a bit more still to be done. I'm adding the notes related to the weekly release; I will explain later if we have time, otherwise it will be for another time.

If it's okay for you, let's get started. First of all, big thanks, Stéphane, for taking care of ci.jenkins.io and the issue raised earlier today, where the Maven container agents weren't starting at all. We had a build queue with a bunch of jobs waiting to be processed. I'm the root cause: yesterday, while solving other incidents, I upgraded all packages and all plugins of ci.jenkins.io and triggered a reboot. The weird part is that the containers were started and some builds were processed on both Kubernetes clusters afterwards, but this morning it seems no container was spawned at all. There were no errors in the logs, as Stéphane and I checked, so it smells like a race condition somewhere. There were a lot of stack traces, though. By the way, while I'm thinking of it: if you don't mind, could you asynchronously upload the screenshots of those stack traces to the helpdesk issue? There wasn't any sensitive data in the screenshots, so you can totally go ahead. Most of these stack traces were related to missing implementation classes, one time for EC2, one time for classes used in the init Groovy script. The common point was that these classes were provided by plugins, as if some plugins weren't loaded correctly during the classloading phase. So that's really weird. Sounds like the classic "have you tried rebooting it? Are you sure you don't want to reboot?" method. After two reboots, ci.jenkins.io was okay, and the build queue was then processed quite quickly. So thanks, folks, for fixing the issue really quickly.
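(For the notes: a quick way to spot that kind of stuck queue is to poll the standard Jenkins REST API. This is only a minimal sketch, not our actual monitoring; the instance URL and the 30-minute threshold are assumptions.)

```python
# Minimal sketch: flag Jenkins queue items waiting longer than a threshold.
import time
import requests

JENKINS_URL = "https://ci.jenkins.io"  # placeholder target instance
STUCK_AFTER_SECONDS = 30 * 60          # assumption: 30 minutes counts as "stuck"

def stuck_queue_items():
    """Return queue items that have been waiting longer than the threshold."""
    response = requests.get(f"{JENKINS_URL}/queue/api/json", timeout=30)
    response.raise_for_status()
    now_ms = time.time() * 1000
    return [
        item for item in response.json().get("items", [])
        # inQueueSince is a Unix epoch timestamp in milliseconds
        if now_ms - item["inQueueSince"] > STUCK_AFTER_SECONDS * 1000
    ]

if __name__ == "__main__":
    for item in stuck_queue_items():
        print(item.get("task", {}).get("name", "?"), "-", item.get("why", ""))
```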
We might want to operate on ci.jenkins.io: first a restart of the container, then a reboot of the VM, to see if the problem happens again and whether we can capture any error logs if we can reproduce the issue. If it's a race condition it might be random, but yes, there is still an error whose cause we don't know and cannot conclude on. It smells like upgrading all the plugins yesterday could have triggered it indirectly, but we cannot be sure. The reason, at least the main reason, for that upgrade was that yesterday we had the Puppet run failing for ci.jenkins.io.

The issue was not on the Puppet agent machine, but in the fact that the Puppet agent asks the Puppet master to take a historical backup of the configuration files that changed. Each time there is a change on the GitHub repository, the agent pulls the files from the code repository through the Puppet master, and if a change is detected, the Puppet agent informs the Puppet master, which takes snapshots of these different versions. So we can always roll back, even if we lose the GitHub source. We are limited by the space on the Puppet master, of course, and it rotates: it should not use more than 10 gigabytes on that machine for that specific feature, so we won't fill the hard drive. But the snapshot backup failed on the Puppet master because the location it was trying to back up to was owned by root instead of the puppet user. The directories with the bad permissions were four years old, so I have no explanation why these directories were suddenly used, but it was the Puppet filebucket: there were five levels of subdirectories, each named with a single character. It's like a content-addressed cache laid out as a tree, and I assume we landed on a bad branch of the tree. So I recursively applied the correct ownership and permissions, which fixed the issue, and I audited the files so every one of them now has the correct permissions and mask. Everything has been written down in the issue; if it happens again, we'll have that to check. I took the opportunity to tackle this one because not everyone is at ease with the Puppet master. I dream of getting rid of that virtual machine, which is only partially automated, so it's a kind of Frankenstein. But right now I didn't want to bother all of you with that part.

We also had an issue with the VPN. The server-side certificate for the VPN expired on the 26th of February. It seems both Olivier and I were missing a calendar event to remind us a few days or weeks ahead. We spent the day on it because we were also missing documentation on how to regenerate that certificate. It's very easy and well documented how to generate a client-side certificate, but not the server-side one. In our attempts, with the help of Stéphane, we were both able to generate certificates, but classical certificates like the ones we would manually put on an Apache or nginx server. In fact, we were missing some specific OpenSSL extended attributes expected by OpenVPN. Thanks to Olivier, who jumped on a call to help us and pointed us to the right location. So it seems it was not as easy as he and we were thinking, but we were able to make it a success, and it has been documented. Unless I missed something on the to-do list yesterday; don't hesitate, Olivier and Stéphane, if you see anything missing in the new doc, I might have missed a small part. But now we are sure that we know how to do it.
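(For the notes, since the server side was the undocumented part: here is a minimal sketch of the attributes in question, using the Python cryptography library. The hostname is a placeholder and the certificate is self-signed only to keep the sketch short; the real one is signed by the VPN CA, and the actual low-level commands are in the new doc.)

```python
# Sketch: the server-side extensions OpenVPN expects on its certificate.
import datetime

from cryptography import x509
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import rsa
from cryptography.x509.oid import ExtendedKeyUsageOID, NameOID

key = rsa.generate_private_key(public_exponent=65537, key_size=4096)
name = x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, "vpn.example.org")])

cert = (
    x509.CertificateBuilder()
    .subject_name(name)
    .issuer_name(name)  # self-signed here only to keep the sketch short
    .public_key(key.public_key())
    .serial_number(x509.random_serial_number())
    .not_valid_before(datetime.datetime.utcnow())
    .not_valid_after(datetime.datetime.utcnow() + datetime.timedelta(days=365))
    # The part we were missing: without the serverAuth extended key usage,
    # clients configured with "remote-cert-tls server" reject the certificate.
    .add_extension(
        x509.ExtendedKeyUsage([ExtendedKeyUsageOID.SERVER_AUTH]),
        critical=False,
    )
    .add_extension(
        x509.KeyUsage(
            digital_signature=True, key_encipherment=True,
            content_commitment=False, data_encipherment=False,
            key_agreement=False, key_cert_sign=False, crl_sign=False,
            encipher_only=False, decipher_only=False,
        ),
        critical=True,
    )
    .sign(key, hashes.SHA256())
)

with open("server.pem", "wb") as f:
    f.write(cert.public_bytes(serialization.Encoding.PEM))
```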
We have the low-level commands. We could still improve the easy-vpn Go CLI to take care of that specific server-side case, but that's a minor point. And I made sure that both Olivier and Stéphane have access to the secrets involved, and everything has been written down. So sorry for the inconvenience to anyone impacted; it has been fixed, and the calendar is now up to date with all the certificates and associated elements, for the VPN at least, or at least all the ones whose dates we were able to get.

There was also an issue on trusted.ci.jenkins.io that caused a delay in running the update center JSON updates and the Repository Permission Updater, which is in charge of setting the correct release credentials for contributors. The jobs were delayed by two days. There was a queue of 40 builds waiting; half of them were only infra reports, more on that later. However, a user had issues, so we opened an issue and traced the root cause: trusted.ci wasn't able to spawn virtual machine agents on Azure because the credential used for Azure had expired, on Monday. So it was cleaned up; you can see the post-incident notes below in the meeting notes. The credential was rotated and updated on trusted.ci, and everything went fine again. We rebooted the virtual machine to force the JVM to garbage-collect the hard way, and the user confirmed that their issue was resolved. It took less than 40 minutes, so a good one.

About the expiration topic and team improvements: I tried to fix my failure over the past months to add these expiration reminders. I'm sure we are missing some, so I'm already sorry in advance for the upcoming ones. But we should be strict, and I try to be strict myself: each time I see that kind of incident, I immediately add a notification to the team shared calendar with a link to the helpdesk issue. My notes are not always complete, but at least you have dots; you can connect the dots next time it happens, if I'm not available, or for myself. It has been done at least for all the Azure credentials used by Jenkins instances to spawn Azure Container Instances (ACI) or Azure virtual machines. I took the opportunity to remove all the applications with legacy pattern names like CodeValet or rtyler-something, and all the Azure applications that were more than two years old with secrets expired for two years or more, which means these applications weren't used, so let's close them. There are still one or two that I would like to audit properly, but I made sure there are no more expired credentials for service principal applications.

And we spent some time this morning making sure that Stéphane and Hervé have the same rights as I do. We need to finish that part on the Azure portal, so that one of them, meaning not me with my full admin rights, is able to rotate the last sensitive credential in that area, which is the infra.ci Azure Packer credential used to build virtual machine images with Packer. That one expires in two weeks. The calendar has been updated with a notification today, and the goal is that Hervé or Stéphane rotates the credential and updates the dates, for the sake of knowledge sharing and to ensure that they have the same permissions on Azure as I have. Did I miss something, folks, from the discussion this morning? That's exactly it? Good.
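(For the notes: a rough sketch of how such an expired-secret audit could be scripted against the az CLI. This is not our actual audit; it assumes a recent az CLI, logged in, where `az ad app list` returns Microsoft Graph field names such as passwordCredentials.endDateTime.)

```python
# Sketch: list Azure AD applications whose client secrets are already expired.
import json
import subprocess
from datetime import datetime, timezone

raw = subprocess.run(
    ["az", "ad", "app", "list", "--all", "--output", "json"],
    capture_output=True, text=True, check=True,
).stdout

now = datetime.now(timezone.utc)
for app in json.loads(raw):
    for cred in app.get("passwordCredentials", []):
        end_raw = cred.get("endDateTime")
        if not end_raw:
            continue
        end = datetime.fromisoformat(end_raw.replace("Z", "+00:00"))
        if end < now:
            print(f"{app['displayName']}: secret expired on {end:%Y-%m-%d}")
```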
One thing we noted: we can't assign a role to groups. We have to assign roles directly to users, since assigning a role to a group, and managing the relation between group and user instead of user and role, is only possible with a premium Azure Active Directory subscription. But since we don't have that many people in this Azure directory, I don't think it's really a problem. A bit more manipulation, but not that much. And Damien also mentioned a Terraform provider to deal with the Azure directory, so we could have this as configuration as code someday; it seems Terraform might be able to manage this part.

And we enabled and enforced multi-factor authentication for any user that needs to access the Azure portal. Yep, go ahead. We discovered, while adding MFA to my account, that I was still able to log in with only my email and password: you have to enforce MFA, not just enable it, for the account to actually require it. There is documentation on Microsoft Azure explaining how to enforce it per user, which is the easiest and the safest. However, we should eventually enforce globally that everyone trying to connect to our portal must have MFA enforced. The thing is, it's risky, because we might lock ourselves out. The documentation explains how to do it on a test subgroup of users before generalizing, and we wanted to do it step by step. At least right now, every user is enforced. Maybe per user is fine, like we did, especially if Terraform allows us to manage that part. Thanks for the reminder and the help.

Next topic: DigitalOcean. Hervé, can I let you describe where we are, what has been done, and what has to be done? The implementation is done. The cluster is added to the ci.jenkins.io jobs alongside the EKS cluster on AWS. What we still need is to be able to measure costs; I'll have to ask Kevin, or someone who has asked DigitalOcean, why we don't see the billing page in our account. There is some documentation to be completed. For the sponsoring, we should add a mention on the sponsors page and maybe write a blog post to announce it more globally. I wrote the announcement on the jenkins.io dev mailing list yesterday. So that should do it for DigitalOcean.

Great job. Seems like the cluster is absorbing a lot, especially when the build queue suddenly receives a wave of builds. So that means it works very well and transparently, at least from what we can see. Maybe some users don't think so; in that case, please report any issues. Second, just a reminder: I don't know if everyone received the email from DigitalOcean yesterday. Since we enabled automatic patch upgrades for the cluster, they sent us an email saying that in seven days it will be patch day, so our DigitalOcean cluster should be upgraded in six days now. It's a patch of their own Kubernetes version, not an upstream Kubernetes bump. The policy is almost the same as Azure's: as soon as they have validated a patch version of Kubernetes, they publish it as their own version. But sometimes they also backport fixes onto the current version if they can, especially for security issues, which is really practical; that's why it's enabled. And it should not cause any outage unless something breaks seriously on their side, because, as was pointed out, there is what is called a surge upgrade: they start by allocating a set of machines on the new version, then they migrate our workloads, and then they remove the old ones.
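(For the notes: the two settings mentioned here, automatic patch upgrades and surge upgrades, are plain cluster attributes in DigitalOcean's API. A hedged sketch below, assuming DO's documented Kubernetes cluster endpoints; the cluster ID and token are placeholders.)

```python
# Sketch: enable auto patch upgrades and surge upgrades on a DOKS cluster.
import os
import requests

DO_API = "https://api.digitalocean.com/v2"
CLUSTER_ID = "<our-doks-cluster-id>"  # placeholder
headers = {"Authorization": f"Bearer {os.environ['DIGITALOCEAN_TOKEN']}"}

# Read back the current cluster settings first.
resp = requests.get(f"{DO_API}/kubernetes/clusters/{CLUSTER_ID}",
                    headers=headers, timeout=30)
resp.raise_for_status()
cluster = resp.json()["kubernetes_cluster"]

# auto_upgrade: DO applies new patch releases of their Kubernetes version;
# surge_upgrade: new nodes are created first, workloads migrated, old removed.
payload = {
    "name": cluster["name"],
    "auto_upgrade": True,
    "surge_upgrade": True,
    "maintenance_policy": cluster["maintenance_policy"],
}
update = requests.put(f"{DO_API}/kubernetes/clusters/{CLUSTER_ID}",
                      json=payload, headers=headers, timeout=30)
update.raise_for_status()
```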
And overall we pay way less than what we really consume. During the surge we pay a bit more, but that's worth it to avoid an outage. And if it doesn't work, we still have EKS handling the builds. So again, really great job on that part, and thanks for that. Now, time to write about it; that's the remaining part. Is there any question, or things that we forgot, or things to do on DigitalOcean?

Last week we had a request from the security team to add a Windows agent on cert.ci, the instance they use for security purposes, not used for any release: that's their own CI. That instance is managed like trusted.ci and ci.jenkins.io. Stéphane helped and took the lead on that topic, so thanks a lot. That topic went clearly beyond its initial scope, for numerous reasons. The first one is something that neither they nor we are able to explain: as soon as the LTS update was applied and the Jenkins instance restarted in its container, the clouds markup in the config.xml was removed. So all their manually managed cloud configuration was deleted. I should have been more careful: Stéphane asked me and I did it like a cowboy, so I'm sorry for that. I should have backed up the config.xml; I did not expect that to happen. So, lesson learned for next time, and sorry for that. The thing is, we had to recreate the templates. It went from "add a new template" to "oh, maybe we should reconstruct what the previous settings were", based on watching their pipelines and asking them questions.

However, it was a good opportunity for Stéphane and me to share knowledge on how all the VM agent template work is done; we were able to cover a lot of elements. So it was good for team bonding and team knowledge sharing. It was also a good opportunity, thanks to a bunch of ideas that Stéphane put forward, to propose, let's say, an improvement in the way we manage labels and tools. There should be a formalization of this proposal, to be applied to the ci.jenkins.io agents if it's okay for the end users. The idea is to keep the dimensions expressed through the agent labels to kernel-related elements: for instance the operating system, Linux or Windows; the cloud; the kind, meaning is it a virtual machine or a container? All of these elements are kernel-related. The borderline case is Docker: Docker requires specific kernel support most of the time, so Docker stays in that area. Then tools like the JDK or Maven can be installed quite easily, so they should be managed by default by the Jenkins tools system, at least on ci.jenkins.io and the standard instances; what I'm saying might not be applicable to infra.ci, that's another topic.

And thanks to what you showed me, Mark, a few weeks ago, we were able to define a kind of logical pattern for the tools, because a few months earlier we were blocked on how to express, depending on a label like the operating system, how these tools should be installed. Second point: I didn't know that we could avoid downloading the tool and instead just provide a shell script and set a variable, and that's all. With that new pattern, the tools try to use the locally installed JDKs, because all of our Linux and Windows virtual machine templates provide three JDKs: 8, 11 and 17. Why download, during builds, a tool whose provenance we cannot control and check? It slows things down, and in terms of security it's not really nice. So instead we try to use the local one. However, we can still use fallbacks, to download or to use other paths, for the machines that are not managed by our templates, such as the PowerPC machine, for instance.
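(For the notes: not an implementation, just a sketch to make the proposed convention concrete: labels carry only kernel-related dimensions, and tool locations are resolved locally first with a download fallback. All the paths and names here are hypothetical.)

```python
# Sketch: kernel-related agent labels, plus local-first JDK resolution.
from pathlib import Path

def agent_labels(os_name: str, cloud: str, kind: str) -> set[str]:
    """Kernel-related dimensions only, e.g. {'linux', 'azure', 'container'}."""
    return {os_name, cloud, kind}

# Hypothetical install locations, mirroring "our templates provide JDK 8/11/17".
TEMPLATE_JDK_HOMES = {
    ("linux", "8"): "/opt/jdk-8",
    ("linux", "11"): "/opt/jdk-11",
    ("linux", "17"): "/opt/jdk-17",
    ("windows", "17"): r"C:\tools\jdk-17",
}

def jdk_home(os_name: str, version: str) -> str:
    """Prefer the locally installed JDK; fall back to a download-based path
    for machines not built from our templates (e.g. the PowerPC agent)."""
    local = TEMPLATE_JDK_HOMES.get((os_name, version))
    if local and Path(local).exists():
        return local
    return f"/srv/downloads/jdk-{version}"  # hypothetical fallback location

print(agent_labels("linux", "azure", "vm"))
print(jdk_home("linux", "17"))
```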
So: convention plus fallback system. And since we use Puppet templating or Helm templating, we can cover all the corner cases. Better: we duplicated things into the templates jdk8 and jdk-8, so we add value for the end users; if they make a typo on that jdk label, it will still be covered with exactly the same definition. We did that on a limited scope, which is cert.ci. Based on that experience, we can clearly propose a new policy for ci.jenkins.io in the upcoming weeks. That would simplify, let's say, most of our cases, and it will be easier for end users. We could document it, or automate the documentation. So even though things went quickly out of scope, it was clearly a great learning session, and it's a great opportunity to improve the CI service we provide for contributors. And the security team agreed to update their pipelines accordingly to use them. That's cool; that's really cool. Less work for us; sorry, security team, if you hear that. But they volunteered and told us it's okay for them. They haven't started to use the Windows images yet, but I hope it will be useful for them. Thanks, Stéphane, for the guidance and for leading the topic. Great work on that one.

So, Damien, I added some notes there about implied labels: our duplication of jdk8 and jdk-8 could actually be resolved by the Implied Labels plugin, which lets us say that one label implies another. However, it's not configurable as code with JCasC yet, and so that gets in our way. I think the technique you used is the most configuration-as-code, so let's continue forward with that; I like it. Let me also add that we could ease the automation of labels for agents with another plugin Mark showed me, the Platform Labeler plugin, which by default defines labels with spaces but can be used with dashes like what we do. It can generate the matrix of all the cases based on the properties of the agents themselves, which is really useful. If we need it, the tool is available.

About labels: I remember JC commented somewhere that the Jenkinsfiles of the Jenkins repo could have the maven label renamed to maven-8, so we wouldn't need the if-else in our shared pipeline to take the maven-only label and get maven-8 everywhere. Let me add the comment from JC; thanks for that reminder. I have it in my favorites because that's one I want to take on. The status right now is that we already have maven-8 defined alongside maven for the containers; we need to check if it's also the case for the virtual machines. It isn't: I saw that this morning and I was about to add it. Okay, so the VMs are still to do. Once it's done, that's a pipeline library change, which should be transparent for our end users. And one week after changing the pipeline library in production, we can track the remaining jobs using the legacy maven label before deprecating it, and we can change them. Thanks for that reminder, which makes sense, because later we might have maven-17 and so on.

And what about the infra reports? They have to be migrated from trusted.ci to infra.ci. Hervé and I split the workload on that one. There are two matters. The most important one is to switch the GitHub bot user to a GitHub App and fix the permission issues, which relates to issue 2788 raised by Raul. We have to work on that part to avoid authenticating to the jenkinsci organization on GitHub with a technical user, and to use a GitHub App instead, which means changing the way we retrieve the token. That part is handled by Hervé currently and is a work in progress.
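(For the notes: the GitHub App flow we are moving to works roughly like this; a sketch assuming PyJWT and requests, with placeholder app ID, installation ID and key path. Unlike a bot user's personal token, the installation token is short-lived and scoped to the installation.)

```python
# Sketch: exchange a GitHub App's JWT for a short-lived installation token.
import time

import jwt  # PyJWT
import requests

APP_ID = "123456"            # placeholder GitHub App ID
INSTALLATION_ID = "7890123"  # placeholder installation on the jenkinsci org
PRIVATE_KEY = open("app-private-key.pem").read()  # placeholder key path

# 1. Sign a short-lived JWT identifying the GitHub App itself.
now = int(time.time())
app_jwt = jwt.encode(
    {"iat": now - 60, "exp": now + 9 * 60, "iss": APP_ID},
    PRIVATE_KEY,
    algorithm="RS256",
)

# 2. Exchange it for an installation token scoped to that installation.
resp = requests.post(
    f"https://api.github.com/app/installations/{INSTALLATION_ID}/access_tokens",
    headers={
        "Authorization": f"Bearer {app_jwt}",
        "Accept": "application/vnd.github+json",
    },
    timeout=30,
)
resp.raise_for_status()
token = resp.json()["token"]  # expires after one hour, unlike a bot user token
```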
On my side, since that job needs to run regularly and trusted.ci is facing peaks, as recommended by Daniel, I'm working on migrating the job to infra.ci instead, which is better. That also allows us to save some money, because parts of it can be long-running: using pods, with virtually infinite capacity adapted to the right sizing, on infra.ci would help us, and it wouldn't block other builds on trusted.ci.

Bumps and bruises: I won't go into details unless you want to discuss it, but we have reached a point where the daemonless, rootless img tool, used to build Docker images without Docker in Kubernetes pods, isn't working anymore. So either we change tools or, thanks to the work that Stéphane did, we could switch entirely to a fully fledged Docker Engine running on ephemeral virtual machine agents on infra.ci. Right now we have the same setup for infra.ci, release.ci, ci.jenkins.io, trusted.ci and now cert.ci. So let's use Docker, let's use the full-featured engine, and stop having edge cases. And that part would unblock executing the infra reports on infra.ci, because they need specific Docker images that we cannot build for now.

It's four past four, so is there any other priority topic? The rest are minor, so we can delay them to next week unless they are important. Yes, the notifications one can be delayed, it's okay. iptables: it's done, I've closed the issues. I have seen the notification on the Puppet agent topic, and it works, thanks Hervé. The missing piece was the Puppet master, which is half automated, half manually managed, and we had to get Puppet to run the Puppet agent on that machine as well. I have to drop. See you later, Mark, thanks. The Docker image is up on Docker Hub: two, three, seven; sorry, 2.337. Nice, thanks. Okay, I propose that we delay the rest to next week; the IRC topic is done. If it's okay for you? Perfect. Okay, let me have a sec. Let me stop the recording, stop sharing, and we're done.