Okay, hello everyone. Welcome to the Jenkins Infrastructure Meeting. We are the 28th of March 2023. Today, around the table, we have myself, Damien Duportal, plus Stéphane Merle, Bruno Verachten, Kevin Martens, Hervé Le Meur, and Mark Waite.

Let's start with the announcements. The weekly release 2.397 is currently being packaged. The release and the signing of the war went perfectly fine. The packaging steps, though, ran into failures, and we are currently watching the last builds. We had to fix some elements: some issues came from Puppet management changes that I did last week, and in some places we under-estimated the number of free IPs in a subnet where we are spawning new instances. Overall these are just minor hiccups that were a bit stressful for us, but everything is going really well. The new release.ci controller on the new private cluster is behaving as expected, with agents on their own subnets, different from infra.ci. So we are in a very good state. Nice job. I hope we should be able to finish the packaging after this meeting, if it's not already done. When I started the Zoom call, we were waiting for a Windows agent to pull another Microsoft Windows Docker image, and that takes 10 to 20 minutes, so we have plenty of time. It's synchronizing now. So it's going in a good direction. We could still have issues at the end of the packaging, but the critical part, which is the release itself, has been done successfully. I haven't checked the war signature, though; we will see in a few minutes. The war signature should be fine.

The war signing is still using the DigiCert signing certificate, right? So it'll be valid only for a few more days. Yes, Maven is fed both the GPG key and the DigiCert certificate, and it uses both. I see. And could someone launch the Docker container image build? It's far enough along now, or maybe it's already been done. We have to wait for the packaging steps, if I'm not mistaken. Oh, I thought it only depended on the war file being published to Artifactory. Yeah. Can I ask someone with access to trusted.ci to run the master branch of the controller job? Yes. Thanks, folks; it just makes it faster for me to get started on the tests.

So, 2.397 is not yet published as a container image. Packaging is hitting a few minor hiccups, delaying it by only one or two hours. We have fixed the issues. Next steps: Docker packaging and the last checklist items later today. Just to be sure, it's the master branch of the controller job? That's the one. Yes, thank you. Is there anything else about the weekly release? Nice. Can I ask someone to help me on the notes, with the blog post instructions? I consider the announcement part of these notes as another way to communicate this to users.

Okay, the new GPG key, that's the second announcement. We have a new GPG key, valid for three years, that will sign the new Jenkins releases. The weekly today used that new GPG key, and the next LTS next week will use it as well. There is no problem with importing that new key already today: when you upgrade to the new weekly, or to the new LTS line next week, it will automatically pick up the new key ID. There is a blog post, linked in the notes; please read it and follow the instructions there.
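Note: a minimal sketch of importing the new key by hand, in Python for consistency with the other sketches in these notes. The key URL follows the new "-2023" file-name suffix mentioned later in this meeting, so double-check it against the blog post before relying on it.

```python
#!/usr/bin/env python3
"""Import the new Jenkins release signing key and show its fingerprint."""
import subprocess
import urllib.request

# Assumed location of the new key, following the "-2023" suffix naming.
KEY_URL = "https://pkg.jenkins.io/debian/jenkins.io-2023.key"

# Download the ASCII-armored public key.
key = urllib.request.urlopen(KEY_URL).read()

# Import it into the local GPG keyring.
subprocess.run(["gpg", "--import"], input=key, check=True)

# List matching keys so the fingerprint can be compared with the one
# published in the announcement blog post.
subprocess.run(["gpg", "--fingerprint", "jenkins"], check=True)
```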
Do you have other announcements, folks? Yes, go ahead. It's the last day to nominate and vote for your favorite Jenkins contributor. Absolutely, I had forgotten about the nominations and the vote, thanks. No other announcements? Okay.

Upcoming calendar: the next weekly is next week as usual, that will be the 4th of April. The new LTS will be the 5th of April. Is that correct? Yes, already written down. Thanks. I'm trying to homogenize the dates to the American standard. So the next LTS will happen Wednesday next week; it should use the new GPG key and, I hope, the new DigiCert certificates.

I haven't seen any announcement of security advisories. Let's check all together. The last one was the 21st of March. So none.

Next, major events: we have Devoxx France in Paris, April 12 to 14. Do you have other major events? We had some last week, but I don't remember. Nope. Okay. Anything else to add, or can we proceed to the tasks? One, two, three. Okay, let's start.

So, what were we able to achieve this week? First, thanks for taking care of the maintainer request about the Gradle plugin. There was an issue about a dependency in the build of a plugin; thanks for managing that part. The summary: this plugin was using a dependency from a remote Maven repository which is not mirrored inside repo.jenkins-ci.org; it's an external one. So, since we now put everything behind the artifact caching proxy (ACP), the build failed as expected, because the proxy wasn't able to get that dependency. If I understand correctly, as a short-term fix to allow the contributors to cut a release and fix the builds, you added an exception in the pom.xml with the ID they use for that repository, so the ACP is bypassed: the builds hit that repository directly, allowing them to fetch the dependency.

There is a question on the mailing list: should we add that external repository as a mirror inside JFrog, so we can then remove that exception? Or, since it's a Jackson API dependency, should they switch to the newer groupId and artifactId, which might or might not be available through JFrog? I don't know. But we are at the point where we have a plugin that depends on an external repository, using an external dependency that we don't control. So there is a point here: we should follow up with these maintainers to avoid any external repo. Does that make sense for everyone? Okay, so let's follow up on the mailing list if that's okay for everyone. I didn't check whether we got an answer, but that should be the follow-up, because this starts to move outside the infra scope, unless we need to add the mirror in JFrog.
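Note: to make the trade-off concrete, here is a minimal sketch that checks where a Maven artifact can be resolved from. The artifact coordinates and the external repository URL are placeholders, not the actual ones from the Gradle plugin issue.

```python
#!/usr/bin/env python3
"""Check whether a Maven artifact resolves through the Jenkins mirror
(repo.jenkins-ci.org) or only from an external repository."""
import urllib.error
import urllib.request

ARTIFACT = "com/example/some-lib/1.0/some-lib-1.0.pom"  # hypothetical coordinates
REPOS = {
    "jenkins mirror": "https://repo.jenkins-ci.org/public/",
    "external repo": "https://repo.example.com/maven2/",  # hypothetical upstream
}

for name, base in REPOS.items():
    request = urllib.request.Request(base + ARTIFACT, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            print(f"{name}: HTTP {response.status}, resolvable")
    except urllib.error.HTTPError as err:
        print(f"{name}: HTTP {err.code}, not resolvable here")
```

If the mirror check fails while the external one succeeds, the dependency only exists upstream, which is exactly the situation that forced the pom.xml exception.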
We were able to close the issue about applying to the Docker-Sponsored Open Source program. Just a side note: I still sent an email to Docker, because the jenkinsciinfra and jenkins4eval Docker organizations were expected to already be part of that program for a year now. I'm not sure if they forgot or if there was a misunderstanding. We closed the issue because Docker went back and finally cancelled the deprecation of the free team plan, which these two organizations, and the legacy jenkinsci organization, are using. Anyway, I will keep pushing Docker to see if they can move these organizations under the OSS program, because in terms of security management that would allow us to grant more than three administrators, and it should also avoid rate limiting for these images, which is quite useful. Those two organizations could be moved, as an alternative.

The good thing about that issue is that we raised the question: maybe we should switch the images to the GitHub Container Registry, or another one. It depends on the main question: what kind of role-based access and permissions do we want? What kind of RBAC pattern do we want? The problem with a Docker organization is that we don't have per-repository, per-image scoping: we cannot separate concerns, while we can with GitHub access. So no, the issue is closed, no action expected. The rest will be Docker putting them on the OSS program, and we can think about the repository pattern later.

Next issue: the update center job is failing. That was an old issue. We had to put as code some changes, in particular the fact that the packaging machine has some untracked dependencies. For instance blobxfer, which is a command-line tool used to synchronize the plugin releases from that machine to another location. That tool is a kind of rsync, but for Microsoft Azure Blob storage. And that tool, even though it could be replaced by the az command line (there is an open issue for that), is still required by the mirrors and the scripts. So we had to move this as code to avoid bad surprises. Why wasn't it as code? Because it's a legacy thing that was done manually at least five years ago, and we have been bitten by this one. Please note that by fixing this, I created issues that we had to fix: we lost the authorized keys on that machine for the mirrorbrain user. I still don't understand why this user is used for the release process, but that broke the packaging process earlier today. We fixed the issue manually on the machine, so expect a new pull request to add the authorized key back as code. We are about to close the issue now that everything has been fixed as code and blobxfer is updated to its latest available version. The next step will be modernization, with either the az command line or a container, or both. No short-term action expected here. Any questions?
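Note: for readers who don't know blobxfer, this is conceptually what that step does: an rsync-like push of local plugin releases to an Azure Blob Storage container. A minimal sketch using the Azure Blob Storage SDK; the container name, local path, and environment variable are assumptions, not the real update-center configuration.

```python
#!/usr/bin/env python3
"""Push a local directory to an Azure Blob Storage container, skipping
blobs that already exist (a crude rsync-style sync, like blobxfer)."""
import os
from pathlib import Path

from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"],  # assumed env var
    container_name="releases",  # hypothetical container
)

# Names already present remotely, so they can be skipped.
existing = {blob.name for blob in container.list_blobs()}

local_root = Path("/srv/releases")  # hypothetical local path
for path in local_root.rglob("*"):
    if path.is_file():
        name = path.relative_to(local_root).as_posix()
        if name in existing:
            continue  # already uploaded
        with path.open("rb") as data:
            container.upload_blob(name=name, data=data)
        print("uploaded", name)
```

The real blobxfer also compares sizes and checksums for changed files; this sketch only covers the "missing remotely" case.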
Next, the Jenkins Jira issue. I don't remember what that issue was. Ah, okay: someone created an issue in Jira and asked for its deletion. Thanks, Mark, for managing this one. I'm glad that the user was willing to just have it deleted; I saw that there were other ways to delete history, but they are much more heroic. Deleting the bug was easy.

We had two issues closed as not planned. One was on the wrong issue tracker, and the other I closed because, after one week without feedback from a user requesting an email change for their jenkins.io account, there was no answer back, no email. So I closed the issue. I never know if it was a naive attempt to hijack an account, or someone who was just fed up and stopped trying to create an account. That might be it; if so, sorry for this, but that's why I close these issues after one week without an answer. Of course, if the user reopens it, we will fix it.

Did we forget something that we were able to close and can now forget about? No? Then let's switch to the work in progress. This week was, yeah, pretty intense.

First, the new GPG key. A few details that we learned along the way. First of all, we must use an RSA key: the newer cryptographic algorithm that we tried for the first attempt at a new key, ED25519, isn't supported on the Red Hat distributions. For clarity, calling ED25519 a new cryptographic algorithm is a bit of a stretch; it's been around for many, many years, but Red Hat has not included it, whether we like it or not. It's hardly a new algorithm in my world, but hey, I understand they don't support it, and so we're stuck. Oh, and to be clear, AWS has only supported it for SSH servers through cloud-init for about a year. It's quite recent, right? Don't get me started on why. Yeah, okay.

So that new key has been added next to the existing current key, which only expires on Thursday, so we can switch back and forth. We have updated the release process properties to go with the new key. Thanks, Alex; thanks, Mark; thanks, Kevin, for working on the different communication channels. We might have forgotten some, but more communication won't hurt here. We have a blog post. The whole package release process should have been updated for the weekly, so we can verify that the new HTML static files are okay. There has been a tweet. We announced it on IRC and Gitter. We have an announcement on community.jenkins.io, and also a pull request on the Jenkins documentation. So we have a lot of communication channels; if you see others, don't hesitate. Thanks for adding it on status.jenkins.io as well; that will stay for one or two weeks, which is also a good idea. After that, I propose we send an email to the developers and infrastructure mailing lists today, after the weekly, if that's okay for everyone. I will take care of that one. We could also add the blog post to the carousel on the website. Oh, good point. I think we could invite Kevin Martens to do that, because Kevin may never have published a post to the jumbotron, so all the better. This sounds like a great opportunity for Kevin. Sorry, I used the word "opportunity" in this case to mean "volunteer opportunity". Yeah, exactly: Kevin is voluntold. That's right.

This issue will be closable after the LTS release has been generated with the new GPG key, and after we add a calendar event in three years, with a six-month reminder, just to be sure that we don't forget it and that we have enough time to make the announcement. And thank you for putting that suffix on the file name, because the next time we can use a different file name, and we can just keep doing that. Thank you. Let's give back to Caesar what belongs to Caesar: I got the inspiration from the Datadog documentation, and I added the link on the issue, because they have a very well-explained process for rotating their package key. They did that last year, and we had to update; I found their instructions pretty clear, and they kept both keys. HashiCorp also did that in 2020. So it makes sense, when rotating keys, to pick a shorter validity: three years is quite enough. One year is a bit of a pain for end users; three years is interesting. And by keeping the two keys in parallel, we get a smooth change for end users. Is there something else about the GPG key? So that one moves to the next milestone, obviously, until it's closable.

Next issue: introduce an artifact caching proxy for CI. It appears that we have issues with the BOM builds, some very weird cases. Since we removed the EC2 agents and decreased the capacity of ci.jenkins.io to run on AWS agents, most of the workload, particularly the BOM builds, is running on DigitalOcean, only on DigitalOcean. These agents use the local ACP that runs inside DigitalOcean, and as far as we understand, almost all the BOM builds using it are failing with a weird error. The errors look client-side, but we cannot be 100% sure. As per Hervé's research, no errors were seen on the server side. So we are trying to reproduce, to be sure it's not that we lost the logs when a pod was terminated, or that something else happened. Hervé's initial research points to the HTTP client in Java, with multiple threads that are not closing the connections correctly at the right moment, combined with the fact that it happens after a long time: the Java process runs for a long while before it starts losing elements. It could or could not be in that direction.
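Note: a minimal illustration of the suspected failure mode, written in Python as a stand-in for the Java HTTP clients actually involved. If worker threads never release their responses, the shared connection pool slowly exhausts, and a long-running process starts failing only after hours, which matches the symptoms described above. The sketch shows the correct pattern; the proxy URL is a placeholder.

```python
#!/usr/bin/env python3
"""Correct connection handling for a shared HTTP session used by many
threads: every response must be fully consumed and closed, otherwise the
pool leaks connections and eventually starts failing."""
from concurrent.futures import ThreadPoolExecutor

import requests

ACP_URL = "https://acp.example.org/maven2/"  # placeholder, not the real proxy

session = requests.Session()  # one shared connection pool for all threads

def fetch(path: str) -> int:
    # The context manager guarantees the connection is returned to the
    # pool; forgetting this is the kind of leak the research points at.
    with session.get(ACP_URL + path, timeout=30, stream=True) as response:
        response.raise_for_status()
        for _ in response.iter_content(chunk_size=65536):
            pass  # consume the body fully before releasing the connection
        return response.status_code

with ThreadPoolExecutor(max_workers=8) as pool:
    paths = [f"com/example/dep/{i}/dep-{i}.pom" for i in range(32)]
    print(list(pool.map(fetch, paths)))
```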
Another thing to check would be whether we can un-cache the artifacts that are in error, but I'm not sure, and I haven't followed up on whether it's always the same artifact failing or a bunch of random artifacts. That could also help. That's the current status right now. There is no blocker, in the sense that the BOM builds can run without the ACP, but it's more than just an annoyance, because that means eating the bandwidth on JFrog directly, which is counterproductive for us. That won't go unnoticed, since I understand that 14 terabytes of bandwidth might magically disappear next month. So that's a top-priority item to fix. Is there something else new on that topic? I don't think you had the time to deep dive, given the weekly release, but I just wanted to ask in case I forgot something. Anything else to add, or is it okay for this one? Okay. Thanks, folks, thanks for the work. I'm adding that issue to the next milestone as usual.

Next issue: add a new private cluster, privatek8s. That one is almost closable, if I understand correctly, because today we validated that the new release.ci Jenkins instance, which was migrated on Friday from the legacy public cluster to the new private cluster, works quite well. There are some clean-up steps remaining, but yeah, that one should be closable during the next milestone. We just validated that it works as expected, and there isn't any other service left to migrate to the private cluster. So once it's cleaned up, we can proceed with the public cluster this time. Great job on this one; that was a long-running task. Any questions, things I could have forgotten? Okay, so let's move that one to the next milestone.

Next issue: EC2 agents not available. What happened is that a combination of misconfiguration and weird behavior of the EC2 plugin, used for spawning virtual machine agents from the Jenkins controllers, created a lot of long-running machines that cost us a lot this month. In reaction, in order to limit the AWS billing, we removed every kind of virtual machine agent from ci.jenkins.io, and we drastically decreased its scaling capacity on container agents, from 150 max pods down to 30. And Stéphane finished earlier today, on infra.ci.jenkins.io, the migration of the virtual machines we use for running the Docker commands: they were using EC2 while the controller runs on Azure. So for that one, we are definitively getting away from EC2. The Linux Intel and Windows 2019 agents are now running on Azure VMs. So thanks, Stéphane. Everything is as code, including the credentials.
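Note: since the goal is to stop using EC2 entirely, a quick sanity check that nothing is still running there. A minimal boto3 sketch; the region list and the default credential chain are assumptions.

```python
#!/usr/bin/env python3
"""List any EC2 instances still running or starting, to double-check that
the move away from EC2 agents left nothing behind."""
import boto3

REGIONS = ["us-east-1", "us-east-2", "eu-west-1"]  # assumed regions

for region in REGIONS:
    ec2 = boto3.client("ec2", region_name=region)
    pages = ec2.get_paginator("describe_instances").paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["pending", "running"]}]
    )
    for page in pages:
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                print(region, instance["InstanceId"], instance["InstanceType"])
```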
The last mile is now studying the possibility of using ARM virtual machines on Azure, because we have been using that kind of instance directly inside EC2 for a year and a half, and Azure has supported them since December. So Stéphane is working on trying to build new images. That might need a bit of configuration on our side, but it should allow us to stop using EC2 at all from infra.ci. That would be a nice thing. Did I miss something, or is there something else to add? Nope. So thanks, Stéphane, for that work. If it's okay for you, can you report on that issue? For me it's closable, because the next step will be a separate issue about the ARM port, which deserves a whole issue of its own. So, if you're okay, add a report there listing which kinds of instances we removed; then we should be able to close that issue, because the initial problem is gone. For me that one can be closed, there is no remaining work, and your mission will be to open a new issue about the ARM migration, of course. Looks good. Is there any question, or something to add that I could have forgotten on that topic? I think it's good.

Next one: out of disk space on the ci.jenkins.io agents for the BOM builds. If I understand correctly, the last steps are adding volumes for /tmp and /home/jenkins on the pod templates, and once that's validated, we should be able to close the issue. Is there something else to add on that one? So I'm adding it to the next milestone. Is that okay? Yes.

Realign repo.jenkins-ci.org permissions: nothing done, still to be done. The expected work is a highly available LDAP, which I wasn't able to work on. I'm adding it to the next milestone.

Next issue: credentials for ci.jenkins.io expired. The goal is to manage as code, in the Terraform Azure repository, all the credentials and associated resources that ci.jenkins.io uses for its agents. While starting to work on this task and importing resources, I nearly broke and deleted ci.jenkins.io, and we got a production issue as a consequence. So I'm sorry for that; I should be more careful, and I will try to not delete ci.jenkins.io next time. Right now, ci.jenkins.io is in an incident: it's not able to spawn virtual machines. We should fix that issue today, after this meeting. One point: it spawns virtual machines inside an Azure subnet which has all the permissions and works as expected, but that subnet is not tracked in Terraform either. So I will try to also add it, once the issue is fixed and production is back. Moving it to the next milestone as usual.
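Note: one way to avoid repeating that near-deletion is to gate every apply on the plan containing no destroy action. A minimal, generic sketch; this is not the actual Jenkins infra tooling, and it assumes the terraform CLI is on the PATH, run from the repository directory.

```python
#!/usr/bin/env python3
"""Refuse to apply a Terraform plan that would destroy any resource."""
import json
import subprocess
import sys

# Produce a plan file, then render it as machine-readable JSON.
subprocess.run(["terraform", "plan", "-out=plan.tfplan"], check=True)
shown = subprocess.run(
    ["terraform", "show", "-json", "plan.tfplan"],
    check=True, capture_output=True, text=True,
)
plan = json.loads(shown.stdout)

# Collect every resource whose planned actions include a delete.
destroyed = [
    change["address"]
    for change in plan.get("resource_changes", [])
    if "delete" in change["change"]["actions"]
]

if destroyed:
    print("Refusing to apply, these resources would be destroyed:")
    print("\n".join(destroyed))
    sys.exit(1)
print("No destroy planned, safe to apply.")
```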
We weren't able to work on the agent instability raised by James Nord. However, he did not answer, so I'm inclined to... well, it needs more diagnosis anyway, given the current state, since we removed the EC2 agents. So if no one objects, I will move it to the next milestone; with no answer from James and no time to diagnose, I will then close it, because it wasn't reproduced. Any objection to moving it, giving us one week to carefully check against the current ci.jenkins.io status? Okay.

Grant limited access to release.ci to some security team folks. That one is partially done, but we still need some work on it. Is that correct? As I understand... sorry, go ahead. It was postponed until we had migrated release.ci. Okay. Can I ask you to report, at least with links to the pull requests or issues, about the VPN access? That was required for Kevin. And we need to check that Yaroslav either already has access to the private VPN, or open an issue for that one. Then we will have an RBAC setup to do on release.ci, but the VPN part was handled by Hervé; that's why I'm asking for a quick report, if you don't mind. We should have the time to finish the permissions part on release.ci this week. Any questions, or things I forgot? Nope.

Next step. Sorry: document the code-signing certificate process and renew the signing certificates. The status: yesterday we received an email from Fatih from the CDF, so it looks like the certificate is being renewed. Thanks, Mark, for putting the emphasis on the fact that we need it as soon as possible. I'm not sure we will have the new certificate before the 30th of March; however, Fatih was positive that we should have it before the next LTS. Ideally, if we can have it for next week, that would be perfect. As a reminder, the impact will be on people using the Jenkins .msi installer on Windows: starting Friday, the 31st of March, they will see an error saying that the installer is not signed by Microsoft or by a trusted developer. And for the people who are checking the war file: it is not only checked through the GPG key for the metadata, it is also signed with that trusted certificate. So if you have that kind of process, you will get errors from Friday on, until you update to a new version signed with the new certificate. These issues also move to the next milestone automatically. Any questions, folks? No.
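Note: for the people verifying the war file, this is the kind of check that will start failing on Friday the 31st until the renewed certificate ships in a new version. A minimal sketch assuming the war signature is verified with the JDK's jarsigner tool; adapt it to whatever verification process you actually run.

```python
#!/usr/bin/env python3
"""Verify the code-signing signature on a locally downloaded jenkins.war
using the JDK's jarsigner tool (assumed to be on the PATH)."""
import subprocess
import sys

WAR = "jenkins.war"  # downloaded beforehand, e.g. from get.jenkins.io

result = subprocess.run(
    ["jarsigner", "-verify", "-strict", WAR],
    capture_output=True, text=True,
)
print(result.stdout)
# With -strict, jarsigner exits non-zero on invalid or expired signatures.
sys.exit(result.returncode)
```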
Okay, so now let's review whether we have new issues that came in since last week and that we haven't triaged yet. I'm opening the list of issues on the helpdesk. Someone named Stéphane Merle created a new issue about adding Terraform roles and permissions for Azure on trusted.ci and cert.ci. As a reminder, that issue is to track what we learn on cert.ci, which we should merge back to trusted.ci, and vice versa. The goal is to work on our Terraform Azure permission model. Stéphane, is that okay if we add this issue to the next milestone? Yes, please. Cool, thanks for opening the issue. I'm not completely sure there are things to change on cert.ci; for me it should be the base model, at least for trusted.ci. Because, if I remember correctly, we did change something last time about the Jenkins resources that could be created, but I forgot what. True that. Okay, so we might have minor changes. Yeah, cherry-pick.

Just a note about the Compuware and BMC plugins removal. BMC did what they had to do with GitHub Trust and Safety, and GitHub confirmed yesterday, really late, that it's finally okay. So Daniel (thanks, Daniel) took back all the plugins and is putting them back into distribution. No action expected for us, folks, but let's keep an eye on it, just in case something goes wrong. The update center is working as expected. So let's continue putting the infrastructure in good shape, and that should be really smooth.

Last week, an issue was created about migrating our Jenkins controllers in Azure from service principals to workload identity, so that they would not require any credentials. That one is in the backlog; I don't expect to be able to work on it next milestone, so I'm just putting it away. Do we have other new issues? No new issues.

If it's okay for everyone, I want to add two issues to the upcoming milestone. The first: sunset the robobutler service. That service ran, I think, on one of the OSUOSL-sponsored virtual machines, where Confluence was running before it went to the Linux Foundation hosting and then was stopped altogether. That machine, it's edamame, sorry, was also hosting the former meeting notes. The robobutler bot was using IRC for both the board and the infra meetings, like the one today. The bots took care of the notes: they were taken on the IRC channels for both kinds of meetings, and then automatically published on an Apache server. We had an issue a few months ago: we don't know why or how, but the meeting notes and the content of the docroot of that Apache server were deleted. Everything was retrieved and put on a GitHub Pages site, and the board is now managing its own notes. So now that we don't use it, the DNS was changed to GitHub Pages, nothing points to that machine anymore, and the Apache server has been shut down. The goal now is to remove all the resources in Puppet, and also on the virtual machine and GitHub, so we can sunset that service definitively. That one should be quick; I tried to make an exhaustive list. I'm volunteering, but if anyone wants to help or participate, don't hesitate: add yourself as an assignee and specify in a comment which items on that list you want to work on.

And finally, the last item I want to add is the Ubuntu 22.04 migration. We started to work on it, and it's quite important. So, just checking the Ubuntu 18.04, a.k.a. Bionic, upgrade campaign... why is my GitHub absolutely frozen? Okay, better. So I'm adding this to the next milestone; we need to work on it now, that's our next priority. Everybody did the heavy lifting for the Packer images. Later this week, we will first release a new minor version of the Packer templates, featuring Maven 3.9.1 instead of 3.9.0. Once we have deployed that version, the goal will be to switch the Packer image base from Ubuntu 20 to Ubuntu 22. That's the first step.

Stéphane, you will have some work on this one: by creating a new set of machines for trusted.ci on Azure, that will automatically cover the upgrade of the trusted.ci virtual machines that are running on EC2 today; they will jump directly to Ubuntu 22. For the trusted.ci controller, the risk is low, because it only runs a Docker container, so we don't have to care about the Ubuntu version there. We might have some differences on the trusted.ci agents, but I'm not really stressed about that one, because most of the tooling there is not Python or the like, it's only the JDK, and we use Temurin. So that should be quite easy. That will be the next step.

Then there are a few Docker images still using an older Ubuntu; for instance, the VPN image is one of them. Finally, the biggest one will be the packaging. The Docker packaging image is already on Ubuntu 22, but there is the pkg virtual machine, the one that is always causing us trouble. The tricky area here is that that machine is risky to upgrade in place, because it's a single machine, and we will want to split it across different areas; maybe the packaging job should be handled by release.ci automatically. The proposal we discussed last week in this meeting was to start with a Docker image for running the packaging job. For this one, we cannot just upgrade from 18 to 20, as that would break the release process, and we absolutely need to upgrade to 22, but doing that in place would break all the Python tooling and also all the Debian repository things. That's why, if we can switch the release process scripts from whatever they do today to running through a "docker run jenkinsciinfra/packaging", we keep an Ubuntu 18 baseline as a first step, and then we can upgrade the image to 22. For the rest of the virtual machines, we should be able to run an apt dist-upgrade on each. But let's start with trusted.ci, the VPN, the packaging, and the Docker images.
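Note: the proposal above boils down to wrapping the packaging step in the existing jenkinsciinfra/packaging image instead of running it on the pkg machine. A minimal sketch of what that wrapper could look like; the image tag, mounted paths, and the command run inside the container are assumptions, and the real release scripts would define them.

```python
#!/usr/bin/env python3
"""Run a packaging step inside the jenkinsciinfra/packaging container
instead of directly on the pkg virtual machine."""
import subprocess

subprocess.run(
    [
        "docker", "run", "--rm",
        # Mount the workspace holding the war and packaging scripts (assumed layout).
        "-v", "/srv/release/workspace:/workspace",
        "-w", "/workspace",
        "jenkinsciinfra/packaging:latest",  # assumed tag
        "make", "deb",  # hypothetical packaging target
    ],
    check=True,
)
```

Because the scripts then run in a container, the host can move from Ubuntu 18 to 22 without touching the packaging toolchain, which is the whole point of the proposal.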
That's my proposal. It's not mandatory; if you have a counter-proposal or other ideas, don't hesitate. Okay, folks, that's all for me on what I wanted to add. Are there other things you want to talk about? Hervé, maybe we can move the publick8s migration back from the backlog. Yes. Do you feel you will have the time to work on it, or do you want to plan it for later? Let's start the week like that, and I'll put it in the milestone if we have time. Makes sense. That's all for me. Is there something else to add, folks? Very good. Okay, cool. All good, yeah. So I will stop recording and screen sharing, and see you next week. Stopping this one. Stop recording.