Hello everyone, welcome to the Jenkins Infrastructure Weekly Team Meeting. We are the 7th of November 2023, and today we have around the virtual table Damien Duportal and myself, Hervé Le Meur, Stéphane Merle and Kevin Martens. Let's start with announcements: release 2.431 is out, at least the packages, the WAR and the Docker images. I'm not sure about the changelog, Kevin, do you have news on that part? Yep, it's merged and on the site now, so it's live. Cool. I don't have other announcements. We will probably upgrade Kubernetes on the public cluster later today, or tomorrow if we see a last-minute blocker. I wanted to do it earlier today but I didn't have the time to ensure it would work as expected. So: public cluster upgrade later today (get.jenkins.io, etc.). I don't have other announcements. Upcoming calendar: next week, 2.432, as usual. Please note that we will also have the .1 of the new LTS release line on Wednesday, November 15, so in 8 days. Mark is the release lead, which means a bit of pressure for Hervé, who wants the Windows container images published. I'm confident in my change. Absolutely, I am too, so I'm sure that will be very nicely done. The release candidate was done six days ago, on the 1st of November. I haven't heard about major issues; anything else to say about the new LTS, folks? Yeah, for Windows, note that it's JDK17 and not JDK11 like before. For the weekly Windows container image only, be careful. Yeah, sure. JDK17 by default, so that might require a changelog message though. Yeah, I'll add it in the documentation and in the changelog on jenkins.io. Okay, the same warning notes that I put in the previous one. Perfect, I haven't checked, but I just wanted to be sure that we had the proper glue between the bricks. So perfect, good job folks. Do we have an advisory announced? No, the last one was the 25th of October, so no advisory. Yes. And what about the next major events? DevOps World London on the 5th of December.
Tim Jacomb will be there for sure. And we will have the Jenkins Contributor Summit in Brussels for the first time, beginning of February 2024. Any question on the calendar? Okay. So let's get started with the work that we were able to finish during the past milestone. So congratulations to Bruno, who is now part of the GSoC team and has the permissions to do things. Yeah, congrats. I'm sure that will benefit us, especially for putting the GSoC projects in production. I trust Bruno's skills to learn the Jenkins infrastructure things soon. I think we will have really good surprises and a nice outcome from this. Belnet: Belnet is a mirror that we disabled 2 or 3 weeks ago after one of our users complained that they were blocked when trying to reach it, so they weren't able to download Jenkins releases. After careful checks and communication with the mirror administrator, we had confirmation that an IP range was blocked on their servers because of DDoS abuse. But that block was from years ago, 5 or 6 years ago. So they have removed the block, and they will let us know if they add a new one. And of course, we have enabled the mirror again and confirmed with the user, so everything is back to normal. Thanks Belnet for sponsoring the Jenkins project. Any question? Okay, next one. It has been a team effort to track and upgrade to JDK21 on the whole platform, everywhere, for every case. The main challenge was that some architectures, such as s390x, are still on early-access (aka preview, aka nightly-build) versions, while some are already on GA, such as Intel or ARM. So we had to track them carefully because they have different life cycles: sometimes you get an upgrade of the EA version but not of the GA one, and vice versa. So we now have the latest JDK21 everywhere it's available: GA by default, or EA if there is no GA build. Nice work on this one, that has been a lot of tiny details. So yeah.
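To illustrate the selection rule described above (GA build when one exists for the architecture, otherwise fall back to the EA build), here is a minimal sketch; the data structures and function name are hypothetical, not taken from the actual jenkins-infra tooling:

```python
# Hypothetical sketch of the "GA by default, EA as fallback" rule used
# when tracking JDK21 across architectures with different life cycles.

def pick_jdk_build(arch, ga_builds, ea_builds):
    """Return the preferred JDK21 build for an architecture.

    ga_builds / ea_builds map architecture names to version strings.
    GA wins when available; otherwise fall back to the EA (preview) build.
    """
    if arch in ga_builds:
        return ("GA", ga_builds[arch])
    if arch in ea_builds:
        return ("EA", ea_builds[arch])
    raise LookupError(f"no JDK21 build available for {arch}")

# Example state loosely matching the discussion: Intel/ARM are GA,
# s390x only has an early-access build (version strings are made up).
ga = {"x86_64": "21.0.1+12", "aarch64": "21.0.1+12"}
ea = {"s390x": "21.0.1+12-ea", "x86_64": "22-ea+20"}

print(pick_jdk_build("x86_64", ga, ea))  # GA preferred even when an EA exists
print(pick_jdk_build("s390x", ga, ea))   # EA fallback
```

The point of keeping the two maps separate is exactly the life-cycle mismatch mentioned in the meeting: each can be bumped independently without affecting the other.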
I don't know if you have any question on this one. Okay. Hervé, the removed jenkins.io pages are not accessible or indexed anymore, so I believe this is closed. Yes. When I activated the delete option of blobxfer, almost every outdated link, which I think was returning 403 errors, got cleaned up. The directories containing these outdated pages contained only those pages, so when they were removed, the web server was attempting to list the now-empty directory and returned a 403. By removing these empty directories, the web server now returns the not-found page, which is better than an error, well, not a pure error, but a server-side error. And for a good part of the documentation pages, I've put redirections in place, so users coming from old Google results are getting a proper page when searching for Jenkins documentation on Google, for example. Nice work. Anything else to add on this one? No. Okay. The last issue we were able to finish is the planning for supported JDK versions in the Jenkins infrastructure. That was removing JDK19, which has not been supported since March or April if I'm not mistaken, and updating the ci.jenkins.io online documentation to point to the JEP from Mark. So that has been done. Nothing else to add here, unless you have questions. Nope. Okay. We had an issue, like every week: someone tried to reset their password on accounts.jenkins.io while they were actually thinking of their own Jenkins controller account, it looks like. They never answered, so I guess that's it. Now, work in progress. I'm going to update the priority order, so step one is the update center. Work on the update center, then. Reminder: the goal is to finish the POC for an update center using mirrorbits, which will delegate most of the bandwidth to a set of Cloudflare buckets. We have one right now. That could also allow us in the future to have other locations for downloads.
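As an illustration of the kind of redirection mentioned above, a rule like the following (an Apache-style sketch with made-up paths, not the actual jenkins.io configuration) sends visitors of a removed page to its replacement instead of an error page:

```apacheconf
# Hypothetical example: redirect an outdated documentation page to its
# current location instead of serving a 403/404 on the empty directory.
Redirect permanent /doc/old-page/ https://www.jenkins.io/doc/

# Catch a whole removed directory tree with a pattern:
RedirectMatch permanent ^/doc/removed-section/.* https://www.jenkins.io/doc/
```

Old Google results then land on a live page via a 301, which also lets search engines update their index over time.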
So we were able to finish and validate the update center script for the parallel copy and the mirrorbits scan: at the end of the parallel copy to all the locations, we trigger a mirrorbits scan. Timing is between 1 minute 20 and 1 minute 50 with a lot of changes, so it looks like we should stay within the three minutes once we integrate the generation. We were also able to fix the HTTP 500 Apache errors, so now if you go to azure.updates.jenkins.io, you have a working update center, which is not updated regularly yet, but it's working. Now, the two main tasks: the idea is to hand over to Stéphane on that part. I will take some of the minor issues and Stéphane will take the major ones, because I'm a lazy person, of course. First step is jenkins-infra/crawler: we need to update that script so that it copies the files it generates to the update center mirrors, because this is also served under updates.jenkins.io. This repository generates the Jenkins tool installer definitions, and they are signed and integrated with the update center index. Then, after this one, we will have to do a full end-to-end test updating the mirrors regularly. That means merging the work we did on a pull request of the update-center tool into the real world, or at least validating with Daniel that it works on a test case. The goal is to ensure that generation and copy take less than three minutes for the UC. For the crawler we don't really care, because it runs once a week. And once both of them are done, we will be able to run the "Jenkins against the new UC" tests. The goal will be to spin up a Jenkins controller in a container instance, set its update center to the new URL, and see if it works. If we can download plugins, then we will have to test the Jenkins plugin CLI and other scenarios. The goal is to see if we have a functional update center, if it works as expected with the redirects, etc.
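The "parallel copy to all locations, then trigger a mirrorbits scan" sequence can be sketched roughly as below; the copy and scan functions are stand-ins for the real rsync/mirrorbits invocations (names and locations are made up), so treat this purely as an illustration of the orchestration order:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the real operation (e.g. an rsync of the update center
# files to one mirror location). Hypothetical name.
def copy_to_location(location):
    return f"copied to {location}"

# Stand-in for asking mirrorbits to rescan the repository after the copies.
def trigger_mirrorbits_scan():
    return "scan triggered"

def publish(locations):
    # Copy to every location in parallel, wait for ALL of them to finish,
    # and only then trigger the scan -- the sequence described in the meeting.
    with ThreadPoolExecutor(max_workers=max(1, len(locations))) as pool:
        results = list(pool.map(copy_to_location, locations))
    results.append(trigger_mirrorbits_scan())
    return results

print(publish(["azure", "cloudflare-bucket"]))
```

Parallelizing the copies is what keeps the total wall-clock time near the slowest single destination, which is how the 1m20–1m50 figure stays under the three-minute budget once generation is added.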
Then we will be able to start writing a JEP to describe the changes, with the POC as a proof. These are the top-level major elements. On my side, I will have a few minor changes around credentials and variables, but these are not worth mentioning today; it's technical implementation details and nitpicking mostly. Any question or need for clarification on that topic? No? Okay. Stéphane, are you okay to work on this until Tuesday? Of course, yes. Perfect. And Hervé, if it's okay for you, we will hand over for the next milestone. So one milestone with Stéphane and me, and then the next milestone will be either both of you, or you and I. Is that okay? Cool. So it moves to the next milestone. Next topic, ARM64 agents. Can you give us a status on this one? Yes, of course. The latest development update is that cert-manager and Datadog are running on ARM64 now. Yes. I had to review my first pull request, for cert-manager: it has three services and I migrated only one at first. And for Datadog, there is the Datadog agent and the Datadog cluster agent; I added a node selector by mistake to the agent, while it should have been only for the cluster agent. That prevented the Datadog agent pods from being spawned on every node of the cluster, and we noticed it because LDAP had triggered an alert on PagerDuty. At first, I was wondering why only LDAP and not every other service on the x86 nodes. It was because LDAP uses more advanced monitoring, like process monitoring, which the other services don't. Nice job. So what are the next candidates for the ARM64 migration? The plugin site backend API. I have to take a look at the issue, but after that, and I think it will be the last one, weekly.ci.jenkins.io, the demo controller. What about the plugin site front-end? I don't think the front-end is a problem, because it's served by the web server; that's why I thought of the backend. There is one I don't know: plugin site issues. I don't know what technology is used and whether it runs on ARM64.
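For reference, the node-selector scoping described above would look something like this in Datadog-style Helm values. This is an illustrative fragment (the key names follow the datadog/datadog chart layout, but double-check against the chart before reuse): the selector belongs under the cluster agent only, while the node agent DaemonSet must stay unconstrained so its pods land on every node:

```yaml
# Illustrative Helm values: pin ONLY the cluster agent to ARM64 nodes.
clusterAgent:
  nodeSelector:
    kubernetes.io/arch: arm64

# The node agent (DaemonSet) gets NO nodeSelector -- adding one here is
# the mistake discussed above: its pods stop being scheduled on x86 nodes,
# and monitoring silently disappears from those nodes.
agents: {}
```

The failure mode is subtle precisely because most services keep reporting basic metrics; only the one relying on agent-side process monitoring (LDAP here) paged.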
Let me just take a look at the issue... not now; the goal is just to mention that we don't know, so there might be a risk on this one. So, if it's okay for you, you can proceed with the API, the front-end, and weekly.ci as soon as possible. Does it look good to you? Yes. Cool. Okay, that should be good for the next milestone. Is that okay? Yes, and it's the last one. For the plugin site issue, we already have this image available. Just as a note for later: since the mirrorbits development repository has new activity, I intend to propose a pull request to mirrorbits to cross-compile it, so we would be able to have an official image for deploying mirrorbits on ARM. It looks like the author announced a new release before the end of the year. So honestly, I don't think we should plan for mirrorbits on ARM64, if that's okay. I will note in the issue that it is blocked by upstream providing an ARM64 binary. Okay, cool. If you have... yep, sorry. A contributor mentioned that Debian is providing an official mirrorbits package on ARM64, but we won't use it; it was just a mention. For the infra planning right now, it doesn't exist and is not supported, so we don't use it and we stick to Intel. Do you think it would be doable, and you can say no, it's an open question, to start building images such as docker-helmfile or docker-hashicorp-tools for both Intel and ARM64, and start scheduling infra.ci agents on an ARM64 node pool? Do you think you will be able to start this work this week, or should we delay it? We can add it as a bonus step. Okay, let's say the ARM64 agents are a bonus step: node pool first, then the Docker images. Okay, cool. I also have something to report here: Falco on ARM64. We have the Falco daemonset, not the main application, failing and rebooting on the public cluster due to an issue; it still needs a Falco version bump. I plan to run that operation as part of the Kubernetes 1.26 upgrade. So if we have an issue on the cluster that is due to the upgrade, maybe you can...
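Building the same image for both architectures, as proposed above, is typically done with Docker buildx; a generic, non-authoritative sketch (the builder and image names are placeholders, not the actual jenkins-infra pipeline):

```shell
# Generic multi-arch build sketch using docker buildx.
# "multiarch-builder" and the image name are placeholder values.
docker buildx create --use --name multiarch-builder

docker buildx build \
  --platform linux/amd64,linux/arm64 \
  --tag example.org/infra/tools:latest \
  --push .
```

The result is a single multi-arch manifest, so agents on either the Intel pool or a future ARM64 node pool can pull the same tag.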
Then we'd be able to blame this new version. Exactly, Kubernetes is so hard to upgrade. Anything else to add on the ARM64 migration? Okay, any question, clarification, Stéphane, Kevin? Okay, next step. I'm reporting about Kubernetes 1.26. So, EKS done: earlier today, the private cluster was upgraded. So for EKS, sorry, first I need to move ARM64 to the new milestone. For Kubernetes, we still have the public cluster to migrate; that's the last step. Stéphane and I paired on this one. We wrote down the issue we had, mainly on the Amazon clusters: the system was refusing to upgrade the VPC CNI, the add-on in charge of the network inside the Kubernetes cluster, because it said: hey, you cannot bump two minor versions at once, you need to go minor version by minor version. So we had to bump manually from 1.13 to 1.14, and then Terraform took care of upgrading to 1.15. The good news is that the work we did during the past upgrades is paying off: we can upgrade both Kubernetes and the add-ons at the same time, in the same pull request. That is an improvement since last time. But now we have this new constraint. That means we will need to set up updatecli; there is an old issue for that, where we would use the AWS EKS command to retrieve the latest available add-on version for the currently used Kubernetes version. Once we have done that, we will have regular pull requests between the Kubernetes upgrades that bump the add-on versions, because for a given Kubernetes line, the add-ons have their own life cycle. In that case, the add-ons would already have been upgraded to the proper version before the Kubernetes upgrade; that's why we should have this. It's written down, and if we don't have time for it, the person in charge of the 1.27 upgrade, before end of year ideally, will have to know, and will have to take care of at least upgrading the add-ons manually first... End of year? Yes, I'm a challenged person. A challenging person. Yep. That's all for EKS.
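The add-on version lookup and the manual minor-by-minor bump described above map to AWS CLI calls along these lines; a sketch only, with placeholder cluster name and version values:

```shell
# List the vpc-cni versions available for the running Kubernetes version
# (this is the lookup the future updatecli manifest would automate):
aws eks describe-addon-versions \
  --addon-name vpc-cni \
  --kubernetes-version 1.26 \
  --query 'addons[].addonVersions[].addonVersion'

# Bump one minor at a time, since EKS refuses to jump two minors at once.
# Cluster name and version string are placeholders:
aws eks update-addon \
  --cluster-name my-cluster \
  --addon-name vpc-cni \
  --addon-version v1.14.1-eksbuild.1
# ...then let Terraform (or a second update-addon call) take it to 1.15.
```

Keeping the add-ons current between cluster upgrades via regular pull requests is what removes this manual step from the 1.27 upgrade path.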
The rest was quite easy. 1.26 is a minor release; it doesn't have a lot of changes. We also upgraded AKS, the private cluster, and that went very well. Oh, I forgot to tick this one. We had two issues. One was due to me: I forgot double quotes. With the magic of the Jenkins Helm chart, there is no validation of the syntax of the YAML files passed to the config maps, unless you write a custom chart like we do for job-dsl. I upgraded the JDK tool yesterday for the issues we mentioned earlier today, and I forgot the double quotes around JDK17. Thanks, Stéphane, for helping me fix this one, because once we upgraded the cluster, infra.ci restarted and was constantly in error due to that syntax error. That would be an improvement for the official Jenkins Helm chart, and we should have our own mechanism to check it. So that one was a human error. Second one: we had issues in the past where we accidentally deleted and recreated the public IPs during an upgrade of the public cluster. So we applied a short-term solution by adding an Azure lock on the public IPs so that nothing, neither Terraform nor Kubernetes, can delete them. For instance, the public IP used for get.jenkins.io: you don't want that one to change every day. But we discovered that AKS refused to upgrade, because the lock was present on the whole MC_* managed resource group and was blocking the upgrade. That's a bit much from Azure, but that's how it works. So, in order to fix this, either we do as we did this morning with Stéphane: remove the lock, upgrade the cluster, and let Terraform put the lock back afterwards. We didn't see any public IP change, so that should work for the public cluster too. However, last time, Tim Jacomb showed us a solution: we should be able to create a custom resource group for the public IP, move the public IP there, and, by adding the proper annotation on the load balancer Service object in Kubernetes, everything should continue working as is.
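The load balancer annotation mentioned above is, to the best of my knowledge, `service.beta.kubernetes.io/azure-load-balancer-resource-group`; here is a hedged sketch of a Service using a public IP that lives in a custom resource group (all names and the IP are placeholders, and the exact annotation should be verified against the AKS documentation):

```yaml
# Illustrative Service fragment: the public IP lives in a dedicated
# resource group instead of the AKS-managed MC_* group, so a delete
# lock on that group no longer blocks cluster upgrades.
apiVersion: v1
kind: Service
metadata:
  name: example-lb   # placeholder name
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-resource-group: my-public-ips-rg
spec:
  type: LoadBalancer
  loadBalancerIP: 203.0.113.10   # the protected public IP (example value)
  ports:
    - port: 443
      targetPort: 8443
```

Without the annotation, the Azure cloud provider only looks for the IP inside the MC_* resource group, which is why the IP has to stay there today.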
We would then move the lock onto that resource group instead of the automatically managed MC_* resource group, to avoid blocking upgrades in the future. I would want to test this before touching the public cluster, so I'd validate the public IP migration on the private cluster's IP first: if we lose the public IP of the private cluster, we only lose a few webhooks to infra.ci, and that's not a problem. So I propose to test on a dummy public IP, and then on the private cluster. If it works, then we should be able to migrate the public IP of the public cluster prior to its upgrade. Does it make sense? Do you agree with this? We can even use the one we saw which is unused as the dummy IP, because it already exists. Yes, absolutely. We need to move the public IP to another resource group and move the lock, to avoid blocking upgrades; to be tested on the private cluster first. Any question here? Then later today or tomorrow, depending on the time. Okay, looks good. So that means Stéphane is volunteering for the next Kubernetes upgrade, right? Hold on, are you talking about the same Stéphane I'm thinking of? Yes. What are you doing on the 23rd of December? I thought that would be on the 18th or 19th of December, let me think, I have something that day. Isn't that right before your holidays? The day before. You upgrade and then you go on holidays. Have fun! So that's all for Kubernetes. Any objection, question, clarification, remark on that topic? Okay, so that one moves to the next milestone automatically. A new issue: cannot spawn Linux ARM64 agents on ci.jenkins.io. I was trying to check the JDK tools on Linux ARM64, and after waiting one hour for my agent, I saw errors in the logs. I retried earlier today with the reproduction steps, on a replay, and I was able to watch in real time in the Azure console and see the following message: the kind of instance we use for ARM64, a Standard D4-something, cannot use a disk greater than 100 GB. Of course, we use 150 GB by default on ci.jenkins.io.
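One way to compare what the instance type actually supports against infra.ci, where it works, is to query the VM SKU capabilities; a generic Azure CLI sketch (the region and SKU name are guesses for illustration, not values confirmed in the meeting):

```shell
# Inspect the disk-related capabilities of a candidate ARM64 SKU.
# For ephemeral OS disks, the cache/temp-disk size caps the usable disk size.
az vm list-skus --location eastus2 --size Standard_D4pds_v5 \
  --query '[0].capabilities[?name==`CachedDiskBytes` || name==`MaxResourceVolumeMB`]'
```

Comparing these numbers between the SKU used on ci.jenkins.io and the one on infra.ci should show whether the 100 GB limit comes from the instance size or from the disk configuration.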
I'm not sure about the solution yet; that's just what I saw. Decreasing the disk size can have side effects; maybe we need a persistent disk instead of an ephemeral disk. I believe we should check what the setting is on infra.ci, because infra.ci can still spin up Linux ARM64 agents. I don't know if it's because it has a different instance size or a different disk size. Is there any information about the disk size that we actually use most of the time? If we never cross the 100 GB, that's a good point. Yeah, everything is in Datadog, and you did the work for the ephemeral agents to send their metrics. So, not a bad point. This one moves to the next milestone. I'm removing myself by default; if someone is interested, they can start looking at this one. I wrote a note: the ephemeral disk is not allowed with this instance type at the moment. To find a solution: compare with infra.ci, where it works; check the Datadog disk usage for the ephemeral agents; and maybe try different instance sizes. Is that good for you? No questions. For the next topic, I would have liked to have more material. Okay, we just have an answer from Alex. I'm not really sure what happened: when the removals happen, they should have been applied immediately. Everything is properly installed, even if it was installed manually. But it looks like the S3 artifact-archiving system tends to do a delayed synchronization. It could be a bug in the plugin, or expected behavior, I'm not sure. But sometimes you have to wait 1 or 2 hours before the items are cleaned up, even though, as you can see on the screenshot, it looks like it should be more or less immediate. I checked the timestamps: the webhooks were on time, but then, for the artifact archiving to finish copying and cleaning, 1 or 2 hours can elapse. That would be plausible, because it's an asynchronous process. I asked Alex to check whether there is another such case.
Maybe we will get more details. I think it would explain things well if it's related to the S3 plugin. Any question on this one? I'll move it to the next milestone, and I propose we take care of it next week, if we can reproduce the problem. Mirrors status: no work on this one. It was a good catch, folks, three weeks ago. We have 404 errors. One of the possible solutions would be on the OSUOSL side for the archives, since it's the default fallback: we should always have a valid status there, and we can check our sync scripts. I think it's related to the age of the archive being beyond the retention we have on OSUOSL. If that's the case, maybe we could regenerate the archives every week; it's a weekly thing, that's the other angle. I think we'll change the archive generation by adding or touching a TIME-style marker file and checking whether it gets synced to OSUOSL. We'll need to work on that; it moves to the next milestone. Stéphane, your turn, status. I started migrating a lot of the tests to goss tests. I prepared a new version for the Linux ones, so that we can later have the Windows and the common ones. For now, I need to split my pull request and pull out the parts that block it, because the review process needs them to land separately before evaluating it. That's it, I think. Okay: WIP on migrating the Linux tests to goss, then updatecli. Is there any question? Okay. Let's have a look at the recently incoming issues. We have an issue opened by Basil: implement the artifact caching proxy for the maven-hpi-plugin. I had disabled it, I don't know if you remember: we had an issue when using ACP for that particular use case. I created an issue in the plugin repository, and Basil wanted a mirror issue on our repository so we don't lose track of it.
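For context, goss expresses server validation as declarative YAML; a minimal illustrative example of the kind of check a packer image test could contain (the values are made up, not taken from the actual jenkins-infra packer-images repository):

```yaml
# Minimal illustrative goss.yaml: verify a tool runs and a file is present.
command:
  java -version:
    exit-status: 0

file:
  /usr/local/bin/example-tool:   # placeholder path
    exists: true
    mode: "0755"
```

Running `goss validate` against such a file inside the built image replaces ad-hoc shell assertions, which is what makes the migration worth splitting into Linux, Windows, and common test sets.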
So I'm removing the triage label, because it's a valid issue and we are looking at it. I don't think we have time for this now, unless someone challenges that. I would leave it for now, so we keep it in the repository. Is that okay for everyone? Yeah. Hervé, you opened an issue about CyberBits, can you tell us a bit more about it? While looking at the mirrorbits contributors, I noticed one of them was running cyberbits.eu, which provides mirrors for open source software. So it would be nice to ask them if they are willing to mirror Jenkins too. I haven't done anything yet on this issue; it's not on the milestone, so there's no expectation of work on it. Do you think we should add it to the next milestone, or do you want to delay it? I propose we delay it for another week. Okay, so I'm moving it to the infra-sync-next milestone, okay? I wouldn't... yeah, I wouldn't have used that milestone, but we'll see later. Start a new repo under the jenkins organization for the Jenkins contributors spotlight: okay, that one was opened by Chris. I'd prefer Chris to start working on this one. Yes, assigning it to Chris. Okay, let me add it to new issues... no, let's add it to the next milestone. Okay. I believe we had a discussion last week, I don't remember exactly; are there any blockers? It's just a matter of starting to work on it. Cool. So I did. Then I think that's all. We have this one still in triage; we can remove the triage label. We had this other one, but it wasn't a good first issue. No more triage to do here. Let's have a look at the infra-sync-next milestone, to see if we have topics. Here, CyberBits just moved here. We still have the pod restarts on the public cluster, not sure why. AWS: decrease costs for summer; I need to close this one and open the winter counterpart, but not this week. The status from sun grid. And the rest are classical ones. So if you are bored, you have plenty of tasks to pick from here, folks, just in case you were asleep. You'll sleep when you're dead. Oh no.
Jokes aside, I don't have other subjects. Do you have other topics you want to add to the next milestone or to mention here? No? We can also finish early this time. Cool. Okay. See you next week, folks. Thanks for the work. Bye bye. Bye bye.