Recording to cloud. Okay, the recording is starting. Welcome everyone to the Jenkins Infrastructure weekly meeting. Today is the 12th of April 2022. Today we have Mark, me (Damien), and we have Gaurav; I hope I pronounced it correctly.

Let's start with a few announcements. First, we had the suspicious-activity incident on Sunday. More on that later. We are now sure that no harm was done, except someone mining for five minutes before we cut all accesses. There will be a public communication later in the week; we are finishing the audit trail and gathering feedback. But yeah, that's the information.

Second announcement: the weekly 2.343 has been successfully released. I manually launched the Docker image build on trusted.ci, just to be sure we had the new version available for infra.ci as soon as possible. I forgot to mention it on the Jenkins release. No worries, it should come automatically for that part in the upcoming weeks. So fingers crossed, but everything went fine as far as I can tell from my point of view. However, I assume that you, Mark, or someone on the release team, needs to check all the boxes of the release checklist. Is that correct? That is correct. Thanks, Mark.

Third announcement: there has been a security advisory today, with a bunch of security issues in plugins. So please update the plugins. That's already done on our public instances, e.g. weekly.ci.jenkins.io and ci.jenkins.io, and also already done on the private instances on Kubernetes, for release.ci and infra.ci.

One last announcement: it's been two or three weeks since we started seeing random slowness on the updates.jenkins.io service. I still need to write a detailed issue on the helpdesk that will explain the different things. Right now, Mark and I are the people on pager duty. We tend to restart the Apache service when it goes too far. The impact can be seen publicly, with some cut connections when we restart Apache, but most of the time it's when the response time goes far beyond the usual 10 seconds. We have action points to fix that. The main one, which I'm announcing now but should be communicated in a written and public way later today or tomorrow at worst, is that we are going to sunset the old mirror system, which is HTTP only, and redirect all mirroring to the current production system running on Kubernetes. That means the mirrors will be forced, or redirected, to HTTPS. So if you hear this and are using HTTP only, you will have trouble, but we are in 2022: there is no sane reason not to use HTTPS for such a thing. That will be the subject of a blog post, an email notification, and maybe more. So stay tuned, but that's the whole idea of that announcement. Did I forget something, or is it okay, Mark? No, that's great. Thank you for sunsetting that thing. I had actually forgotten that we'd left HTTP running when we switched completely to the much better, newer mirror system some time ago. So it's time to turn it off. I agree.

So today I will focus on the done items and eventually the work in progress. Since two members of the team are ill and off, the bandwidth was clearly lower than in the past weeks. So I won't take any hard decisions for the upcoming weeks; those will be loose decisions for the upcoming milestone, and we will fill the gap at the next meeting. Still, there are a bunch of issues that have been fixed. As usual, I'm taking them from the list on the left; I just synchronized it before the meeting.
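As an editorial aside on the mirror sunset announced above: the redirect itself is conceptually simple. A minimal sketch of what such an HTTP-to-HTTPS redirect could look like in an Apache virtual host; the hostname is illustrative only, not the actual production configuration:

```
# Hypothetical Apache vhost: answer every plain-HTTP mirror request with a
# permanent redirect to the HTTPS endpoint of the current mirror system.
<VirtualHost *:80>
    ServerName mirrors.example.org
    # "Redirect permanent" preserves the requested path, so deep links keep working.
    Redirect permanent / https://mirrors.example.org/
</VirtualHost>
```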
First item: account recovery. We tend to have people losing their passwords, so we have to remind them of the correct documentation page so they can autonomously get their account back or reset their password. Sometimes it's because they haven't connected for years, so they even have to reset their email.

A minor one: we fixed some security headers on the jenkins.io domain. We had to roll back these headers due to a bug in the NGINX ingress a few months ago. Everything is detailed on the issue, and now we are back to the expected security headers to avoid some web exploits. So that has finally been closed.

In the area of security issues, I did an emergency upgrade of the NGINX chart and cert-manager last Friday to fix an OpenSSL CVE. We weren't able to do it earlier; there wasn't any way to fix it in the case of the NGINX ingress, so as soon as we had the Helm chart published upstream, we were able to deliver it successfully. The cert-manager one was a major upgrade, so we had to manually do some maintenance operations, but it's fixed. So thanks a lot for that help.

We had the LTS upgrade last week as planned. That LTS has been applied in less than 48 hours everywhere. It took some time for ci.jenkins.io, though, because we had to wait for a big, big queue of builds to be processed. But no issue whatsoever for this one.

An old plugin was archived. Thanks, Tim, for that.

We had issues around repository permission data not being synced. I'm the culprit; I will discuss that later. It's related to the infra reports where I caused mayhem, but it has been fixed. Thanks to all the contributors involved for mentioning it and helping to fix it, and that will help for the infra report as well. Yeah, and I think you ought to highlight the story there, that we're moving something off of that trusted infra into infra.ci where more of us have access. So I think you're doing the right thing; that was just a minor bump. Yeah, so looking forward to your comments there. No problem. Yeah, let's get to that later.

There has been a VPN access request from Ildefonso, because he was the LTS release officer, or release lead, I don't know the exact title. Did I see correctly that Alex Brandes (NotMyFault) is the new release lead for the next version? I think so, anyway. I like to see lots of people taking on that role. Good, that's really cool. So that means we have incoming VPN access. The good thing is that, thanks to Ildefonso, we were finally able to update the last pieces of the documentation for VPN access; that's a part we had missed. So I hope it will be smooth for Alex. Let's see.

Thanks to the team and Hervé for spinning up the public instance weekly.ci.jenkins.io, which is not only running on the weekly release but is also a public instance where we can externally demonstrate the new design library elements. So thanks, folks. We have that instance running with a minimalistic UI setup. It means we are able to spin up a fully fledged Jenkins instance on Kubernetes in less than two days, which is quite nice because it includes full configuration as code. So I'm pretty impressed by the work they did. And that one will upgrade to the newest weekly automatically through the regular process? Great. Yes, exactly, thanks. It's synchronized with infra.ci; it's the same image for now.

Another plugin was archived. So thanks to the people involved in that, because I don't know the rules for that part.
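Going back to the weekly.ci.jenkins.io spin-up mentioned above, and purely as an illustration of the configuration-as-code approach: a hedged sketch of how such an instance could be deployed with the official Jenkins Helm chart, assuming a values file carrying the JCasC configuration (the release name, namespace, and file name are illustrative):

```
# Hypothetical sketch: deploy a Jenkins controller on Kubernetes with the official chart;
# the configuration-as-code payload typically lives under controller.JCasC in the values file.
helm repo add jenkins https://charts.jenkins.io
helm repo update
helm upgrade --install weekly-ci jenkins/jenkins \
  --namespace weekly-ci --create-namespace \
  --values weekly-ci-values.yaml
```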
With Hervé, I will remove the last pieces of the evergreen legacy infrastructure, because even though it should have been deleted, there were some databases and a resource group left on Azure. So not really expensive, but still good to clean up. So it's probably safe for us to remove, or somehow deprecate, the evergreen documentation on www.jenkins.io then. Yes. There is a reference on the helpdesk issue that links to the old issue where the decision was made by the Jenkins board back in time, reported by Tyler and confirmed by Olivier. So if you need a reference, that's the reference I had been searching for for months, and Hervé was able to find it for us. So thanks, Hervé.

There were two minor issues fixed earlier this morning related to changes on the Docker accounts. I will come back to that; these were caused by the mayhem I caused, but at least we were able to improve the security of our Docker Hub accounts. I'm going to describe that in the next section, the work in progress. So that's still a lot of work done; thanks to everyone involved in that.

Now the main work in progress. First, in order to contact the Docker open source program to ask for open source coverage on the jenkinsinfra organization, not only the jenkins one, we had to clean up, as a prerequisite, all the members of the organizations that we are using for the different classes of images. Most of the time we had between eight and twelve seats used for each organization, while we should only have three. We have documented the new pattern in the private runbook documentation. Now, for each of the organizations owned by the team, the Jenkins infra officer and the backup will be the owners. They must have two-factor authentication enabled on Docker Hub, so they are not subject to credential stealing if they reuse passwords. And the third seat should not be an owner, but the technical user, a member of the organization, which should only have read and write access to the images. That technical user cannot have MFA, so we are using an API token for the connection, which adds an additional layer of security because the token is strictly scoped. And once we are able to use the Docker open source program, we will be able to create some tokens for read-only and some for read-write. Due to these changes, I broke some of the Docker image publication, and I had to fix that earlier this morning. So thanks, Gavin, for spotting it, because it blocked him on his work, but finally it's there.

What else do we have? Migrating rating.jenkins.io: Stefan was able to create the ingress, and before catching the flu he was working on migrating the database. So good job, Stefan. It's delayed to next week, of course, time for him to heal.

Docker open source program: I'm currently discussing on the official Docker Slack channel with the manager of the team in charge of the open source program. We are going to send an email as the Jenkins infra team, but we are giving them the details of the issue we are encountering, because it seems like we are an edge case, especially with the pull/push and security model. And so, is your sense that they're considering what that means for their program, or will we just find ways to adapt? I think both. He clearly told me they might give us a short-term solution for now to unblock us, but he understands, and it's a great use case for them to understand what end users could do in the coming months, or what some end users are already doing without telling them. So that's mainly a product management discussion. And then I will issue the official request for extending the open source program to our organizations and users. So they are really interested. So yeah, that's a positive thing.
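To make the technical-user setup described above concrete: a hedged sketch of how a CI job could authenticate with a scoped Docker Hub access token instead of a password. The username and secret variable name are illustrative, not our actual accounts:

```
# Hypothetical CI step: the scoped access token is injected as a secret by the CI
# system and piped on stdin, so it never appears in the command-line arguments.
echo "$DOCKERHUB_ACCESS_TOKEN" | docker login --username jenkinsinfra-bot --password-stdin
# From here, `docker push` is limited to whatever scope the token grants on the
# organization's images (read/write today, read-only or read-write once split).
```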
What about the email alias for press? I've contacted both the Linux Foundation and Mailgun and I'm waiting for feedback. I expect the Linux Foundation to create the mail server so we can move the MX record to their system. And in the case of Mailgun, I asked them if we can recover the accounts, or at least if they can export the list of emails, given that we prove our identity by adding a DNS record or whatever security measure they want, since we don't have access anymore. So that one will be delayed until we have an answer, and then I will add it back to the workload.

GC for Packer: the development images are now garbage-collected, so we save some money, not that much, but great job, Stefan. I gave him the requested information just before the weekend, so when he is back he will continue with the staging and production images.

Infracost is currently delayed, mainly because we need them to update based on the feedback and bug reports that Hervé gave them. Yep, I forgot this one. Thanks, Mark.

Monitoring builds on private instances: that one is delayed. Hervé and Stefan were planning to work on it, but since they are obviously not in good shape, it's delayed by one or two weeks depending on their bandwidth when they are back. So no action done on this one. Same for the Kubernetes upgrade to 1.21: I will wait; that will be the main task when they are back.

On my side, I was able to clean up the artifact caching proxy a bit. We now have a Docker image which is published, up to date, and tracked by Updatecli. The next step is to find a way to test it correctly, and then to propose it and validate with a real-life user of ci.jenkins.io what they think about that solution. So on the artifact proxy, have you dealt with the fact that we've got multiple clouds that we're using to provide compute, or are you only caching for one of the clouds? At first, I want to provide one instance per cloud, so one on DigitalOcean and one on AWS for now, and see how it behaves. That's the pattern I want to propose: one cache per cloud (a hedged configuration sketch follows at the end of this topic). So we don't deal with network latency, but we have to deal with distributed caching, which means a given build could take longer if it's scheduled on another cloud, because it will have to re-cache. But that should be good enough for now, given that only the released artifacts will be cached. Excellent, thank you.

Okay, in that area, I've asked for more details, but it looks like there is something identified as a "Nexus client". I'm not sure what this is, because Nexus is a server to me, so maybe it's a local Nexus instance mirroring our public repo. And it looks like this instance is causing a lot of requests on the JFrog Artifactory side, as they say. I have no knowledge of a Nexus instance on our side, so I asked them if they have a public IP or more details so we can track it. I assume it is a big consumer of Jenkins development artifacts, or someone working a lot on Jenkins, but we need more information before being able to find out what it could be, and maybe contact them if there is a contact. So I'm waiting for feedback from JFrog, but it seems like the recent slowness on repo.jenkins-ci.org was caused by that big peak of requests. So let's see; we need more information to be able to conclude or contact anyone. So if you have a Nexus mirror, a big one, of repo.jenkins-ci.org and you're listening to this, please contact us. We need your help to make everyone's life easier.
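As referenced above on the per-cloud caching proxy: a minimal sketch of how a build could be pointed at such a proxy, assuming a Maven build on a ci.jenkins.io agent. The proxy URL is purely illustrative and not an existing endpoint:

```
<!-- Hypothetical fragment of a Maven settings.xml on an agent: route repository
     traffic through the cloud-local caching proxy instead of hitting
     repo.jenkins-ci.org directly. -->
<settings>
  <mirrors>
    <mirror>
      <id>artifact-caching-proxy</id>
      <name>Cloud-local artifact caching proxy</name>
      <mirrorOf>*</mirrorOf>
      <url>https://artifact-caching-proxy.do.example.internal/repository/public/</url>
    </mirror>
  </mirrors>
</settings>
```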
Migrating the infra reports from trusted.ci: so, the mayhem I caused, sorry for that. The goal is to migrate the report generation, which is then used by regular jobs to ensure that the plugin maintainers have the correct rights everywhere on all our systems. Obviously it's a sensitive area, not only functionally, but also because that's how we manage RBAC. And I caused mayhem in the sense that the new version was generating almost empty reports and overwriting the actual production reports. Based on discussions during the weekend, I was able to reopen the pull request that caused the mayhem, and now it's creating separate reports. We don't know yet if these reports are correct, but at least the job is generating them and they are not empty. Now I will need help from people who have some knowledge of the RPU (repository permissions updater) to compare them and help me in that area. If there is no answer in the upcoming ten days, I will risk some mayhem again by merging the pull request and seeing if it breaks anything. No worries on the state, because the advantage of this regular task is that if anything is broken, I shut it down and let trusted.ci redo it. There is no past state to reconcile, so it's quite easy to go back; no need for backups. It's just that some plugin maintainers won't be able to publish for three or four hours, the time for the system to heal itself.

And finally, Hervé and I are working on an issue with Updatecli, which is causing delays on the automated update system of the same name. We hit an issue with the recent version; Hervé fixed it, so it's time to release and publish. That should allow Gavin to publish a new version of the plugin site in production. Okay, I'm not sure I understood that last one: so there was some surprise in the transition? Not at all. It's only that our Updatecli system sees changes, but fails to open the pull requests proposing those changes, which makes the production updates slower because we need to do them manually when we see the pull request failing. Got it. So it's a bug in Updatecli, the upstream tool we are using, but Hervé was able to fix the bug, so we're only waiting for everything to be published, released, and deployed. Excellent, thank you. Thanks for the clarity. No problem.

So yeah, that's a lot of work in progress. We also have incoming elements. I'm just checking them, but I propose that we delay them to next week, given that we have enough work in progress. Docker Hub credentials for the VM agents are blocked, because Stefan and I need to work together to put the new credentials in place since we rotated them all. So, delayed.

Migrating updates.jenkins.io to another cloud: that one is an old issue I already mentioned. The idea would be to find a way to avoid these 3K per month of bandwidth on AWS, where the update center is. There had been a nice and interesting discussion and proposal from Olivier Vernin, never pushed to the issue, about storing that JSON file in an Azure bucket. At least that way we don't have to deal with a web server that could be a limit, because we have to tune the TCP settings when there is a workload peak. So using a web service from Azure could be interesting, because it's self-managed and self-scaled, compared to something we would run on Kubernetes. Or we could move to Oracle.
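On the idea of serving the update center file from object storage: a hedged sketch of what publishing the generated JSON to an Azure storage account with static website hosting enabled could look like. The account name is illustrative, and this is not the team's actual pipeline:

```
# Hypothetical publish step: push update-center.json into the $web container of a
# storage account with static website hosting enabled, so Azure serves the file
# directly instead of a self-managed web server.
az storage blob upload \
  --auth-mode login \
  --account-name updatesjenkinsio \
  --container-name '$web' \
  --name update-center.json \
  --file ./update-center.json \
  --content-type application/json \
  --overwrite
```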
And finally, Hervé was discussing with Daniel; I need to check back with them. Maybe we could also use Fastly to cache that big JSON, because purging the cache still takes less than one minute, while we generate the JSON file every five minutes. So there wasn't any immediate reason not to do it; that could help us a lot and reduce the workload on updates.jenkins.io. But we need to be sure. So let's let Daniel and team wind down from their huge week with the security advisory and ask them again next week. That could be a nice way to solve the problem, because it would remove a lot of constraints on the web service for us. We could move it onto Kubernetes with no risk on the bandwidth, because Fastly would take most of the bandwidth alone. So, I thought there was a concern that the bandwidth demands there might overrun our Fastly budget, but I assume Daniel and the team are aware of that. So, great. The thing is that we can still ask Fastly, because now we have real numbers to put on the bandwidth, which was an issue three years ago; at least back then it wasn't possible to measure it correctly. Got it, thank you.

We have an issue... oh, I will add that one for myself for next week: our Jira instance hosted by the Linux Foundation has reached its end-of-life announcement, which means in six months it won't be updated anymore. So I need to ask the Linux Foundation as soon as possible if they can upgrade to the latest Jira LTS, which should reach end of life in October 2023.

Then there is the issue about the Jenkins Docker image build on the Windows image; the title is self-describing. Delayed; that should be done by Hervé. And the rest are minor issues and new ideas. So I propose that we delay them to next week, because we won't be able to cover these issues in the upcoming days.

That's all for me. So, right after the meeting, I will update the next milestone, close this week's milestone, and publish the notes. That's all for me. I don't know if you folks have questions or points you want to underline. No, thanks very much, Damien. Thank you. Thank you. No problem. On your side, Gaurav, you said you had some questions or maybe things you want to underline; I'm available if you want to discuss that now, and I can stop recording. At the moment, I do not, but maybe I'll share in a few minutes post-recording, once your meeting is over. So I'm stopping the recording now.