Everyone, welcome to the weekly Jenkins infrastructure meeting. It is the 23rd of November, 2021, and today we have Mark, Hervé, and myself. So let's start with the announcements. Today's weekly release, 2.322, is currently running, so it is not released yet. We had a minor issue that we will cover right after the announcements, but it should be out before the end of the day. Is there any other announcement? Okay, let's go.

Let's start with the issue with the weekly release. Thanks, Mark, for letting us know that there was a build failure during the weekly release. The root cause is the absence of the GPG binary from the agent where the release runs. We covered it on IRC, and I will transcribe the reason into the notes. The reason is that we recently updated a bunch of our Docker images, including the packaging one, and the release agent environment always uses the latest version of that image. One side effect of our changes, which followed good practices in the Dockerfile used to define that image, is that some dependencies like GPG are no longer present. They were never declared explicitly in the Dockerfile; they used to be on the image only as implicit dependencies of other packages. So we have to release a new version with these elements. We are currently checking; it sounds like we need to add at least git and openssl as explicit dependencies as well, just to be sure, because the release shell script calls these commands explicitly. There might be others, but we have covered almost all of the script. Why not use the image version from last week? Because the agent is pinned to latest, and before these changes we were not tagging regularly, so the previous tag is eight months old or so. The risk is that we would run an image with outdated dependencies. So unless someone is against it, we should be able to deploy a fixed image in the next 30 minutes. Yeah, there is no harm whatsoever; even if we had to delay the packaging of the weekly release by a day, it would still not be harmful. So I agree: no way should we go back to an eight-month-old tagged image version. Just one thing: it sounds like we do have a tag from the 7th of October on the GitHub repository, but I propose we don't spend much time on it and just fix the issue. This is a weekly release; had it been an LTS, I would have proposed pinning the version from last week, but here I propose we go ahead. Is it okay for everyone? Yes, absolutely, good choice. Okay, so let's continue that effort right after this meeting.

Next point: the jenkins.io outage. Last week we had an outage on jenkins.io, affecting all the documentation, the blog posts, everything under that website. We still have a postmortem to write. The root cause was the NGINX ingress Helm chart that we upgraded in production. Even though it was a patch version, it had a side effect due to a security CVE fixed in NGINX, and the consequence was that once upgraded, it deleted the ingress rule for jenkins.io, and after some time the site started to drop out of the Fastly cache. The CVE fix in that Helm chart means that some configuration blocks are now forbidden in NGINX when they use characters corresponding to configuration items that could be used to extract secrets or do something harmful, like calling a Lua script, or even a simple if block. And alas, we had an if block in the annotation. So thanks for the work on that part.
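As an illustration of the packaging fix described above: a minimal Dockerfile sketch, not the actual jenkins-infra packaging image, showing the idea of declaring every binary the release scripts call as an explicit dependency rather than relying on it arriving transitively:

```dockerfile
# Hypothetical sketch, not the real packaging image. With
# --no-install-recommends (the "good practice" that dropped the implicit
# packages), every tool the release scripts invoke must be listed explicitly.
FROM ubuntu:20.04
RUN apt-get update \
    && apt-get install --yes --no-install-recommends \
        git \
        gnupg \
        openssl \
    && rm -rf /var/lib/apt/lists/*
```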
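And for context on the annotation issue: this is the kind of snippet the patched chart starts rejecting, presumably the CVE-2021-25742 hardening, which blocklists characters such as curly braces inside snippet annotations. The ingress name and rewrite rule below are invented for illustration:

```yaml
# Illustration only, not the actual jenkins.io ingress resource.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: jenkins-io            # hypothetical name
  annotations:
    # An "if" block like this one is now refused by the controller's
    # annotation validation, so the whole ingress rule gets dropped.
    nginx.ingress.kubernetes.io/configuration-snippet: |
      if ($request_uri ~ "^/old-path") {
        return 301 https://www.jenkins.io/;
      }
```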
You were able to move some of these configuration items away from the annotation and into the jenkins.io image, at the container level, and drop the ones no longer needed, which lets us keep using the latest version without the security risk involved. So thanks, great work, and we are sorry for that outage. The postmortem with the details will be published. The outcome is that we have to be careful even with patch versions of dependencies. That's the moral of the story, because I wasn't, Hervé told me I should be, and he was right. Thanks for that effort. Is there any question about that topic?

In terms of the postmortem, will you do it offline, as a written thing, or as an offline/online hybrid? What's your preference? I would like to restart the postmortem routine we had a few months ago: a CodiMD document that I will write, open for two weeks. I plan to do it today and send it to both the developer and infra mailing lists. For two weeks, people can comment, propose edits, and ask questions; after two weeks without comments, or once everything is resolved, we will freeze that document into the jenkins-infra/documentation repository in Markdown format and update status.jenkins.io, which keeps a history of incidents, so we will have a link from the incident to the postmortem for later. Does that seem good to you? That means collaboration for two weeks, questions, comments, and then if everything is okay, we can move on. Yes. I'm going to add something to the bottom of the agenda. Sorry, I'll continue.

Next topic, just a note about the TLS certificate for repo.jenkins-ci.org that needs to be renewed; we mentioned it last week. Since the last meeting, we have done everything required to help KK: he is able to take over the name, and he should have generated a new certificate by now, or at least should be able to. We haven't heard from KK yet, and I was planning to send a reminder today, but it sounds like JFrog beat us to it and asked KK what the status is. My proposal is to wait until at least tomorrow; if KK doesn't answer JFrog, we will send him an email asking whether he still wants to do it, or whether we should take over again, do it ourselves, and exchange the certificate with JFrog if it's too much for him.

Hervé, it's your topic: the Kubernetes 1.20 upgrade for AKS and EKS. So I've read through the changes for this version, and as far as I can tell, the only attention point is the default timeout of one second that is now enforced for exec probes. It was already documented as one second, as I could find in articles from 2018, but the setting was not actually respected; that was a bug in Kubernetes. Now this one-second default timeout is enforced, so we'll have to audit our applications to see whether some of them need this parameter specified with an increased timeout, especially Jenkins, but we don't know yet; we'll have to test it. Apart from that, I've not seen anything specific or problematic for the upgrade. So I think we could set the date for this upgrade this week or next week, if that's okay for you. And if Thursday or Friday are viable for you, those are certainly low-activity days in the U.S., because they are bank holidays, so it would be an opportunity. No requirement, just if it's convenient: Thursday and Friday happen to be bank holidays in the U.S., and I for one will not be doing Jenkins work on Thursday. But that's your call; pick based on your availability.
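On the probe-timeout point, a minimal sketch of the mitigation: set timeoutSeconds explicitly wherever one second is not enough, since Kubernetes 1.20 now enforces the default. The container command and script name below are invented:

```yaml
# Hypothetical manifest fragment: raise the probe timeout explicitly so the
# newly-enforced 1-second default does not mark slow containers as failed.
livenessProbe:
  exec:
    command: ["/bin/sh", "-c", "/usr/local/bin/healthcheck.sh"]
  timeoutSeconds: 5     # default is 1s; enforced for exec probes since 1.20
  periodSeconds: 10
```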
Usually, with Olivier, we were doing these upgrades early in the morning, Europe time, so we have a full day if something goes wrong. So Thursday morning would be good. If that's okay, then you'll have to send an email to both mailing lists to let people know, and open a pull request to prepare the operation on status.jenkins.io. And additionally, a fourth point: announce it on IRC and on every instant-messaging channel we use for infrastructure communication. Did I miss one, or do you have another location we could use? Those are great choices. Oh, community.jenkins.io might also be a good place to note it, if you didn't mention that already. It doesn't do any harm to say it there, and it's another forum where someone might see it.

Hervé, are you expecting downtime? I don't recall: do Kubernetes upgrades typically take downtime, or is it rather a rolling thing where a piecewise upgrade usually works? The problem would be an application needing more than one second to respond to the probes, and I don't know how to test that. Okay, so we predict there probably will be downtime. Yes. Okay, got it. We also depend on LDAP, which is the most critical part. It's not scaled horizontally, because it's an LDAP system, and since it depends on a persistent volume, it depends on how Azure reschedules the container: sometimes it works in a few seconds, and sometimes it takes like ten minutes to unmount the volume from one machine, mount it on another, and then schedule the container, which waits for the mount to be present with the data. That part is really variable, so we should expect a cut. Most LDAP consumers, like Jira or the ci.jenkins.io agents, have some caching. Still, we should say that by default we expect an outage: let's say the service will be down for a few minutes during the restart. Great, okay, thank you. And as you say, the second risk is that the probes might block some services, but the other services, like the plugin site, are horizontally scaled, so that should be okay.

So, possibly Thursday or Friday next week, beginning of December: does that sound good for everyone? Oh, you said Thursday or Friday next week? I thought everybody was saying this week, in two or three days. Did I misunderstand, Hervé? It's okay for me; apart from the short announcement, I don't see why we would wait. Great, so this week. And this week is great for me; that really works, as long as I'm not needed, and I don't think I would be any help; if anything, I could get in the way. That could be a great excuse if you want to get away from the family dinner, though. Sorry, I digress. So, okay, this week. It was me who misunderstood, sorry for that. I can send you a link to your great notes. Okay.

The next topic is just one mention; it's my main task for this week. I'm working on a new pipeline library function in the global pipeline library system that we're using. The goal of that function is to centralize the Makefile and the pipeline steps we run when we use Terraform. The goal is then to use that pipeline library for our current automated Terraform jobs, meaning AWS and Datadog.
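A minimal sketch of what such a shared-library step could look like; the step name, agent label, and parameters are hypothetical, since the actual function is still being written:

```groovy
// vars/terraform.groovy -- hypothetical name and signature, for illustration.
// Centralizes the init/plan/apply sequence so the AWS and Datadog jobs can
// share one definition instead of each carrying its own Makefile and steps.
def call(Map config = [:]) {
    String stackDir = config.get('stackDir', '.')
    node('linux') {
        checkout scm
        dir(stackDir) {
            stage('Init')     { sh 'terraform init -input=false' }
            stage('Validate') { sh 'terraform validate' }
            stage('Plan')     { sh 'terraform plan -input=false -out=tfplan' }
            if (env.BRANCH_NAME == 'main') {
                // Only apply from the primary branch, a common convention.
                stage('Apply') { sh 'terraform apply -input=false tfplan' }
            }
        }
    }
}
```

A consuming Jenkinsfile would then reduce to something like a single call, for example `terraform(stackDir: 'aws')`.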
Both of them will need some slight changes. The Datadog one was created by Olivier and Tyler with Terraform three years ago, and it's still working, which by the way is really cool. The AWS one is the new one that Garrett and I created this year. So the goal is to have the same way of manipulating both, because they have slight differences. There is interest in the open-source community around the US federal government in Terraform automation in pipelines, so for what you're doing, I may come begging you in a month or two to help and coach on whether this could be made usable a little more broadly. Certainly you'll do it for our needs first, but we may come begging for your skills and knowledge on how we could help the open-source piece of the US federal government if they want to use it.

Sure. One pointer that Hervé and I also discussed this week in that area: we are thinking about HashiCorp Terraform Cloud, which could alternatively take care of sharing the state and running these jobs, because there is a free plan that should suffice in our case. It could also be interesting for open source, as they have programs for open-source projects if we need more. Hervé, you told me that one of the limits of the free tier is the number of users allowed to access it; for the admins of the infrastructure that should suffice, but maybe we'll need a bit more. So that's worth keeping in mind. If it's okay, I propose that we start by keeping our own setup, managing everything ourselves, with the risk of losing our state, and then think about evaluating the HashiCorp services, because that could mean less code to maintain. But since we already have some code running, the goal is to go step by step. Is that okay for you, or do you think we could start with HashiCorp right away? I don't have a strong opinion on that part; that's why I'm asking. Can't we import the current state into the HashiCorp cloud? Good question; I think it can be done, but I don't know for sure. Yep, whether Terraform Cloud can import the Terraform state is something to check. Okay, I'll take care of that part.

Next topic: ci.jenkins.io. Maven 3.8.4 and Git 2.34 are generally available upstream, and they are almost available on ci.jenkins.io. We are having issues building the virtual machine template, but only for Azure with Windows: there is an issue on Azure with the replication of images, an internal API error that I'm trying to get past. For Amazon, all architectures and all OSes, and for Linux on Azure as well, it should be okay. The reason we went so fast is that the previous Maven, 3.8.3, is not available for download anymore, at least not the binary, which failed our builds. We have the same issue for Git, though for Git it looks like there is some kind of LTS line and a latest line; I don't really know. The previous version we were using, 2.33, is no longer available in the Ubuntu package repository we rely on, but 2.25 still is. I guess there is something to check there; we might have been too quick in choosing that way of installing Git. But it's not breaking the use cases on ci.jenkins.io, because we have a VM template; it was just worth mentioning. So the idea is to provide these versions as soon as possible on ci.jenkins.io and send an email to the Jenkins developer list once it's done. Does that sound good to you? It does, yeah. Is there any question on that topic? Cool.
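On the question of importing the existing state into Terraform Cloud: the standard mechanism is to point the backend at the remote workspace and let `terraform init` offer to migrate the current state. The organization and workspace names below are hypothetical:

```hcl
# Hypothetical backend block; "jenkins-infra" and "aws" are invented names.
terraform {
  backend "remote" {
    hostname     = "app.terraform.io"
    organization = "jenkins-infra"

    workspaces {
      name = "aws"
    }
  }
}
```

After adding such a block, running `terraform init` detects the existing state and asks whether to copy it into the new backend, which would answer the import question above.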
One note about the campaign that Hervé and I ran to check all the Docker images built inside the infrastructure for our own usage, and to take care of tracking the dependencies of these images, whether a parent image or a binary like Maven or Git. The goal of that campaign was to check that we were able to use the latest updatecli manifest version. Five repositories, listed in the meeting notes, still need to be checked or updated; maybe nothing needs to be done, but we have to verify them. I haven't listed all the repositories we already covered. The first one was the Helm charts repository, which took a huge amount of work on tracking the chart versions and the Docker image versions. So each time a Docker image is used internally, it's automatically updated with updatecli, which is very nice.

One last thing, about costs. It sounds like we will manage not to hit the 10K limit on AWS: right now we have used 6K, and we should be around 9.2K this month. As a reminder, the goal is still to stay under the bar of 8K per month, so there is some work left to do there. Just a mention, because everything was discussed in previous meetings, and we are working on it. As a reminder, the plan is to move pkg.jenkins.io to another cloud, because it accounts for 3K of bandwidth, so we have to check the costs on Oracle Cloud and on Azure, just to have a comparison and a backup plan. Azure costs are still the same, 8K, so that's cool; we are under the limit.

Last topic: the election voting system. Mark? Unfortunately, there was a hardware crash on the system that we rely on to run our voting; we don't actually run it ourselves. Cornell University in upstate New York hosts that service and has hosted it for us for years. The professor who runs it said in a tweet reply to me that they were working on it, but I haven't seen anything further, and it is definitely still offline. My hope is that it will come back online. I don't see any way for us to replace the service, because it holds the 40-plus votes that have already been recorded. So we're hoping that they have good backups, as Professor Myers indicated they do, that they're able to restore the hardware, and that the holiday weekend in the US doesn't slow things down; it being a university, holidays tend to be a little longer in the US than typical business holidays. The danger is that it stays offline and we have to do something more, like extending the voting period for the elections. I'll bring it to the governance board for discussion by email today. Okay, thanks for the check-up. Is there anything the team could help with, by hosting the service or providing a machine that someone else would manage? Well, I'll tweet to Professor Myers to see if there's any progress today, but I doubt that would help, because we'd need access to their backups. The data is much more valuable to us than the service itself, right? The voting data for this running election is intentionally secret and intentionally private, and we want it to stay that way; we like that it's private, that's intentional. Good point. Okay, all right. Thanks, Mark. I don't have other topics; I don't know if you have any others. Done. Cool. So let's go back to fixing the weekly release then.
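One illustrative footnote on the updatecli setup mentioned earlier: a manifest declares sources, optional conditions, and targets. The sketch below tracks a GitHub release and rewrites a Dockerfile ARG; the repository and variable names are invented, and exact field names should be checked against the updatecli documentation:

```yaml
# Hypothetical updatecli manifest; field layout per the
# sources/conditions/targets model, exact spec fields approximate.
sources:
  helmVersion:
    kind: githubrelease
    spec:
      owner: helm
      repository: helm
      token: '{{ requiredEnv "GITHUB_TOKEN" }}'
targets:
  dockerfileArg:
    # With a single source declared, it is used by this target by default.
    kind: dockerfile
    spec:
      file: Dockerfile
      instruction:
        keyword: ARG
        matcher: HELM_VERSION
```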