Hello everyone, welcome to the Jenkins Infrastructure Weekly Team Meeting. Today is the 7th of June 2022, and we have Hervé Le Meur, Damien Duportal and Stéphane Merle. Tim and Mark are away this week, so it's just us, if I'm not mistaken. Let's get started.

So the first announcement: the weekly 2.351 is not released yet. We faced a set of different issues, and we might still have some, we will see during the packaging phase. We have reached almost the end of the release process, which takes 12 hours.

The first set of issues came from infrastructure changes, because Hervé and I tried to mutualize the configuration between infra.ci and release.ci. It appears that there are some subtle differences between the two underlying Kubernetes clusters, and we were bitten by these. That has been fixed. We might still have some issues related to the way Windows containers are scheduled, so we are looking into it. No worries: if it fails, it is only the packaging phase, and that one is easy to retry and quite fast. So we had some infra issues with the Kubernetes configuration of Jenkins, literally the way pod agents are scheduled.

There were also some, I won't call them issues, let's say first steps, since the Docker packaging image was upgraded to Ubuntu 22.04, coming from 18.04, so that's two LTS jumps. The reason is that version 20.04 was missing some tools that we need for packaging Jenkins. The main change is that Tim and Basil worked on moving from a tool named createrepo to the new createrepo_c. That was the core of the change. That tool is used to generate RPM repositories for Red Hat and its galaxy of Linux distributions: when you are on Ubuntu or another distribution, you don't have the yum-style SDK to build RPM repositories, so that tool is a way to build them on something other than the Red Hat galaxy.

The other issues are related to OpenSSL and cipher changes, let me write this down. We are using the latest version, OpenSSL 3.0.x (I don't know the exact patch level), and that one uses stronger ciphers by default. As of today, the certificates that we use to sign the Jenkins releases are not compliant with this latest cipher configuration. So we hit an issue and, for now, we have to use the legacy mode of OpenSSL. We might have a long-term fix, but that's one of the issues.

The other one is that we were building Jenkins with JDK 11; however, we still had JDK 8, and the team removed JDK 8 to decrease the size and simplify the image. In the process, we forgot to update a variable in some templates that specified the full path to the Java to be used by the agent. The reason is that we used to have JDK 8 to build Jenkins and JDK 11 to run the agents on the same Docker image; that's why we had this setup before, and now we don't need it, we have JDK 11 only, thanks to the work of Basil and the team. But we forgot about that setting, so we removed it and we now use the default Java of the container. So now we have reached the release; let's continue to watch it.
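As an aside, here is a minimal sketch of the two pieces above, createrepo_c and the OpenSSL 3 legacy provider. This is not our actual release tooling: the paths and file names are made up for illustration.

```
# Generate or refresh RPM repository metadata with createrepo_c,
# which works on Ubuntu 22.04 without any Red Hat toolchain.
# The repository path is hypothetical.
createrepo_c --update /srv/releases/rpm/

# OpenSSL 3 ships stronger defaults; older signing certificates may
# require the legacy provider. A minimal configuration enabling it:
cat > /tmp/openssl-legacy.cnf <<'EOF'
openssl_conf = openssl_init

[openssl_init]
providers = provider_sect

[provider_sect]
default = default_sect
legacy = legacy_sect

[default_sect]
activate = 1

[legacy_sect]
activate = 1
EOF

# Point OpenSSL at that config for the signing step only, and verify
# that both the default and legacy providers are active:
OPENSSL_CONF=/tmp/openssl-legacy.cnf openssl list -providers
```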
That's also why there was a question from Alex (NotMyFault) on IRC about whether we could stage the changes. That's something we already tried in the past. So, for the sake of sharing that knowledge: most of the time, the effort to create a switch that says "if it's not a real release but a pull request or a staging run, then build and sign but do not deploy" is not worth it. The complexity of such code, to maintain, to build, to test, is really risky, because testing it would itself mean releasing and deploying a new release. Compare that to the fact that we have a weekly, and the weekly can be run multiple times per day: each run creates a new release that is exposed publicly, but that's not really an issue. So it's better, easier for everyone, and faster to test in production, in a real environment; the signing part especially is quite sensitive and hard to test. That's why we made that choice.

However, we still have a topic with the security team, open for at least three years I think, about being able to stage the release. That would mean running the release one day before the real live release: we would build and prepare the packaging on the Monday, for instance, and during the Tuesday we would only have to promote the release publicly. That's the high-level idea, and it creates some interesting challenges to solve. The Docker registry is the easiest one. But it would also mean creating a temporary registry for Maven, pushing the war and its metadata there, and then promoting it publicly. The most complicated case, as identified by Olivier last year, is the packaging: generating deb, RPM or SUSE packages in a staging environment that should remain private could be complicated, because you have an index of packages that needs to be updated. Maybe we could use a different filesystem or different services; there might be solutions. I don't say it's impossible, I say it's not that easy, and that's why we never had the time to spend on it. Is that clear for everyone? Did I forget something? Okay. Do you have other announcements? Nope. Okay, let's go.
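To make the promotion idea above a bit more concrete, here is a minimal sketch of the Docker registry part, the easy case. The registry names and tags are hypothetical, not an actual plan:

```
# Day 1: build and push to a private staging registry (name is made up).
docker build -t staging-registry.example.org/jenkins/jenkins:2.351 .
docker push staging-registry.example.org/jenkins/jenkins:2.351

# Day 2: promote by retagging and pushing to the public registry,
# without rebuilding, so the signed bits stay byte-identical.
docker pull staging-registry.example.org/jenkins/jenkins:2.351
docker tag staging-registry.example.org/jenkins/jenkins:2.351 \
  docker.io/jenkins/jenkins:2.351
docker push docker.io/jenkins/jenkins:2.351
```

The apt/yum side has no equivalent of this cheap retag step: promoting a package means regenerating the public repository index, which is exactly the hard part described above.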
Let's start by checking what was done this week. I'm taking the items in the order they are presented in the closed issues. That's weird, because the order is not kept between open and closed issues; I don't understand why, but once an issue is closed you cannot change the order. So it's not about priority.

NGINX 1.22 campaign. In fact, that was done faster than expected, and I realized why when checking the latest changelog of the ingress controller: the latest stable version of the Kubernetes NGINX ingress controller, at least the community one, is still on NGINX 1.19. So I might have been too quick pushing forward on that issue last week. But it's done and it's stable, on all the use cases we had. Nothing else to say. Is there any question on that topic, or anything not clear? OK.

Build our own Docker images. Congrats folks on that huge work. Now ci.jenkins.io, when running a plugin build on Windows, uses our jenkinsciinfra custom Docker images, inherited from the official Jenkins inbound agent but built on our infrastructure. That was a huge work involving a lot of code, so congrats Hervé on being able to deliver it, because PowerShell can be painful sometimes.

Just a note: while working on that part, we managed to break infra.ci. We tried to schedule a container during a configuration change, and the infra.ci Kubernetes cluster tried to reschedule that container onto a Windows node, because we missed some scheduling constraints. The consequence is that we lost all the data of the data volume: Kubernetes tried to mount the data volume on a Windows Server node, which started a disk check on it. We didn't realize that and killed the pod, which force-unmounted the volume during the disk check, and that made the content totally unavailable. We could have tried to recover the data by mounting the volume on a temporary virtual machine, but we went ahead and recreated it from scratch, which is not a problem per se, but we lost the build logs, including the report generation. So for the future we need to be careful in that area; there will be a topic about backing up the data of the private cluster using Velero or something, I think we have an issue for that. It's not the top priority right now, but the knowledge is now shared in this meeting. We were able to update the configuration and the constraints for scheduling infra.ci, so the Helm chart will never try to reschedule onto a Windows node. Thanks Hervé for all the hidden work on the pipeline library; now you're an expert on the Groovy shared library, and we can go forward on the images. Is that clear? Did I miss something? Is there something else you want to add? Any question? Okay, next topic.

Thanks a lot Stéphane for the help on the update center certificate, which was going to expire on the 14th of June and was therefore blocking any update center and crawler builds, a safety mechanism. We were able to get help from Olivier, who is one of the owners of the CA key: only three persons, KK, Oleg and Olivier, are allowed to use that key to sign the new certificate. He did that and uploaded it to trusted.ci to help us, so many thanks Olivier. Stéphane and I were then able to put together a bunch of documentation, fixes and tests. Everything is green, working and documented, and we have a calendar alert for next year. So thanks a lot Stéphane for the support and for putting all of this together, because a lot of things happened at the same time. No question, nothing to add, nothing unclear? Okay, next topic.

Use Docker instead of img by default. Thanks Hervé for that. Now all our Docker images are built with Docker by default, which means they are built on a virtual machine with the Docker Engine instead of being built by img in a Kubernetes pod. Builds are faster, but we use ephemeral virtual machines instead of containers: what we gain in faster builds and tests, we lose in spinning up a virtual machine, which takes one or two minutes compared to a few seconds for a container. Any questions, anything unclear? Okay.

We had an issue from Alex about VM agents on ci.jenkins.io that were stuck. There were different causes. We fixed what we could by switching the retention policy of the Azure virtual machine agents from "wait some time before cleaning up an agent once it has been used by a build" to "delete it as soon as a build finishes with it". That was already the case for the EC2 agents; however, there was a set of weird settings in the EC2 case, a timeout to wait on top of the "once" policy, and on Azure we weren't even using that policy. Now both clouds use the same policy with no timeouts, and it looks like it's working. I cannot be 100% sure, but at least we haven't seen a bunch of VMs stuck in a weird state, waiting for minutes. That should also give us faster builds, less retention and lower cost.
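Back to the infra.ci scheduling incident above: the standard Kubernetes way to keep a Linux workload off Windows nodes is a node selector on the well-known kubernetes.io/os label. A minimal sketch follows; the pod name and image are placeholders, not necessarily what our Helm chart sets:

```
# Pin a pod to Linux nodes so it can never land on a Windows node.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: example-linux-only    # hypothetical name
spec:
  nodeSelector:
    kubernetes.io/os: linux   # well-known node label
  containers:
    - name: app
      image: nginx:stable
EOF
```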
So let's watch how it behaves, but if you hit an issue in that area, please raise an issue on the helpdesk GitHub tracker. Do you have any question on this one?

The release that failed last week: two causes. One, the release happened on Wednesday, after we had merged some pull requests on the Kubernetes management. That was temporary; we usually try to avoid doing this, but that time we missed that the usual Tuesday window had shifted, and we changed our infrastructure during the Wednesday when the release happened. The second one is a missing mb command, a consequence of my work on the MirrorBrain port. Thanks Hervé for helping me fix that, and thanks Olivier for putting all these scripts on the GitHub repository, which allowed us to fix it. It looks like it's working; let's confirm later today. That means I have more work cleaning up the MirrorBrain machine, because there is a bunch of scripts not tracked by Puppet that were removed, so we need to clean that up. An issue has been opened for it.

Thanks Tim for the last task, the IRC cloak for Libera.Chat as requested by Alex. I have no idea how it works, so thanks a lot Tim. Any other tasks that are closed for you, or that I forgot? Any questions, anything unclear in that section, folks?

So let's move on to the work in progress. I've tried to put priorities on the work in progress; let's try to keep those priorities on the new milestone. I'm taking a screenshot just to be sure. OK. First one: Docker rate limiting. I've closed the issue associated with the open source program, because we are now part of the Docker open source program, and ci.jenkins.io is no longer at risk from the API rate limit for the agents. But we still have rate limiting, because the open source plan does not automatically upgrade our accounts to a professional paid plan, which means we still hit an API rate limit for the official Docker base images: for all the official Jenkins Docker images that we build, the base images of the operating systems, Alpine, CentOS, Ubuntu, etc., and we are rate limited on these images. So we are in discussion with Docker; we are waiting for them to apply a team plan that should increase the thresholds. I've written up a set of short-term solutions that would help with the issue, each one with its own pros and cons; that's a summary of what we discussed during the past weeks and months on that topic. But yes, for now we are quite annoyed.

Just a note: I realized that a lot of tests fail because they rebuild images that have already been rebuilt for the end-to-end testing. That could be improved, and it should improve the success rate of the stages; that's a pipeline specificity of the project. So right now we still have that problem and we are waiting for feedback from Docker. If we don't have any feedback by the end of June, then we will have to act and find another solution. Any question or anything unclear on that one? So if it's okay for you, we'll keep that one on the next milestone. The reason is that I should have news from the folks at Docker, and I want to see if anyone is interested in a deep dive into the testing process for the images, which we could improve.
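As an aside, Docker documents a way to check where an account stands against the Hub pull limits; the header values below are illustrative:

```
# Fetch an anonymous pull token for the rate-limit preview repository,
# then read the rate-limit headers from the registry (requires curl + jq).
TOKEN=$(curl -fsSL "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" \
  | jq -r .token)
curl -fsSL --head -H "Authorization: Bearer $TOKEN" \
  "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest" \
  | grep -i ratelimit
# ratelimit-limit: 100;w=21600        <- pulls allowed per 6-hour window
# ratelimit-remaining: 98;w=21600     <- pulls left in the current window
```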
Next topic: bootstrap the Terraform project for Oracle. Thanks Stéphane for the work on that one; it was required for migrating updates.jenkins.io to another cloud. The status is work in progress for bootstrapping the Terraform states. So correct me if I'm wrong, but the status is that the states are now created on Azure buckets. You are able to manage the technical user and its API key through the private Terraform states project, and you are now working on a Terraform output to generate the secrets so we can bootstrap the empty project on infra.ci. So that should be done before this weekend, given the rate of work you are putting into it. Unless I'm missing something, we can put it on the next iteration. Any question on that one? Okay.

Next one: the DigitalOcean sponsorship. First of all, thanks for checking the costs. It was worrying that the May billing was bigger than the amount of credits that we have, but we double-checked the detailed invoice and in fact it was just the node pool that we deleted on the first day of May. So we are okay: we are consuming 10 to 15 bucks per month now, and the real-time billing view on the DigitalOcean console seems to confirm that; we have consumed less than $2 for June. Now Hervé and I have to set up a meeting with the DigitalOcean folks to discuss the next steps. That's why I am adding that issue to the next milestone. Sounds good for everyone?

Upgrade to Kubernetes 1.22. The status was upgrading kubectl, which was the first step of the process, so I've opened an issue. Update: the work that Stéphane did is good, because we have an automatic pull request, I'm putting it on the screen, that proposes updating to the latest version. We can merge it, but I've noted in the comments that we cannot test it for now, because we have an issue with the helmfile Docker image: there has been a change in the combination of kubectl, helmfile and eksctl on the latest stable version of that Docker image, which fails the check for EKS. I've put some tips in that area; we have to fix it. It's okay for AKS and DigitalOcean, but not on Amazon. It sounds like something in a kubeconfig that we have to fix, but I haven't had time to dig into it, and that one is blocking the Kubernetes 1.22 upgrade (see the kubeconfig sketch at the end). So whoever wants to take Kubernetes 1.22 on the next release will have to work on solving that issue. "I will do my best." I'm going back to the issue; yep, you are assigned to this one. Do you think you will be able to work on it next milestone? "Yes."

Remove img entirely. So now, by default, we use Docker, but we still have the img tool defined and used in the pipeline code, which I'm not sure has been cleaned up or not, but we will have to, and in the docker-builder image I think img is still there too; I may have been too quick to close this issue. Cool, so it's work in progress but almost there, and if I understand correctly we can put that on the next milestone, am I correct? "Yes." Just a reminder for the two of us, Hervé: once we have finished that one, we will have to synchronize with Gavin about the future of the docker-builder image, because that image was initially only for building and testing Docker images, with img, cst and the Docker tooling that you have moved to the virtual machines. But then Gavin used that image as well for npm packages and stuff used to regenerate the previews of some websites such as jenkins.io. So we might want, for now, to keep that image but focus it on the npm and Ruby installations and tooling. That means we should be able to... sorry, isn't that site on a completely different engine now? I'm not sure, I'm not sure, but as I understand it, it's only used for the website preview from the website folder in the infra.ci area; that's the only usage. So we just have to be careful, but that means we should be able to remove all the gh, cst and hadolint tooling along with img. So, migrate updates.jenkins.io to another cloud: Stéphane, is it okay if I
put that to the next milestone?
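Coming back to the EKS check failure above: since it smells like a kubeconfig problem, a first debugging step would be to regenerate the kubeconfig with the AWS CLI and compare client and server versions. The cluster name and region below are hypothetical:

```
# Regenerate the local kubeconfig entry for the EKS cluster
# (cluster name and region are made up for this sketch).
aws eks update-kubeconfig --name jenkins-infra-eks --region us-east-2

# Then verify that the kubectl client and the cluster can talk:
kubectl version --short
kubectl get nodes
```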