Hello everyone, welcome to the Jenkins infrastructure weekly team meeting. Today is the 25th of October 2022. On this virtual meeting: hi, I'm Damien Duportal; we have Hervé Le Meur; Mark Waite is on holiday; Stéphane Merle is here; and we have... Bruno Verachten, sorry. Yeah, I need to sleep a bit. Okay, let me share with you the link to the shared notes, which could be useful; it's added in the Zoom chat.

Okay, first of all, announcements. First announcement: the weekly release was successful on release.ci.jenkins.io. If I'm not mistaken, and we can check on the screen recording, it's visible on the website. That version is available, at least the WAR package. The Docker image is ready; I triggered the build manually just before our call. So I assume the last release items will be done later as usual, such as the changelog. Is there any question? Nope, cool.

Second announcement: we had a security release last week. Thanks for the huge work the security team did on that one. It was plugins only. Thanks, Stéphane, for handling the updates everywhere. We were able to deploy the plugins on our controllers, except ci.jenkins.io, which was part of the security advisory itself. For the others, we had everything up to date in less than 24 hours. Thanks for that. It's business as usual, but it's still worth mentioning that we can keep up to date with these elements. So: security update last week, plugins only. Okay, do you have other announcements on your end, folks?

Just one note about the weekly meeting next week. Thanks, Stéphane, for reminding me just before the meeting: it will be cancelled, because the 1st of November is a non-working day in most European countries, as far as I can tell. It's the Christian All Saints' Day. So unless people in the US want to run the meeting, it's cancelled, because the four of us won't be available. I don't know whether it's a working day in the US or not, so pardon my ignorance. So: the weekly meeting of the 1st of November is cancelled, which means we'll see each other on the 8th of November. That's also a reminder that the upcoming milestone will have to take into account that it will be a two-week milestone instead of only one week. Is there any question? Nope, okay.

Upcoming calendar: next weekly is the 1st of November, so that would be run by our US colleagues if they are working that day; otherwise Mark and I will be the fallback in case of issue. Next LTS: I think it's the 5th of November, if I'm not mistaken. Let me check on the community events page, there is a nice calendar here. Actually, that would be the 2nd of November. We're free: I don't think there is any expectation from us, no security release announced. And no next major event. Did I forget something? Or is it okay for everyone? Okay.

Time to switch to what we closed or finished during the past milestone. A bunch of permission issues and account recoveries; I don't think it's worth spending time on them unless you have questions. As usual for deleted accounts: when someone requests a password reset, a deletion, or a permission change, always check that they are who they say they are. You have to validate the GitHub account and/or the Jenkins account; never rely on only one of them. If you are unsure, please ask the others. There is no shame in that; better to be safe.
Thanks again, Hervé and Alex, for working on the upcoming deprecation of the GitHub Actions set-output command; it now sounds like everything has been done. Thanks to everyone involved in helping the contributor who lost their accesses. We had JFrog issues last Friday, and an LDAP outage last Friday as well. And we had people who didn't carefully read the instructions for plugin maintainers when doing a Maven release from their machine.

We also had a misconfiguration in Artifactory, and that one is important to note, because it made part of the documentation wrong. Let me show you the page. Here we are: the page "Performing a plugin release manually". You have the UI way, which is the de facto recommended way: you have to log in on Artifactory, which is a good way to test your password. Is it the correct password? Do you have access to Artifactory? And then you have to follow these steps carefully. And when I say carefully, I mean it: read each sentence twice, then follow the instructions. By the way, since October 1st, anyone interested in adding screenshots or helping maintain that page, that would be helpful. Just a note.

So, Artifactory is able to generate the correct Maven settings for you, which you can download, even with an encrypted password; you should use that by default. There is also an API endpoint on Artifactory that lets you download that settings.xml already prepared for you; that was the alternative. However, that URL was answering "unauthorized" for everyone, including the admins, except the one or two people who were the initial creators. That was because one setting had been changed a few months before and no one noticed. Thanks, Daniel Beck, for fixing that; it's working again, confirmed by the end user. I wanted to mention it because some contributors have issues, and 99% of the time we have to point them to that documentation and tell them to validate each step. Half of the time they can't even log into Artifactory, which means they have to reset their password on the LDAP and wait for it to be synchronized.

What else did we have? There were some non-infra-related things, like SCM links missing for several plugins: the update center was generating unexpected data for six plugins, which broke the plugin site for some of them. It has been fixed by numerous people, thanks a lot.

Thanks, Stéphane, for the huge work on ci.jenkins.io: we now have metrics. There is a new helpdesk issue for it: if you are a developer or contributor and want to ask the infra team for metrics about an agent's behavior, either because you had an issue or because you want to check something, you now can. That also allows us to make assumptions, and we are going to be able to fine-tune the machines based on the workload, because we will have metrics for that. Really nice work, Stéphane.

Next thing: we cannot use Datadog dashboards, because as soon as they are public we lose the feature of selecting machines in the dropdown, and we don't have a search feature. So if anyone is a Datadog employee or a hardcore user and has a solution, we're all ears. But it looks like we will have to build a custom application, such as a Grafana dashboard dedicated to ci.jenkins.io metrics, particularly the machines. It could also be interesting to look at OpenTelemetry things there, because Grafana has Tempo and Jaeger support, and we could even have logs. So maybe starting an observability platform, just for ci.jenkins.io, that would be public alongside, could be interesting. It's not the top priority for the team, but if anyone wants to put a foot on that topic: I heard some people were interested as part of Hacktoberfest, though that might span a bit more than that. You are welcome.
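Going back to the manual plugin release for a moment: below is a hedged Groovy sketch of a sanity check on the settings.xml downloaded from Artifactory, before attempting a release. It assumes the maven.jenkins-ci.org server id used by the jenkinsci parent POM; the file location and the exact checks are illustrative, not part of the documented procedure.

```groovy
import groovy.xml.XmlSlurper // Groovy 3+; XmlSlurper lives in groovy.util on older versions

// Parse the settings.xml downloaded from Artifactory (default Maven location assumed)
def settingsFile = new File(System.getProperty('user.home'), '.m2/settings.xml')
def settings = new XmlSlurper().parse(settingsFile)

// The jenkinsci parent POM deploys through the server id "maven.jenkins-ci.org"
def server = settings.servers.server.find { it.id.text() == 'maven.jenkins-ci.org' }
assert server : 'settings.xml has no <server> entry for maven.jenkins-ci.org'
assert server.username.text() && server.password.text() :
        'missing username or (encrypted) password for maven.jenkins-ci.org'
println 'settings.xml looks usable for a manual plugin release'
```

Running something like this before `mvn release:prepare` would catch the "can't even log into Artifactory" case early, instead of failing halfway through the release.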
Speaking of observability: we removed the Loki installation. Loki is a log collection system, in the style of Prometheus, that you can view through Grafana dashboards. It was installed on our main AKS cluster, publick8s, and it had been broken since September 2022, when we bumped the chart's major version from two to three. There have been a lot of changes, especially that Loki is now using operators by default for managing the Grafana connection, and it now runs as a highly available system with a pool of read and write nodes, which means we would have to set up shared storage and a lot of other things. Since it had been broken for two months, there was no reason to keep that system; it had already broken a few months earlier and we had already reset it, so we weren't using it. To be clear, it's not about metrics, it's about logs: we weren't using log collection from our systems, which means we currently don't have log collection and we weren't missing it. So it's a nice cleanup, and a foundation for the observability of our platform in the future. Any question about all these tasks? Did I forget one? I don't think so. Okay, let's move to the open tasks then. I'm starting from the left and I will try my best to keep the order.

Upgrade to Kubernetes 1.23. Stéphane, Hervé, can you give us an update? Okay, so we did the first three, if I remember correctly. DigitalOcean is done: the two DOKS clusters, doks and doks-public. That went mostly fine, if I remember correctly. Then we did EKS yesterday on Amazon, and that was not so easy: we had a few problems with the volumes and the CSI configuration. And we still have the Azure AKS to upgrade; that one will be a UI upgrade, but it's still to go. We can't hear you. Sorry. Do we have a proposed date and time for Azure, or, as I assume, have you not had time to plan it? We didn't plan it, no. Don't forget that we will have a long weekend, and I might not be available Friday. Just as a reminder, I don't mind you doing that when I'm off. I do mind! You're not doing anything tonight, are you? I've got plans tonight, sorry. I've got plans every night, I'm a party boy. Okay, thanks a lot for this work.

Another reminder: DigitalOcean is going to drop the 1.22 Kubernetes version from their system at the end of October, which means that next week our DigitalOcean clusters would have been on an unsupported version. That's the reason why we had to do that migration. Thanks a lot. Anything else on Kubernetes 1.23? Just a note: I will update the issue, but the upgrade broke the new artifact caching proxy on AWS. There has been an issue opened by Mark about the upcoming improvements; I will add it to the list. And that's all for me. So I assume we add this one to the next milestone, right? Oh, I forgot to create the milestone. It's a two-week milestone, don't forget when you create it. Yep: milestone, infra-team-sync, 2022-11-07. 08. Oh, 08, sorry. Yes. Okay, so let's add it to the next milestone. Do we agree that you're going to continue working on that? Yes.
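On the EKS volume problems mentioned above: Kubernetes 1.23 on EKS switches EBS volumes over to the CSI driver, so the aws-ebs-csi-driver add-on has to be in place around the upgrade. A minimal scripted-pipeline sketch of such a check, assuming an agent with the AWS CLI and kubectl configured; the cluster name "cik8s" is illustrative, and this is a guess at the cause, not a confirmed diagnosis from the meeting.

```groovy
node('linux') {
    stage('EKS 1.23 pre-flight') {
        // EBS volumes rely on the CSI driver from 1.23 on, so the add-on must exist
        sh 'aws eks describe-addon --cluster-name cik8s --addon-name aws-ebs-csi-driver'
        // After the upgrade, confirm every node reports the expected kubelet version
        sh 'kubectl get nodes -o wide'
    }
}
```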
Just a note: the task list here has been updated with everything, all the issues and challenges we had. I'm particularly proud of your work there, folks, because I'm sure next time it will be clearly improved. Thanks for that.

Okay, next item: artifact downloads failed on agents using the repo cache. So that's the one that was caused by the upgrade. In part. In part? Ah, interesting. So, Mark opened an issue because the builds of many plugins that had activated the artifact caching proxy failed. The ones pointed at the AWS proxy, which was down, failed directly. As a temporary measure, we removed AWS from the available providers in the pipeline library function, the buildPlugin function.

Now, to fix that, we have several things to put in place. The first one: there is a pull request open to be able to define the available proxy providers with a global environment variable defined on the Jenkins controller. So, for ci.jenkins.io, we could easily disable or enable one of the providers if we have an issue or maintenance on them, dynamically, instead of modifying the hardcoded value in the pipeline library. I will add a link to this. Cool; that was static code in the pipeline library, so thanks for that one.

You said partially? Yeah, because since Friday I've also noticed that I had to restart some builds because they failed: they received 504 errors from the proxy. I didn't have the time yet to activate the Datadog metrics and log collection I've added to the artifact caching proxy Helm chart. Okay, noted: enable Datadog logs and metrics on ACP. Cool. So one part will be to check what causes those errors. And then, in the pipeline library, I intend to implement a health check on the proxy. For that I've merged a pull request to expose the health path at an additional address on the NGINX proxy, without the basic authentication, so we don't need to have the credentials in the pipeline library, as they are managed by the Config File Provider plugin. Okay. And then I'll implement the fallback: checking whether the proxy is responding, and if not, using JFrog. Cool. May I ask you to transcribe these elements, once you're finished, onto the issue that Mark opened? Just so we have notes from the meeting and actionable items on the helpdesk issue.

Also, something we discussed together, and I'm adding it: maybe we could see if it would work to have two replicas of each ACP instance, so we'd have replication of ACP. The advantage: the fallback is the answer when the whole service is broken, while replication is a protection once a build has started and is using the proxy. Leaving aside the 504 errors, which I assume are only temporary: if a build is running and, at the same time, someone starts maintenance on the Kubernetes cluster where the current ACP is hosted, you want a load balancer so that if one instance goes away, traffic shifts to the second one. So the advantage is better, higher availability. Potential issue: since each replica will use a different caching volume, we may not get exactly the same result depending on which instance you are sent to. That will be a fine balance to find, but it's better to have higher availability for now, and if it's a mistake, then we can go back to the initial choice. Any question? Thanks for the work and the summary. Thanks also, Stéphane, for the support on that area; the three of us were barely enough. That was a lot of help, so thanks a lot.
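A hedged sketch of the two changes described above, as they could look inside the pipeline library. The environment variable name, the /health path, and the proxy URLs are hypothetical; the actual pull requests may use different names.

```groovy
// Providers come from a global variable set on the controller, so one of them can be
// disabled during an outage or maintenance without touching the library code
def availableProviders() {
    String raw = env.ARTIFACT_CACHING_PROXY_AVAILABLE_PROVIDERS ?: 'azure,do' // 'aws' currently disabled
    return raw.split(',')*.trim()
}

// Health check with fallback: probe the (hypothetical) unauthenticated health endpoint
// exposed by the NGINX proxy, and fall back to JFrog when the proxy is unhealthy
def mavenRepositoryUrl(String provider) {
    String proxy = "https://repo.${provider}.jenkins.io" // illustrative address
    String status = sh(
        script: "curl -s -o /dev/null -w '%{http_code}' ${proxy}/health || true",
        returnStdout: true
    ).trim()
    return status == '200' ? proxy : 'https://repo.jenkins-ci.org/public/'
}
```

The `|| true` matters: without it, a connection failure would abort the step instead of triggering the fallback.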
Okay, so I'm moving that one to the next milestone. Oh, I closed the issue.

The next one is the update center 404. I did not receive an answer, so I'm going to extend my question on that topic. The subject: since the last LTS release, someone has an issue with one view of the update center, a file I didn't know about, named "dynamic". I can't remember whether it was unsupported or not; I'm not sure about the status of this one. So I'm removing myself from the assignees, just in case someone has some time to answer there. But I will try to ask further and find documentation on that topic. The person asked me directly, by name; that's why I'm removing myself, so that they see that when they ask for help by name, someone can say "I'm not available", even though I do intend to work on that one. The main thing is that we need to evaluate the criticality of this one, because maybe it's something really critical. It doesn't look like it, but I wasn't able to get more information, so we have to continue working on this one. Any question? Next one: nothing done, not enough time. Next one.

So we had someone having issues; I think we can close this one. That person is using a product named Red Hat Satellite, which looks like a kind of crawler that creates local mirrors of yum repositories, local inside the organization at least. That system was failing, and we're not entirely sure why. At least partly because of the mirrors: when downloading the packages, it was first getting the package index from pkg.jenkins.io, which worked as expected, and then for each version it was downloading a copy of the RPM file. Each download, of course, was redirected to one of our mirrors. Depending on the item, the mirror is not always the same, especially because old items are removed from most of the mirrors except a few. So some of the mirrors were still serving them, some weren't, and some of the mirrors were not on their firewall allow-list, which broke the communication when downloading the package files. It also seems that the metadata was checked against each package's final URL, which looks weird, but I don't know the product, and it looks like they found a way to say "always use the main URL for metadata", because metadata is not cached on our mirrors.

The metadata is cached on the pkg machine, which is also served to people through the Fastly CDN. Why don't we mirror these files? Because we need to be able to invalidate the cache if we have a security issue that requires invalidating a package, which is a hard requirement. We can invalidate Fastly, and that's automatic: each time we change a package, the last step of our package-building process is to invalidate the Fastly cache. We control it, and it's quick and efficient. With mirrors, we cannot control the frequency at which they update. We could still add the metadata to the mirrors, because we own the mirror redirector, which has a hash of each given file; if the metadata changes, the hash changes, so the system would be able to stop sending requests to the mirrors that are not up to date. But this might still be an issue if someone decides to pin one specific mirror: they wouldn't have the guarantee that it's the correct file, and there's no proof that people aren't doing that. The reason we can serve the packages themselves from mirrors is that those files never change and we have a checksum for them.
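That checksum property is what makes the packages, unlike the metadata, safe to mirror: anyone (and the redirector itself) can verify a download against the published hash. A plain Groovy sketch; the URLs are illustrative, assuming the .sha256 files published next to the artifacts on get.jenkins.io.

```groovy
import java.security.MessageDigest

// Download an artifact (possibly served by any mirror) and its published checksum
String artifact = 'https://get.jenkins.io/war-stable/latest/jenkins.war'
byte[] data = new URL(artifact).bytes
String expected = new URL("${artifact}.sha256").text.split(/\s+/)[0]

// Recompute SHA-256 locally; a stale or tampered mirror copy will not match
String actual = MessageDigest.getInstance('SHA-256').digest(data).encodeHex().toString()
assert actual == expected : 'mirror served a stale or corrupted file'
```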
So I'm going to close that issue after the meeting with a message. It sounds like it can be closed because, unless you understood something else, it looks like the user was able to fix everything on their end, right? Yes, it sounds like that. I will update it afterwards. To be closed.

Reintroduce artifact caching proxy: Hervé gave us a nice redefinition, right? We have the next steps. Let me clear the milestone. So Hervé, you continue working on this one; let me add a reference to Mark's issue. It looks good. Is there anything else to add about the artifact caching proxy? No? Let's continue.

Another issue that I assumed should be closed. Oh no. Someone complained about one of our mirrors being too slow. I asked that person to add the ?mirrorlist query string, which shows you in your web browser what the mirrors are and which one is selected in your case, because you might not get the same result as that person depending on where you are in the world: the mirroring uses a GeoIP database. In that case, the person is in India, and the closest mirror, geographically speaking, is in China. It may be closer, but it's clearly slower. So, in the short term, we unblocked the user; it's not critical. They can use the other mirrors' URLs, which were clearly faster, so that person was able to download at an acceptable rate. The thing is, we need people to provide mirrors, right? India: we could create one on DigitalOcean, since we have new credits. And we also had an issue with people asking for information; that's another upcoming task. They have everything ready to set up a mirror. I don't know if they are going to answer, because the last three people asking for that never answered back. So I propose that we close the previous issue in favor of that one. I got a thumbs-up on the last one. Okay, that's good news; so they acknowledged your answer, I'm happy with that. So, closing in favor of the other one; another issue to close.

Next issue: archive the Jira components. I missed this one, so I have no idea what it's about. Yeah, I failed to ask you if you had the permission. If not, we can do it. Yes. Okay, no problem at all, thanks. So then, if it's okay, I will ask someone to bear with me on this one, because it requires Jira admin. Otherwise, I'm sure Tim, Mark, and Daniel have admin permission; I'm not the only one, in case some day I'm not available. I propose we ping the Jira admins on this issue. Yes, good idea. But we can still do it with Alex. Thanks a lot.

Next issue: the Windows ACI agents. The Docker Windows agents with the JDK were broken last week when we deployed a new version of the images: Git was absent. It was absent from the PATH because I messed up one of the changes. So the root cause was, I think I mentioned it, we did some change and it broke the image, right? The root cause was you! Yeah, but that's a little unfair, because you just changed the base image in the Dockerfile, and the new one doesn't have Git within the PATH. We changed the base image, and we ended up with an incorrect PATH, as mentioned. We need some kind of acceptance test on this one, or at least a functional test.
So that's why I'm keeping that issue open: I would like to add, in the build process for this image, a step that says "now that the image is built, run these commands" to validate a set of minimal expectations we could have from any Docker inbound agent: executing git (Git should be there), Java should be there, Maven should be there, and the default Java version should be the expected one, 11 or 8, or 17 in the future (a sketch follows after this item). So this issue is still open, and I'm keeping it also for an upcoming thing; I'm adding it to the next steps. Stéphane, you mentioned you could be interested in testing Docker: is it okay if we pair on this one? Yes. I will give you some pointers, and you should be free to try it on your own. So sorry again for the inconvenience; the main reason was us updating the Git version. Okay.

Next: the Jenkins mirror, the one you mentioned. The requester acknowledged, and we are waiting for them, so since there is no action on our side, I propose that we remove the Jenkins mirror one from the milestone and wait for them to give us insights. Any objection? Okay.

"Unable to get email for the password reset": I missed this one, sorry for that. Oh no, that's the one that has been open for two weeks. So I propose, as we said last week, that we close it: no feedback from the user. Looks good? Yes. Cool, to be done afterwards: to be closed, no feedback from the user. Okay.

Next one: the pipeline steps doc generator and the backend extension indexer. I was able to add the required tool on infra.ci and successfully tested one of the two. The second one failed with a Java out-of-memory exception, related to the JDK 11 version we used. That's my best guess, because it looks a lot like one of the issues fixed in the latest Temurin 11. So the next step for me is trying with the new agent, because it takes three hours. And instead of archiving the files (that's the current status, because it's done on ci.jenkins.io, which made them public), I will have to switch from archiving to publishing: we have a special pipeline library function that takes care of publishing to the reports.jenkins.io web server, which is publicly available. So that's a one-line change. Then I will retrigger the builds and that should be okay. So I'm adding it to the next milestone. It's a kind of background task for me: since it's a three-to-five-hour build, I cannot spend all my time on this one, unless someone is interested. The next step after that, as a reminder, will be moving the build of jenkins.io to use this new reports.jenkins.io URL; then ci.jenkins.io won't be required for the jenkins.io website, and we will be able to move the jenkins.io website generation away from trusted.ci, at least the generation part, and move it to infra.ci, to fix it and keep it updated. So, noted as to-dos for this one: publish artifacts on reports.jenkins.io, and try the new JDK 11 to avoid the OOM.

ci.jenkins.io job stories: there is no linked pull request. I didn't have time to spend on this one, but we were able to outline what is expected for it. Depending on the user, we want ci.jenkins.io and infra.ci.jenkins.io to do different things. So now the next step is Jenkinsfile writing: we have two Jenkinsfiles, two pipelines, that we need to configure, if anyone is interested in working on that one. If not, I will take care of it in the background as well. Okay, so I'm moving it to the next milestone. Noted: requirements defined; implement pipeline writing.
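Here is the sketch of the image acceptance test described at the top of this item, as an extra stage after the image build. A minimal sketch only: the image tag is hypothetical, and the real Windows images would run the equivalent bat steps on a Windows Docker host.

```groovy
node('docker') {
    String image = 'jenkins/agent:jdk11-example' // hypothetical tag of the freshly built image
    stage('Smoke test') {
        // Every expected tool must be present on the PATH, otherwise the stage fails,
        // catching regressions like the missing Git from the base-image change
        sh "docker run --rm ${image} git --version"
        sh "docker run --rm ${image} mvn --version"
        // The default JDK must be the expected major version (11 or 8 today, 17 later)
        String javaVersion = sh(script: "docker run --rm ${image} java -version 2>&1",
                                returnStdout: true)
        if (!(javaVersion =~ /version "11\./)) {
            error "Unexpected default JDK: ${javaVersion}"
        }
    }
}
```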
Realign the ci.jenkins.io organization: spoiler alert, nothing done this week. Sorry for that, we didn't have time.

"Windows agents are so slow": for that one, we have updated the ACI agents. We discovered an issue with the data volumes: the cache used for Maven wasn't on the data volume, it was written directly on the container's layered filesystem, whose IO performance is really poor. So that might be a partial improvement. The next step is to check the status of the Windows virtual machines that we create as agents on both Amazon and Azure, and to check whether IO system caching is enabled on Windows. It sounds like it should be a Windows command line to enable something, but it might be cloud-related, so we have to check for both. If it's Windows-only, it's one line to add to the Packer image provisioning shell script; if it's cloud-related, then we have to define, for each cloud, which kind of instance and which option to use when we create the virtual machines. I propose to move it to infra-team-sync-next: I don't think we will have time to work on this one. Please feel free to add a message and we can pull it back in if we start working on it. I'm removing myself from the assignees. Any question? Back to the backlog, unless we find time to work on it.

"Keycloak performance horrific when logging in": Stéphane and I planned to work on it today, but we were fixing the EKS issues with the team, so that work has been delayed. I propose that we move it to the next milestone and find time to work on it after finishing the AKS upgrade. Is that okay for you? Yes. That one doesn't strictly require me: there is a first step of retrieving information and preparing the plan, and as soon as Mark or Hervé or someone other than you, Stéphane, is able to validate the plan (because we always pair), you can proceed. You already did that once; it's quite easy. I'll do my best. I mean, there is absolutely no need to wait for anyone for the part about creating the new database with Terraform on Azure: you can already do it, and you can start getting the dump of the current one. That part is completely asynchronous. The synchronous part, the one that requires the whole team to be aware, is the day we stop Keycloak: we migrate the data, stop the database on Amazon, switch the Keycloak deployment to the new database, and see if it works.

Jenkins releases, RSS to Twitter: what's the status on this one? I have created the Helm chart to host this application on publick8s. I have to fix it, and then I will be able to deploy it on the publick8s cluster; I have prepared a PR for that. Cool, nice, almost there then, thanks a lot. So we can move it to the upcoming milestone; is that okay for you? Yes. Okay.

Now, new issues that we received recently; I'm taking the most recent one first. "Add a reminder about the Jira notification email address": I totally missed that one. It didn't come to us as a maintenance request; it's only triage that I'm doing here. Add a reminder in the account app about the Jira notification email address. Okay, so that's a kind of feature request for the account app, right? Yep. I think Daniel opened it here for more visibility, because not many people are following the account app repository. Okay. So I removed the triage label, but I don't think... I mean, it's not a priority for us. Better to spend our time on moving away from the account app, right? Yep. Okay.
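For reference, a hedged sketch of the synchronous Keycloak cutover discussed above. It assumes the Keycloak database is PostgreSQL (not confirmed in the meeting) and that OLD_DB_URL and NEW_DB_URL are connection strings provided as credentials or environment variables; step names and tooling are illustrative.

```groovy
node('linux') {
    stage('Migrate Keycloak database') {
        // 1. Keycloak is stopped at this point, so the dump is consistent
        sh 'pg_dump --format=custom --dbname="$OLD_DB_URL" --file=keycloak.dump'
        // 2. Load the dump into the new database created with Terraform on Azure
        sh 'pg_restore --clean --if-exists --dbname="$NEW_DB_URL" keycloak.dump'
        // 3. Next: switch the Keycloak deployment to NEW_DB_URL, restart it, verify
        //    logins, then decommission the Amazon database
    }
}
```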
Next issue: artifact download failed. That one is already tracked. Next: a password reset. Okay, I'm adding that to the milestone just to keep track of it, but I think it will be the same as usual: I will ask the person to provide more information and prove who they are, unless someone wants to take care of it.

A new idea that came up during the previous item; I'm just thinking out loud: we might want to add an issue template for account recovery, with a checkbox checklist ("I have done this", "I have provided that", everything), because we quite frequently get this kind of issue and we ask the same things every time. Yep, good idea. No action for the infra team for now; it's for account recovery, correct. So let me open an issue, to be added to the new milestone. Noted: add an issue for an account password recovery template, as per Hervé's idea. Okay.

An issue that came up during the last Platform SIG meeting: we don't have PowerPC 64 machines anymore. Cleanup has been started on the official Docker image. We could have an alternative solution in the future, but nothing in the upcoming weeks. So the goal for us is to clean up any mention of the PPC64 systems in our infrastructure. That one is going to be added to the next milestone; I think anyone is absolutely capable of doing it. Can you assign me, please? Yes, we spoke about that, I just forgot. No problem. So I created the issue to track it, and I have listed what has to be done. I might have missed something, but at least these elements will be interesting: we have an agent definition on ci.jenkins.io to remove; the associated credential has to be removed manually in the UI; we have to update the tool definitions and the automatic updates, because the tool definition for the JDK says "if it's Linux, use this one; if it's PPC, use this one", so we have to remove that case; and updatecli, when checking for new JDK versions, has to check whether the PPC build is available before proposing a pull request for updating. The goal is for updates to be the same for everyone. We had documentation too, I assume, and then, afterwards, any other mention.

Do we have other new issues? That one is going to be closed: we created a proposal to add external-dns to the Kubernetes clusters, with an ingress, so that when upgrading load balancers the DNS for the artifact caching proxy would be kept up to date automatically. I'm adding that one to the current milestone because I need to report on the experimentation I did with it today: some things failed, some are working. It's a kind of brainstorming issue; once I have my report, I will remove it from the milestone, unless someone wants to work on it. The goal is to always have DNS that updates itself as soon as possible when there is a new load balancer public IP. Would that be useful for the case where we choose an elastic external IP that we keep the same? That could be a solution. Okay, noted: report to be updated with the IP experiment for the DNS update.

Do we have other issues there? The Jira components one, that one is removed. Okay. infra-team-sync-next: nothing there. I think that's all. Yep, nothing else for me. Is there anything else for you? Okay. Thank you, thanks folks. So see you in two weeks. Thank you. Bye-bye. Bye-bye.