Okay. Hello, everyone. Welcome to the Jenkins weekly infrastructure team meeting. We are 11 April 2023. Around the table today you have myself, Damien Duportal, Hervé Le Meur, Mark Waite, Stéphane Merle, and I see that Kevin Martens has just joined us. Hello, Kevin. Okay, so you should have the notes; they have been shared on the Jenkins-infra IRC channel as well. Let's get started with announcements. The weekly 2.400 is okay, it has been done, and the release of this new WAR went well. We had a hiccup during the packaging; restarting the build fixed the issue. I've just triggered the container images, so expect the release to be finished, with the last backlog items, in the coming hours. A note about that: I think it's worth opening an issue about the problem seen during the packaging step. I'm opening the jenkinsci/packaging repository, which has the scripts for the different kinds of packages. We are looking at the script in charge of publishing the WAR once it is generated. Its goal is to retrieve the generated official WAR of the release from JFrog, and then copy the WAR file to the different virtual machines acting as mirrors or references for us. One of the final steps is to upload the website for the WAR, which includes a few HTML files. It happens in two instructions. The first instruction copies from the temporary directory, which is the variable $D on my screen, to the official WAR directory: that's a copy from a local directory to another local directory. The second instruction, rsync, does exactly the same, except it runs against the remote pkg server machine; that's why rsync uses the SSH protocol for the second instruction. This time it's the first instruction, from local to local, that failed: a permission denied followed by a connection timeout while running rsync. That's the step where rsync has created the footer and header HTML as temporary files inside the directory and is trying to move them to overwrite the existing files. That step fails with a weird and unexpected error message. If you retry the build, that just fixes the issue. The reason is that the variable points to a mount point behind which there is a blob storage account, which is kind of the same thing as S3, but for Azure. It's not a POSIX-compliant system: it's object storage, not a file system. We use Kubernetes with a CSI driver, which converts requests and gives the impression you are browsing a directory, while in fact it's sending requests to a remote HTTP server. It's not fully POSIX. That driver uses the dreaded CIFS system from Microsoft, which may work or might not, but what is sure is that that CIFS implementation is not POSIX. So rsync here tried to run a system call which is POSIX, and the implementation seems to panic: not only do we get a permission denied error, which makes no sense in that case, the permissions are fully 777 on that share, but it also says it timed out, and that's the weird one. A timeout while writing a file is something we haven't seen in years, right? So yeah, I think it's worth an issue explaining that we have to retry when that happens, because it happens from time to time. Long term, we have to fix the issue by replacing rsync here with azcopy or a similar Azure Blob copy tool, which should work exactly like rsync except that, instead of copying from local to what looks like a local directory, it will send the files directly to the storage system. Is there any question about that topic? Okay, it's on me to open the issue and to link it from the release helpdesk issue, so we can eventually update it there. So: release, packages, incoming Docker image. Do you have another announcement, folks? No? Yes, yes, we've got one: it's already been announced.
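Until that azcopy replacement lands, the short-term workaround is simply to retry the build. A minimal sketch of what an automated retry around a flaky copy step could look like (the function name, retry counts, and the placeholder command are hypothetical, not the actual packaging script):

```python
import subprocess
import time

def run_with_retries(cmd, attempts=3, delay=5):
    """Run a command, retrying on failure.

    Intended to paper over transient 'permission denied' / timeout
    errors seen on the non-POSIX blob-backed mount.
    """
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        print(f"attempt {attempt} failed (rc={result.returncode})")
        if attempt < attempts:
            time.sleep(delay)
    raise RuntimeError(f"command failed after {attempts} attempts: {cmd}")

# In the real script this would wrap the local cp/mv step;
# here a placeholder command that succeeds immediately.
run_with_retries(["true"])
```

The same wrapper would work around the rsync step as well, since both hit the same blob-backed mount.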
No, it's in the upcoming calendar section. A security advisory will be published tomorrow for Jenkins plugins, as disclosed by the security team earlier today. Yep, I see: we have an official image, security advisory tomorrow. Right. So let's look at the upcoming calendar. Next week we can expect Jenkins 2.401; next week will be the 18th, right? Yes. Correct. I don't remember when the next LTS is. Good question. It's four weeks since the last one, so it will be May 3rd: 2.387.3. And the baseline selection for the next LTS should happen, ah yes, by next Wednesday, so 2.400 or 2.401 are the likely candidates. Next Friday, did you say? Next Wednesday, the 19th, 19 April. Okay. So the advisory tomorrow is plugins only, which means we can keep working on Kubernetes; we just have to be sure that trusted.ci and the pkg machine are okay during the process. Well, and in general, during the publication period it's best if Kevin and I do not merge anything to the Jenkins documentation site, so we don't disrupt the security team. They usually adapt quite well, but if we can avoid merging, they will certainly be grateful. Okay. Next, major events: tomorrow, Devoxx France in Paris. Hervé will be there if you want to meet. Yes. Thank you, Hervé. cdCon is in May, May 8th and 9th, and I'll be there. 8, 9. And Alex Brandes will be there as well. Oh, nice. We don't have, I don't think we have a, yeah, exactly: we'll have one in the community Discourse. Another major event coming? Well, I've got one; this isn't a calendar item, maybe it's back to announcements. I forgot to make a fun announcement. The plugin site now has health scores displayed at the top level of plugin pages, so you get a rapid indication that you should not use the GitHub Organization Folder plugin, because it has a 65 out of 100 score. Yeah, the plugin health score is now visible on the plugin site. And yes, there are more improvements coming, but the fact is it's already visible.
Thanks to a Google Summer of Code candidate who wanted to continue contributing, even though it wasn't in the GSoC program. Exactly. Exactly. As Hervé said, even though the project for this work was dropped from the Google Summer of Code list, the contributor said "I want to do it anyway", went ahead and did it, and it's working very nicely. That's really impressive. Thanks for that. Is it "announcement" or "announcment", without the E after the C? E after the C. So the one on line two is the correct spelling. Yeah, there we go. No, that's got to be a French word. Is it not a French word? It ends with M-E-N-T, it's got to be a French word, isn't it? "Announce". No? Okay. No, no. Forgive my linguistic boundaries. Continue. So let's get started with the tasks we were able to finish. First of all, thanks Hervé for contacting DigitalOcean and ensuring with them that they continue sponsoring us at least until the end of the one-year sponsorship cycle. They gave us enough credits to continue at the defined rates, which clearly covers the month of March where we overused DigitalOcean due to the AWS issues. So thanks Hervé. That's really good news, and it's also a great opportunity to see if we can continue the sponsorship. We've closed the issue in the helpdesk because that issue was only about infrastructure tracking, but we expect, stop me if I'm incorrect, a blog post to thank them, because it's really nice of them to help us in that area and to be so quick about it. Is that correct, Hervé? Kevin? Cool, thanks. Next: out of space on a ci.jenkins.io agent during a BOM build. Thanks Hervé: thanks to your work we were able to ensure that every agent now mounts /tmp and the default Maven .m2 repository folder as emptyDir volumes, even if the default .m2 location is not always used by the Maven builds, because we specify another one.
An emptyDir is a directory mounted directly from the virtual machine hosting the pod's containers, as opposed to writing by default inside the container file system, which is terrible: if you write to /home/jenkins, for instance, that goes to a low-performance file system. Initially we wanted to mount /home/jenkins itself, but Kubernetes doesn't behave like Docker here: when we mount an emptyDir over a directory inside a container, well, emptyDir is not named "empty" for nothing, it empties the directory. Docker, by contrast, when you mount a data volume, copies the data that was in the initial directory of the image into the mount point, like you would have on Linux. So we decided to at least define the .m2 repository mount, which can be used accidentally by other builds, but not to mount /home/jenkins, because it contains required files that we baked into the image. Along with that, Hervé was able to measure, let's say, the worst-case situation: a single BOM build was generating 22 gigabytes inside that emptyDir. Since we run three pods at the same time, and since the emptyDir is cleaned up once the pod is stopped, we were able to say that instead of 200 gigabytes per machine on AWS, we can decrease to 90 gigabytes. That should allow us to save some bucks; it's not a lot, but it's worth not over-provisioning. On DigitalOcean it's a bit different: 200 gigabytes is the default disk of the machine and we cannot decrease it, so we have way more space on the DigitalOcean Kubernetes nodes than on AWS. So that issue about running out of space during the BOM build is definitely closed, thanks Hervé. As usual, if you see anything related to disk usage for the BOM builds on ci.jenkins.io, or for any plugin builds, both running in containers, please open a helpdesk issue; it might or might not be related to this.
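As a rough illustration (a simplified pod spec, not our actual agent definition; names and image are made up), mounting those two paths as emptyDir volumes looks like this:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: maven-agent-example      # hypothetical name
spec:
  containers:
    - name: maven
      image: jenkins/inbound-agent   # illustrative image
      volumeMounts:
        # Both paths land on the node's disk instead of the
        # container overlay filesystem.
        - name: tmp
          mountPath: /tmp
        - name: m2-repo
          mountPath: /home/jenkins/.m2
  volumes:
    - name: tmp
      emptyDir: {}
    - name: m2-repo
      emptyDir: {}
```

Remember that an emptyDir starts empty, which is exactly why /home/jenkins itself is not mounted this way: it would wipe the files baked into the image.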
Next: problem fetching an artifact from a third-party repository. Hervé was able to fix the issue for those users. It's the second or third time that we have users building a plugin that uses artifacts from a repository which is not ours, so we need to add an exception. That was also the opportunity for Hervé to start a script that checks, across all the plugins, that kind of usage. It's local right now, but the goal is to identify the repositories that we don't have mirrored on JFrog and that could, or could not, be used by plugins. I understand it's still an early step, but the nice discovery is that the yq command line can process pom.xml, which is really useful when you want to make simple queries like this one. Hervé now expects from me a method to get the list of the mirror repositories that we have on JFrog; that's an API call that anyone can do, it doesn't require authentication, but I need to share it, and we will continue in an upcoming issue. The goal is to check with the Jenkins security team whether each of these repositories is acceptable: should we mirror them, or keep the exceptions? Note that it's not a problem to have the exceptions in the settings, because the goal of the ACP (artifact caching proxy) is to decrease the bandwidth from our JFrog instance. If we have exceptions like this one, it means our agents connect directly to the other repositories; they don't consume through JFrog, so it's not a problem for the goal of the ACP itself. It is a problem to maintain the list of exceptions, though, because that can cause issues like this one. And that's also a point that could be discussed in the plugin health score area: should we score a plugin on this? Should we add a new score that says, hey, if you don't use the JFrog mirrored repositories, meaning the ones with infrastructure and Jenkins security analysis, then you lose a bit of score; or maybe a positive one: if you use only JFrog, then you increase your score.
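For anyone curious, here is a rough Python equivalent of that pom.xml inspection (the real script uses yq; the sample POM content and the function name here are made up for illustration):

```python
import xml.etree.ElementTree as ET

# Maven POMs declare this default XML namespace.
NS = {"m": "http://maven.apache.org/POM/4.0.0"}

def extra_repositories(pom_text):
    """Return the repository URLs declared in a pom.xml string."""
    root = ET.fromstring(pom_text)
    return [
        url.text.strip()
        for url in root.findall(".//m:repositories/m:repository/m:url", NS)
    ]

sample_pom = """<?xml version="1.0"?>
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <repositories>
    <repository>
      <id>example</id>
      <url>https://repo.example.org/maven2</url>
    </repository>
  </repositories>
</project>"""

print(extra_repositories(sample_pom))
```

Running that kind of extraction over every plugin's POM gives the list of non-mirrored repositories to review with the security team.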
I don't know exactly how it would work, but that's a discussion to have, because if you use external repositories that aren't scanned by the security team, that could create problems for your plugin. Did I miss anything, Hervé? So Hervé's tool could potentially also be used to explore scalability questions on the artifact caching proxy, or to prime the artifact caching proxy. I mean, what you're doing is creating a large test case for the artifact caching proxy, aren't you, Hervé? No: the way I retrieve this information is by querying the pom.xml of each plugin; it isn't directly related to the artifact caching proxy. But if it were executed behind the artifact caching proxy, you would cause the artifact caching proxy to be loaded with that content, wouldn't you? Yes, but the artifacts must first be mirrored in our JFrog instance for the artifact caching proxy to be able to cache them. Okay, so in a sense it would expand the amount of dependencies cached by the artifact caching proxy, as long as they are in the JFrog instance. I'm accustomed to using a Maven command to fill the caches, the dependency:go-offline goal. I may send that to you separately in case you want to try it: it uses Maven to do the parsing and then does a full recursion over all the dependencies. I'm interested, yes, thank you.
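A small sketch of how such a cache-priming invocation could be driven from a script (the paths and the function name are hypothetical; `dependency:go-offline` and `-Dmaven.repo.local` are real Maven options):

```python
def prime_cache_cmd(pom_dir, local_repo):
    """Build the Maven invocation that resolves every dependency of a
    project. Run behind the artifact caching proxy, the downloads it
    triggers warm the proxy's cache as a side effect."""
    return [
        "mvn",
        "-f", f"{pom_dir}/pom.xml",
        f"-Dmaven.repo.local={local_repo}",  # isolate the local cache
        "dependency:go-offline",             # resolve all deps recursively
    ]

print(prime_cache_cmd("/tmp/some-plugin", "/tmp/m2"))
```

The command list would then be passed to a process runner, once per plugin repository.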
I'll send you a link to it. Okay, repositories, yep. Next issue: we can't reset a Jira account password, that's an accounts one. Then: we were able to successfully renew the code signing certificate for Jenkins core, so congratulations to everyone, that was a huge team effort. We did it with the 2.400 version and with the latest Jenkins LTS, along with updated GPG keys, so now we know how to run the process. The expirations of both the GPG key and the code signing certificate are in three years, both of them, so we will change the two of them at the same time next time. There should soon be a post-mortem on what we could improve, including doing it six months in advance so we are sure we are not late: the goal is to avoid reaching the expiration date when we switch the keys. That's all for this topic; we still have an issue about updating the documentation that should be fixed soon. ci.jenkins.io disk full: thanks Stéphane for taking care of that huge one, which generated a lot of discussion, changes, and fixes. We had leftovers, like 60 gigabytes of old backups and such, we had 100 gigabytes of non-discarded build logs, and a lot of builds are storing a lot of archived artifacts on the file system. So we cleaned up everything we could. Everything has been done here, and the issue was closed because we were able to go below the 80 percent usage threshold; issues have been opened for all the follow-up fixes, so we'll come back to this later. We also had the same kind of issue on trusted.ci.jenkins.io, but that one wasn't because of the Jenkins home, rather the amount of Docker images accumulated for each LTS update over the past one or two years; fixed by removing these images. Thanks again Stéphane for taking care of monitoring and ensuring the platform works. We updated all of our controllers to the latest LTS version released last week, with the new code signing certificate; the new GPG key was also used for that LTS, so no more signing certificate issues.
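The cleanup hunting described above mostly boils down to finding the biggest directories under a Jenkins home. A minimal sketch of that kind of audit (the example path in the usage comment is hypothetical):

```python
import os

def dir_sizes(root):
    """Map each immediate subdirectory of `root` to its total size in bytes."""
    sizes = {}
    for entry in os.scandir(root):
        if not entry.is_dir(follow_symlinks=False):
            continue
        total = 0
        for dirpath, _dirnames, filenames in os.walk(entry.path):
            for name in filenames:
                try:
                    total += os.path.getsize(os.path.join(dirpath, name))
                except OSError:
                    pass  # file vanished mid-scan; ignore it
        sizes[entry.name] = total
    return sizes

def biggest(root, top=5):
    """Return the `top` largest subdirectories, biggest first."""
    return sorted(dir_sizes(root).items(), key=lambda kv: kv[1], reverse=True)[:top]

# Usage (hypothetical path):
# for name, size in biggest("/var/jenkins_home/jobs"):
#     print(f"{size / 1e9:6.1f} GB  {name}")
```

In practice a one-liner like `du -sh * | sort -rh | head` does the same job interactively; the script form is handy when you want to feed the numbers into monitoring.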
Congrats. Did I miss anything on the closed tasks? Okay, let's proceed: we have a lot of running issues, and new issues as well. First, let's realign. The Kanban rule for the issues we have here is: do we keep working on them, or do we postpone? I propose to postpone the repo.jenkins-ci.org permissions realignment mission; I haven't had time during the past three weeks to work on that topic. The HA (high availability) topic is held up on whether we enable authentication on the JFrog mirrors: right now we are waiting for a meeting with JFrog to make a status, especially about the amount of data that should not have been downloaded due to, let's say, abusive IPs. But it looks like it's a bit more complicated than that: as Mark underlined, we might have people using the mirror as a free mirror, so we might need to enable authentication. The upcoming week I won't have any time to work on this, so unless someone objects I will put it back to the infra team sync, next week and until we meet JFrog. Is that okay for you? Next: the Ubuntu 22.04 upgrade campaign. That went pretty well: Hervé and I were able to deliver this one for the agents, so now all the ci.jenkins.io agents are using Ubuntu 22.04. Everything went well, with one tiny exception: switching to Ubuntu 22.04 broke some Ansible test cases in the packaging when using the old Amazon Linux 2. That might be related to the systemd and cgroups updates: Ubuntu 22.04 ships cgroups version 2, which changed the way the control groups are run by the underlying container runtime, and it's not the only major upgrade. But thanks to Basil, the work has been done, especially bumping the Amazon Linux operating system version, which works very well under Ubuntu, and other JDK-related issues. So thanks Basil, and sorry for the breakage here, that was a tricky one. Now I propose we keep working on that Ubuntu 22.04 upgrade campaign; we have the following items being worked on. There is another issue that Stéphane is taking care of, about migrating trusted.ci to Azure. I understand,
Stéphane, that you proposed, I think the issue is here, 2486, for the trusted.ci machines. I understand, Stéphane, can you tell me if I'm wrong, that you propose to start new machines: for the three virtual machines currently running Ubuntu 18 on AWS, to start directly on Ubuntu 22 on Azure. Yes, that's what we're trying. Cool. So Hervé, you will have low bandwidth this week, so I don't expect you to spend time on Ubuntu 22. If it's okay for you, I plan to check and eventually upgrade the node groups that we have on our Kubernetes clusters: check the Ubuntu version, if any, on AKS, I think it's AKS, and eventually DigitalOcean. If I see that there is a possibility to upgrade the underlying node groups, I will start the operations during this week. Any objection on this one? No? Great. And eventually docker-openvpn; I'm sure this one uses Ubuntu as a base image. So these are the three next steps for this issue. Is that okay for you? We don't have to finish these three subtasks within the upcoming milestone, but the goal is to do a little bit every milestone, so I'm adding the new milestone unless someone objects. Okay, let's continue with the tasks. Document the code signing certificate renewal process: that one will migrate to the next milestone. The pull request is open, so I'm waiting for a review approval, and if everything goes well we merge it; worst case we have a few changes to do to the doc. That one automatically moved to the next milestone. Stéphane, about Azure ARM64, can you give us a status and let us know if you will be able to continue working on it during the next milestone? I hope to be able to, as background work. Okay. For now I'm stuck with silly problems that I don't quite understand, but I tried to open an issue with Packer, so I'm hoping to get some direction to follow. I'm stuck with arm64 versus amd64 images not being allowed to be used. Okay, so thanks for opening the issue. May I ask you to add a comment here to report, oh yeah, what kind of issue you met, and point
here to the issues you opened upstream, just to document it? Yes, you're right, thanks. As a reminder, the goal of ARM64 is to be able to get rid of the agent virtual machines on EC2 and to eventually start studying what we could run on ARM to decrease our costs, but that's secondary. Next issue: password email not coming through. We don't have access to the SendGrid email-sending server configured for accounts.jenkins.io, so we cannot check when an email doesn't reach a remote machine. So if it's okay for you, Hervé, I will comment on this issue. The goal is, now that we have access to the Mailgun account, at least for accounts.jenkins.io, and the amount of email is low so we should stay on the free tier as I understand it, that I create accounts for both of you, Stéphane and Hervé, and then Hervé should be able to update the configuration of accounts.jenkins.io to switch to Mailgun. We should then be able to work with that user and solve the upcoming issues. Is that okay for you, Hervé? So I'm co-assigning, and I will take care of commenting and reporting on that issue; you should be able to start working on it as soon as I've sent the Mailgun account. We have an issue about the artifact caching proxy being unreliable. There were two errors: one on the BOM builds running on DigitalOcean, so we should be able to check it again; the second was when trying to use other endpoints of the ACP, so we will have to diagnose a bit more. For the second case it looks like a lot of network errors, so there are some incoming issues; I will move it to next week and we'll continue diagnosing, because there isn't anything obvious. It's a low-level thing, especially in the network area. So, is anyone willing to take some time? By default I will take some time. One of the main actionables we have here is to change the network where the ci.jenkins.io agents running in Azure are spawned: the goal is to move them to a closer network than the
ACP server, and see whether the issues continue happening on Azure. For DigitalOcean it's a bit more subtle; we need to dig more. Any question? So I move this one to the next milestone and we continue working a bit on it. We have "add Launchable to agents": I understand Hervé volunteered for this one, so let's remove the triage label. The goal is to install the Launchable command line on our Packer images, at least to be sure that it's available already and doesn't need to be installed each time. That's at least for Linux; ideally, if you are able to install it also on Windows, that will help Basil a lot. I'm moving it to the next milestone. One check is whether we need Launchable on the webbuilder images for the website builds running on ci.jenkins.io; I propose it as a secondary objective. That's a question to raise, but I understood it was initially for the pipeline library, meaning for Maven calls, so let's see. Stéphane, thanks for opening that issue about the migration of trusted.ci.jenkins.io from AWS to Azure. There are three goals. The main goal is keeping control of our infrastructure by moving sensitive machines into clouds that any Jenkins infra team member can manage. The AWS account is still used and provided by CloudBees, which is very kind of them because they pay the bill, but that doesn't allow non-CloudBees employees to access the management of these machines. So the main point here is safety: by moving trusted.ci.jenkins.io and its associated machines, which are in charge of generating the update center, deploying jenkins.io, and some other trusted tasks, into a dedicated network on Azure virtual machines, we should be able to streamline the management. The secondary objective is migrating these machines to Ubuntu 22 LTS, which we referred to earlier, and the third objective is to try to decrease the AWS bill: it will help in that area because we should have some margin to pay for these machines on Azure. Thanks Stéphane. We have two
lists that look really good about the expected tasks. So, are you okay to work on this next milestone? Yes, with pleasure. Cool. An issue that is almost closable: there was a problem with the automatic renewal of the certificate for updates.jenkins.io and jenkins-ci.org. The certificate has been renewed, and we have an event in the calendar in two months to check the next renewal. The last step before closing is that we need to enable logging of the certbot renewal cron job to the syslog of the virtual machines; that's an option in the Puppet module that we use, instead of having the certbot renewal run in quiet mode, which doesn't help us diagnose what happened. Most probably the failure in automatic renewal came from the breakage I introduced last month when updating all the Python installations and certbot versions, but we don't really know: we don't have any logs that show the error. That's why we will need to be careful next time. Once we are sure that the certbot renewal command writes its result to syslog, we apply, we wait 24 hours, we check the syslog, and we should see certbot saying, hey, I've tried to renew the certificates and they are not about to expire. Once we see that, we close the issue. Clear for everyone? Yeah, thanks Stéphane for the help on this one. So that one moves to the next milestone. And now we have billing. Last month we exploded the cloud bills on all the services, all of them. The root causes are a drastic increase in the BOM builds, which are costing a lot, the ATH builds, which is the same, but also a lot of releases and a lot of bandwidth and downloads from the package mirrors and from the update center. We also seem to see consequences of this increase in builds in different areas. First of all, let me detail something that was discussed privately, because it was CloudBees-internal due to the AWS account; I've now published an excerpt of the discussion. The goal is to decrease what we
consume on AWS. Moving the trusted.ci virtual machines is one of these elements, and there is an issue with a lot of details on the short-term levers. For this milestone we have work in progress on cleaning up the snapshots created by Packer, which should save almost a thousand dollars per month once finished. We have work on trying to optimize the BOM builds: for this milestone the goal is to split the node pools between BOM builds and plugin builds, so we will be able to check the CPU and memory usage and see whether we can optimize the packing of pods, or maybe move the workload to other clouds. Finally, we have migrating updates.jenkins.io: the Apache server serving the update center index costs between 3k and almost 6k dollars of outbound bandwidth per month on busy months. Given the relationship we have with DigitalOcean, the proposal we all agreed on privately, and that we can now start discussing, is to move that machine, as a first step, to DigitalOcean: we have a partnership with DigitalOcean that provides Intel machines, so we should be able to migrate it efficiently. We serve terabytes of data, we saw 30 terabytes per month of outbound bandwidth, and we don't have to pay for that outbound bandwidth on DigitalOcean. We wanted to use Oracle a few months ago, but the partnership with Oracle is still, it's not bad, but it's still a tiny partnership, so we prefer going to DigitalOcean right now because they are really, really at ease with us, and then we will see about extending that service in the future. Right now we could avoid spending 3 to 6k per month, so that should be a huge win. We have the trusted.ci migration that should also help us. So that's where we are: the proposal is we start with these elements and then iterate the week after. That one moves automatically to the next milestone; splitting that session will be a long-running issue, sorry for that, folks. Stéphane, one big
item for you that will move is the certificate for the update center generator, crawler, and update center. That one I can help with. I will need it, because I did the first part, and I need it to be signed by someone with the rights. Yep, I propose that we pair on this one, is that okay for you? I thought you were not in the list of people having the cert. I might be. That's fine, I might, but I propose we pair on it. It must be done by the end of May, but better to do it in April, no? Yeah. Okay for you? The sooner the better. Yes, cool. Finally, one last issue related to ci.jenkins.io and the disk-full we had: we saw a lot of outbound bandwidth. To summarize the discussion: we want to use the Azure Artifact Manager plugin, which will make ci.jenkins.io start archiving artifacts inside an Azure bucket. The goal is to reduce the pressure, in terms of I/O, on the data disk used by the controller, and to decrease the storage needs, because the archived artifacts will be in the bucket. Why would that help us on the outbound bandwidth? Because we should then be able to measure carefully, really precisely, what outbound bandwidth is caused by stash and unstash to the Kubernetes clusters on AWS and DigitalOcean, compared to the data downloaded directly through the web UI of ci.jenkins.io. That one requires first configuring the artifact manager; then we will have more discussions, and we have different options that we need to report on that issue. We had a lot of issues in triage mode, so let me open all the triage issues, because some will need to be done. The artifact manager to store archived artifacts, sorry, I'm going through the issues one by one, that one needs to be checked. We missed this one, so I am adding it here; that's an accounts one. And, Hervé, thanks for that work when you checked the disk issues: in order to ease our operations, we need to add labels or elements that will help us to immediately detect
which virtual machine and which metric we are looking at in Datadog. The Puppet module, thanks to Hervé's research, shows that we can use configuration variables to enable automatic AWS detection and to force some labels with the name of the machine or the service, which will help. That one I propose we move to the upcoming milestone, because it's one or two lines. Any objection? Agreed; I'm in the mood for doing a lot of Puppet and Hiera data. The next one is related to the ci.jenkins.io disk-full: the goal is to install the global build discarder plugin and add a global build-discarding policy by default, unless pipelines or job configurations say otherwise. That should help, because based on what we saw, some of the builds didn't set any. Having a global build discarder will help a lot in reducing the amount of storage. That one is also a plugin installation; any objection if we install it in the same transaction as the Azure Artifact Manager? Yeah, no, we have to take advantage of that. So that's a plugin, and then the configuration will happen after. For the Azure Artifact Manager, I accidentally added the outbound-costs one to the upcoming milestone, which I didn't intend, so let me fix that; that one is the direct actionable for this milestone, while 3485, this one, has no actionable until the other is done, so right now I'm moving it here. Artifact caching: do we have another? Yes, the disk space one, that's correct. This one can be done, but together with the other one; we don't need triage for it, it was already triaged. Change the "disk space is below one gigabyte" alert to 80 percent disk usage: that one is interesting but not a priority, and I don't feel we will have time this week. Is it okay for you? Won't it match with the other one for Datadog? That's not the same kind, absolutely not: the goal here is to change the query we make in Datadog, so that it alerts us when we reach 80 percent of disk usage, or one gigabyte left, because we have some disks that are really short in
size, so we need to improve the queries. I won't have time to spend on this one; I don't say it's not important, let's say it's not top priority. How do you feel about this one? No advice; I thought that would be an easy one. Take it if you want, as you prefer. Oh, it's not about taking it or not taking it, it's: is that something we can manage for next week? That's the point, because we really need it. I don't feel it's that easy. Yes, we need it, but it's, I'm assuming it's not that easy; most of the time when I think something is easy, it's not. Got it. So I've added it to the next milestone since Hervé volunteered, and let's see; no obligation to finish it for the upcoming milestone, especially with the low bandwidth you have. Sounds good for you, folks? Yeah. Okay, that's all for me. I will update the notes. Do you have other things you want to add to the next milestone? Signing autographs for everybody, but that's not an issue. Fair. Okay, so I'm stopping sharing my screen and stopping the recording. For the people watching us: see you next week.