Hello everyone, welcome to the Jenkins infrastructure meeting. Today is the 18th of October 2022. For this meeting we have myself, Damien Duportail, Hervé Le Meur, Stéphane Merle, and Bruno Verachten; Mark Waite is off. I think that's everyone. Right. OK, 1, 2, 3, 4 people. Let's go. Announcements. The weekly: the weekly release was successful. Let me just open a tab. OK, so, the questions. The stage is not all green, but [inaudible]. Are we absorbing the load? I hope it works well. [inaudible] The changelog, and I assume there are other items that need to be done. So, on the infrastructure side, there is no action required for the release, and Stéphane, you should be able to proceed with the Docker image for the weekly and deploy it through the Kubernetes management repository, if that's OK. Perfect, thanks. You should also have release-drafter pull requests on the Docker image, so don't hesitate to merge them if that's OK for you. It's a patch, as you remember. So, yes. We can also start checking things on release.ci if you need anything. Any question about the weekly release? OK. Announcements. Right now, the ACI agents (Windows containers) on ci.jenkins.io are broken. I was working on that: it's a change I made one or two hours ago. The goal of the change was to bump Maven to the latest version, because these two images were different, and to bump the JDKs, but we also changed some elements of the base image. It's currently broken and I don't know why yet. I'm diagnosing it, so I will fix it right after our meeting. That means jobs that require a Windows container agent might pile up in the queue.
One of the key issues here, which Hervé and I identified (thanks to James Nord for helping confirm it), is specific to Nano Server, a lightweight Windows container image built for containers, usually running a single process for a single user. Java has a system property named user.home. On Nano Server it is set to the C: drive instead of the user profile. The main reason is that some DLLs and libraries that are usually present on Windows Server Core, like on our virtual machine agents, are missing. We have some hacks for some languages such as Golang, where you take a DLL from Windows Server Core and copy it onto the Nano Server image, but that trick doesn't work with the JVM. I don't know enough about the underlying JVM issue. The consequence is that Maven was putting its local repository on the C: drive instead of in the user profile. That partially explains the slowness reported by some contributors, because the C:\.m2 directory is not a data volume: it's on the container's layered file system, which means reading and writing within that repository is slow as hell. It's not enough to explain all of the slowness, because some of it was on the virtual machines, but that part can clearly be improved. That was the reason to bump Maven on all the images, but it's currently broken; it should be fixed in a few hours. I will send an email to the mailing list. Reminder: tomorrow, and it's now official, there will be a security release. I will add it to the agenda. As far as I remember it's only plugins, but a lot of plugins. So please don't break anything tomorrow. I don't remember exactly; I think it's at the beginning of the afternoon in the European time zone. So yes, we can break things in the morning. Yes, but sometimes you need time to fix your breakage. But no worries, just check the channel. Everyone from the SRE team should have been invited, right? If it's not the case, I will add you.
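To illustrate the Nano Server issue described above, here is a minimal Python sketch of how Maven picks its local repository, simplified to three sources in precedence order: the `maven.repo.local` system property, `<localRepository>` in settings.xml, then the default `${user.home}/.m2/repository`. The function name and the `D:\m2-cache` data-volume path are hypothetical, and real Maven has more inputs than this model; the point is only that a `user.home` of `C:\` puts the repository on the container's layered file system, and that an explicit override is one way to move it.

```python
from pathlib import PureWindowsPath
from typing import Optional


def maven_local_repository(user_home: str,
                           repo_local_property: Optional[str] = None,
                           settings_local_repository: Optional[str] = None) -> PureWindowsPath:
    """Simplified model of where Maven puts its local repository on Windows."""
    # 1. An explicit -Dmaven.repo.local=... wins.
    if repo_local_property:
        return PureWindowsPath(repo_local_property)
    # 2. Otherwise <localRepository> from settings.xml, if set.
    if settings_local_repository:
        return PureWindowsPath(settings_local_repository)
    # 3. Otherwise the default, derived from the JVM's user.home property.
    return PureWindowsPath(user_home) / ".m2" / "repository"


# Windows Server Core VM: user.home is the user profile.
print(maven_local_repository("C:\\Users\\jenkins"))  # C:\Users\jenkins\.m2\repository
# Nano Server container: user.home resolved to the C: drive root.
print(maven_local_repository("C:\\"))  # C:\.m2\repository
# Hypothetical fix: force the repository onto a data volume.
print(maven_local_repository("C:\\", repo_local_property="D:\\m2-cache\\repository"))
```

In a pipeline, the equivalent of the last call would be passing `-Dmaven.repo.local=<path on a data volume>` to `mvn`.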
Challenge yourself whenever we have to deploy something to production: ask yourself if it's needed. Kubernetes is most probably OK, but Puppet might be more sensitive there. Was the channel for today? Yes, it was initially planned for today and has been delayed to tomorrow. I asked the question on Monday because I was also stressed out by this. So, upcoming calendar, unless you have another announcement. OK: next weekly is next Tuesday, which should be the 25th, right? Yes. Next LTS: I don't remember, but it's in November; we'll get it from the previous meeting notes, where it has been recorded. Next security release: tomorrow, 19th of October. Next major events: none. OK. Anything to say before we proceed to the work we already did? Nope, OK, let's proceed. Just a big thank you again for the system that created the notes for us. That's really cool. A lot of GitHub permission requests, so nothing special for us. You can see, based on the avatars of the people in charge of the issues, that Alex was really, really active during the past week. So thanks a lot, Alex. There has been some work on the IRC bot. It now has a CI integration, like what Hervé and Adrien did for plugin health scoring: the CI runs on ci.jenkins.io and is publicly available, while the real deployment is done on infra.ci once a release is tagged. In the case of the IRC bot, it seems they also worked a lot to clean up the repository. That was a long-outstanding task. Thanks Alex, the team, and Hervé for the effort there. I understand that we were able to deploy one or two new versions, because one was failing and the second one was a fix. So I don't think there are many more actions for the infra team. If there are, don't hesitate to raise an issue as usual. So the issues about adding the IRC bot repo to infra and its folder on CI were in the same area. One week ago, we had an issue with the Kubernetes pods on ci.jenkins.io that happened late in the day for Europeans.
The autoscaling process was broken on the current Amazon cluster. Short term, I removed the faulty cluster from the configuration for the night, and the next day I fixed the AWS autoscaling configuration and everything went back to normal. The outage lasted around one hour and a half. During that time, builds were not lost, just piling up while trying to allocate agents. Once I applied the fix, all the queued builds were processed in less than 15 minutes, with no more issues afterwards. So that was the work on the AWS pods. And finally, there was the request from Alex about the lockable resources plugin on the private infra. The goal was to remove it, but it appeared that we are using it somewhere, so better to close the issue and not act on this one. Any question on the closed tasks? Nope, OK. Let's move to the work in progress: we check the status of each task, then whether we can work on it next week. I'm going to follow the order from the meeting notes, not the order you see on my screen. So: lost access to publish releases for crowd-talk plugins. We had a request from a plugin developer. We discussed it: the developer had to follow the public instructions carefully. What happened is that using the Artifactory UI, the instructions work as expected; the documentation is not absolutely clear, but if you follow the wording really carefully, it works very well. Daniel, Mark, and I tested that over the past days. However, the contributor was using the curl command, whose goal is to retrieve the settings.xml with your encrypted password directly from Artifactory on the JFrog repository. The user complained about receiving a forbidden access error, which I was able to confirm. So earlier today we checked, and Daniel switched a permission: it sounds like one permission was removed a few weeks or months ago.
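For context on the curl-based retrieval mentioned above: `curl -u user:password` sends HTTP Basic credentials, and a 403 response means the server refused the request for that account, which matched the missing permission here. A minimal Python sketch of the same request with only the standard library, assuming a placeholder URL (the real endpoint is in the jenkins.io publishing documentation and is not reproduced here); the function names are hypothetical.

```python
import base64
import urllib.error
import urllib.request

# Placeholder: the real endpoint is documented on jenkins.io, not reproduced here.
SETTINGS_URL = "https://repo.example.org/path/to/settings.xml"


def build_settings_request(url: str, username: str, password: str) -> urllib.request.Request:
    """Build a GET request with HTTP Basic credentials, like `curl -u user:password`."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


def fetch_settings(url: str, username: str, password: str) -> str:
    """Download the file, turning a 403 into an explicit permission error."""
    request = build_settings_request(url, username, password)
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        if err.code == 403:
            # The server refused this account: check its permissions
            # (here, the cause was a permission removed on the server side).
            raise PermissionError(f"403 Forbidden for {username} on {url}") from err
        raise
```

Calling `fetch_settings(SETTINGS_URL, "<ldap-user>", "<password>")` against the real endpoint would reproduce what the contributor saw with curl.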
So no one except the JFrog administrators was able to retrieve it; the LDAP accounts weren't. I initially opened an issue on the jenkins.io website to either update or remove that paragraph. Daniel fixed that a few hours ago, and the user confirmed that they were unblocked to release their plugin. So the most pressing part has been done. We see there's a "could not transfer" error, so the user is still having issues. OK, so something is wrong, and we have to help the user even if it's not really our area. I propose that we keep that issue open and move it to the next milestone. I don't mind continuing to work on it, unless someone feels OK to do the settings.xml, a fresh new skill they learned two days ago. No? OK, I tried. I tried to tempt Hervé, but no. If no question, moving on: CI job status is not on the linked pull requests. I checked this one. infra.ci is used for the static websites: jenkins.io, stories.jenkins.io, and I think plugins.jenkins.io, at least the front-end part. We create and deploy previews for the websites, and some of them have a Docker image built from the master branch on infra.ci. The problem is external contributors: it's Hacktoberfest, and when external contributors open a pull request and the job fails, they get no feedback from the CI system. We were using GitHub checks: in a pull request, you have a Checks tab that shows what Jenkins sends back to GitHub. It appears that it's disabled by default, so if we want it, we need to set one Boolean to true in the configuration of the infra.ci jobs. I thought these jobs were enabled for GitHub checks, but they weren't. So I asked Kevin if GitHub checks are enough or if they really want a revamp following what has been done on the IRC bot or plugin health scoring: the full CI on ci.jenkins.io, where we have to re-enable the job correctly, with only the deployment part happening on infra.ci.
That's because the credentials are in a private controller. So, if it's OK, I will move this one to the next milestone. Depending on what we decide, I might need to ask Stéphane for help: I will point you to the correct settings to change, but I will need you to implement them so I can focus on something else. Depending on what it is, both will be interesting. Looks good. I keep the assignment on myself, then I will reassign it with details and contact you if I need your help. If it's only the GitHub checks, it's easy and I can ask you to pair if you want. If it's the other option, I will ask you for help. Any question? Next one, the biggest one: introduce an artifact caching proxy for ci.jenkins.io. Hervé? So, I have the artifact caching proxy working on the new public Kubernetes cluster. I'm now working on the Windows part of the buildPlugin function in the pipeline library. I have some issues getting the settings and settings-security for Maven working. OK. So it works on Linux, you have work in progress on the pipeline library for Windows, and you have issues with Maven on Windows around setting up the XML files and so on. Is that correct? Yes. OK. Good work, because it was a lot of work, especially the AWS cluster part. Nice, we are almost there. Question: how do you feel about splitting your pull request and delivering for Linux already, so we can start moving some parts to the caching proxy and start seeing the results before focusing on Windows? Reason being there are still some unknown parts. Yeah. I'll check whether the test plugin has a Windows build. It might not even have one, so it won't be a problem for this test. OK. The idea is maybe setting up the authentication only for Linux at first and seeing how it behaves in real life. The reason I'm asking is that the Windows part seems tricky: it involves some changes on the Windows image.
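As a rough illustration of what pointing Maven at the caching proxy involves, here is a minimal Python sketch that writes a settings.xml with one `<mirror>` and a matching `<server>` entry (Maven looks up the mirror's credentials through the shared `id`, and interpolates `${env.NAME}` values from the environment). The proxy URL, the `artifact-caching-proxy` id, and the environment variable names are hypothetical; this is not the pipeline library's actual implementation, and it leaves out settings-security.xml entirely.

```python
import xml.etree.ElementTree as ET
from typing import Optional

SETTINGS_NS = "http://maven.apache.org/SETTINGS/1.0.0"
ET.register_namespace("", SETTINGS_NS)  # serialize as the default namespace


def _child(parent: ET.Element, tag: str, text: Optional[str] = None) -> ET.Element:
    element = ET.SubElement(parent, f"{{{SETTINGS_NS}}}{tag}")
    if text is not None:
        element.text = text
    return element


def caching_proxy_settings(proxy_url: str,
                           mirror_id: str = "artifact-caching-proxy",
                           mirror_of: str = "*") -> str:
    """Return a minimal settings.xml sending matching repositories through a mirror."""
    settings = ET.Element(f"{{{SETTINGS_NS}}}settings")
    # Credentials are matched to the mirror through the shared id.
    server = _child(_child(settings, "servers"), "server")
    _child(server, "id", mirror_id)
    # Interpolated by Maven at build time (hypothetical variable names).
    _child(server, "username", "${env.ARTIFACT_CACHING_PROXY_USERNAME}")
    _child(server, "password", "${env.ARTIFACT_CACHING_PROXY_PASSWORD}")
    mirror = _child(_child(settings, "mirrors"), "mirror")
    _child(mirror, "id", mirror_id)
    _child(mirror, "mirrorOf", mirror_of)
    _child(mirror, "url", proxy_url)
    return ET.tostring(settings, encoding="unicode")


print(caching_proxy_settings("https://cache.example.org/maven"))
```

The build would then run with `mvn -s <generated file>`; `mirrorOf` set to `*` routes every repository through the proxy, which is the simplest choice for a first Linux-only trial.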
It behaves differently between Nano Server and Windows Server Core, and between virtual machines and containers. So, given the effort on Windows, it would be interesting to validate our hypothesis on Linux first. That would be a good first step. What do you think? Yes. I have to check where I want to test it; they may not even use a Windows build. Yeah, but even if they do, that only means they won't use the caching. My proposal: finish what you're doing today, take a bit of time to see if there is an easy way to get it working on Windows, and if you can't find one today, tomorrow you split the pull request. I understand that it adds some overhead, because splitting the pull request requires additional effort. But the goal is to have something visible that we can show to JFrog as soon as possible. Even if the decrease is minor, we really need to ensure that the caching proxy works as expected and is not a blocker for our users. As a reminder, the ACI announcement was caused by the work done by Hervé, which allowed us to discover an issue with the Nano Server image. Deliver the Linux part already. Any question on that topic? So, we had a meeting with JFrog last week. Just a reminder, I will add it to the notes, but they are OK with the caching. They say the caching is important and will help us a lot. It won't be the most efficient effort, but it seems OK for them for that part. Makes sense. And it improves the QoS for us. So I'm adding that to the next milestone. Reminder: if you are feeling bored with that task because it might be too much, don't hesitate to ask for a handover; we can rotate, there is no issue at all. So evaluating my proposal on Linux could be a way for you to ask yourself whether you want to delegate the Windows crap to someone else. You are the only judge on that part. Next issue: Twitter. What's the status?
I didn't take the time to validate my chart, but I have one in draft. I will try to find some time this week to deploy it on a cluster and see if it works; then I can deploy it to publick8s in production. Cool, so I can move it to the next milestone. OK, thanks for the work there. Next one: Keycloak performance. We didn't have time to work on that. As a reminder, we have to migrate the PostgreSQL database from AWS to Azure. That might not solve all the Keycloak issues, but it will certainly decrease the latency. Stéphane, are you still OK to work on this one with me? Yes, yes, of course. Do you think I could give you a hand beforehand, without you, to smooth the way? If you are OK to start preparing the... Do you think I am able to? Absolutely. OK, I will try then. If you feel like you need to plan ahead, we can define a task list together. No problem on that, but it would be my pleasure if you could act on this one; that would be a great start. We pair together for a short time, we define the steps, and I will try to start it. Cool. Let me take notes then. So, for the Twitter one: nothing done yet, going to try the draft chart. Then this one is Keycloak, perfect. You didn't follow what you said. Yep, sorry for that. So, Keycloak: Stéphane to work on the SQL migration and list the tasks for the team.
Cool, thanks Stéphane. Always a pleasure. OK, so I've moved it to the next milestone and removed myself, and here we are. Next one: update center returning 404. I won't have time to check on this one. OK, I will take it unless someone wants it. The issue for the user should be solved thanks to Mark's answer. However, I want to consult Daniel, but that will have to wait until after the security release because he is busy. I'm not really sure what the dynamic update center is. As far as I remember, it was something to dynamically change the content of the update center, maybe providing a JSONP wrapper, I don't know. But I would like to know if it's still a feature, if we are expected to provide something, because it doesn't exist, while for the previous LTS versions there was an update center. So that's why we need to check where it comes from. I'm assigning this to myself and keeping it in triage, just to be sure whether I have something to do. Any question on this one? To do: ask Daniel for help after the security release; do we need to do something about the dynamic update centers? It might be an issue to open on update-center2, I'm not really sure. Next one: Windows agents are so slow. I will comment on it; I'm taking this one. That's the first part of the ACI work. I might have to fine-tune the ACI configuration on ci.jenkins.io to make sure that the C:\.m2 repository is a data volume, and I might also have to update the Docker images and the pipeline library to force the user.home location, unless someone wants to work on that. Nope? OK. Related to the ACI stuff; might need a Docker image update and a pipeline library update. No more triage, and moving it to the next milestone. Realign the repo.jenkins-ci.org mission: for that one, we had the meeting with JFrog, as mentioned earlier. It went well; it was just a status update with them. The scenario I proposed in the draft JEP is OK for them, which means enabling authentication. I've checked with Daniel, and Daniel quickly told me that it might
be painful for that solution, because it will require changing the tree of virtual repositories. So I need to check if we can enable authentication only for some sub-repositories, the virtual ones. I don't think it's possible, but I need to check, because I understand that if we have to change the parent POM for the whole Jenkins project, some end users won't use it until they update their plugin configuration, which can be painful. However, if we enable authentication on this part, it will break their plugin builds, so they will have to do something, and we need to communicate carefully. So I understand that it's painful, but it needs to be checked with the community as well. I think that covers the hypotheses with them. All I need is to work on the JEP proposal. Basil is not available for the project: we might ask him for help only to validate one or two technical elements, or if we are really blocked, but he won't have the bandwidth to help us. So we are on our own, unless someone in the community who knows Maven is able to help us; I think we will ask for help from either Tim or Mark. So that means Daniel and Basil don't have availability to work with us on this one, only to give advice if required. I'm moving that to the next milestone and we can proceed. Finish the MirrorBrain cleanup: I'm moving that one to infra-team-sync-next because I didn't have time and we have too many tasks right now. Priorities: JFrog. Which one was it? Delete account? Moving it back to the backlog. What is the next one? Collect Datadog metrics for ephemeral virtual machines. Stéphane? That's for me. I was at the step of providing a dashboard in Datadog that could be useful for developers, to check whether they had any misconfiguration or poor choice of Jenkins agents during their build, or whether they used too much memory, I/O, or CPU. After a short talk with Miquel Valenci, it seems that a Datadog dashboard is not the way to go, because in the public view you cannot select anything, so there is no way for us to provide information only about the agents that have
been involved in a given build. So I need to check with James whether we can manage something directly with him, or whether we really need to find a way to provide that information to anyone and then find something else, as it seems that, for now, Datadog dashboards are not usable publicly like that. OK, that's quite annoying. That means if we don't find a solution, we will have to stop working on that task and eventually just write a piece of documentation for the end users, the maintainers: if they have an issue and want to see inside, they have to ask on the helpdesk and we can provide the metrics. We can of course get that information, but making it publicly available is not so easy. Notes: publicly available is not easy; document for maintainers how to ask the infra team for the data. So at least solution one, which I wrote down now. We have the metrics thanks to your work, Stéphane, so at least we are able to retrieve them: last time, when Basil and James asked for information, we didn't have any. So at least we improved, and now we can observe the machines even afterwards. Yeah, and I learned a lot with Miquel, because in fact it's not as obvious as it seems: you'd think you could select, for example, AWS or Azure and it would change everything, but in fact no, not really, if you have only one graph. So they really need to get back to us, because we have all that information. I'm sorry, my English is not good enough. I'm not sure what you mean by "they need to come back to us": who are they, the people who want information about the Jenkins builds? OK, the customers. Just to be sure: so our customers, the Jenkins maintainers and plugin maintainers, will have to open an issue, and we will search based on our knowledge to find which agent types Jenkins used. OK.
That will be the default solution, I agree, and maybe without spending too much time on it. What do you think if I let you comment on the issue after the meeting? You already have the information to say: OK, we are going to update the documentation for maintainers, and we can send an email to the mailing list saying we have reached that level and don't have much more time. And maybe we can add a kind of issue template. Absolutely. One that is directly "Can I have information about the agents that ran this build?" Notes: issue template on the helpdesk to request agent metrics, nice idea. Solution 2, which I call long term, means we need to switch the metric collection to a public Grafana dashboard. Yeah, that's the only thing I was thinking of. So: add a public Grafana dashboard for that. I'm not sure if Grafana is able to retrieve data from Datadog; I don't think so. That would be amazing, but I'm not sure. Which means adding a Prometheus collection for ci.jenkins.io, instead of Datadog or in addition to it, I don't know, but yeah, that's the other solution. Does it make sense? If we want solution one to keep working, we will need both for at least some time.
Yep, correct. We don't have the bandwidth for that one right now, unless someone is interested in doing it; we can ask people for help there, it's Hacktoberfest. Nice job: at least we know we can help users, so it's still really, really useful. Thanks to Miquel Palancy, he helped me a lot. Thanks, Miquel. So I'm moving the issue to the next milestone, but the expectation is: write a message that summarizes the two solutions, implement the documentation, and, third item, add the new issue template. Once the three are done, we can close that issue, because its topic is collecting Datadog metrics, and then we will follow up on the parent topic, adding observability for the build agents. We will have to comment with the solution on the parent topic, but this one should be closable once you have finished. Next issue: publish pipeline-steps-doc-generator and backend artifacts. I worked on that locally. I will need to update infra.ci, so I need a go from both of you on when I can pause the kubernetes-management repository and when I can try to break the infra.ci configuration as a test before committing. I don't know how much you are using it, but can I work on it later today or tomorrow morning? I would like to finish with the Windows cloud problem that we saw this morning. You are working on infra.ci for that? Oh sorry, yes: infra.ci, a new version of infra.ci. I don't mind. I propose that I wait for you, Stéphane: when you have updated Docker weekly, I will start after that. Is there any blocker if I work on this later today and tomorrow morning? I think I found the issue on my side, so I can continue to work on it. After the weekly upgrade of infra.ci by Stéphane, then. OK. Collect Datadog metrics: we did it. Delete account: someone requested the removal of their account, so we have to delete the user. Sorry, I think I wrote a comment, but my comment is lost. OK, let's consider I messed up. I need to add a comment asking the person to send an email
to the private Jenkins infra team email address, from the email address used by the account (which is not public here). The goal is to validate the legitimacy of that request. I'm pretty sure you sent it; isn't it on another issue? Yeah, I'm checking my mail. So either there is an issue with GitHub and I will add the comment again, or... No worries on that. Delete account: where is the delete account issue? I meant to ask... no, it wasn't on this one. An email to the private address, OK, cool: validate legitimacy. It's just to be sure that it's not someone else acting on behalf of that user, even if the Jenkins account name maps to the GitHub account name, and that it's a plugin maintainer who validated the change. Alex took care of it: since that person doesn't want an account anymore, they don't want to be able to publish and maintain a plugin, so the default assignee in Jira has been changed to the other maintainer of the same plugin. Alex, thanks a lot for that. Now we have to remove the account, and I just want to be doubly sure before deleting it, so that one moves to the next milestone. Next issue: upgrade our GitHub Actions using the deprecated set-output. Thanks for noticing this one, Alex; you have already done it in a lot of repositories, so we continue working on this one. There is a specific issue with one that you reverted on Jenkins weekly. We looked at it with Stéphane earlier today, just to understand. In most cases we can absolutely keep the same output name; the one that is failing uses stdout. At first sight, we understand that the new GitHub method does not support multiline values, but I might be wrong. I saw something about the create-or-update-comment action in another repository, showing how to prevent errors with multiline bodies. I have a fix for that; I saw a fix for that in another repository. Cool, ready to either do it or share the solution? I'll share the solution first, and then we'll see if we can apply it. Cool: using multiline strings, which we do on docker, Jenkins, whatever. So Hervé and Alex are on it. OK, can I make you
co-assignee with Alex on that one? OK, you added yourself. Oh sorry, I was looking at the dark background on the avatar. Plugin documentation doesn't get updated: someone opened an issue on plugin-site because they weren't seeing the README updated. When the plugin site is generated, it reads the README of each plugin, converts it to HTML, and generates a static page for the plugin site. However, some parts of the website are dynamically generated, so as you can see in that example, the version was not updated and neither was the README. In that case there was an issue in a part I don't know. It appears that on infra.ci, the private controller, the master branch of plugin-site has been failing since the 12th of October. I'm not sure whether that master branch is in charge of deploying the image; it looks like it is. The issue is fixed, but they don't know the root cause. It came from an update center behavior where a few plugins had empty settings in some of the JSON parts, and these empty values were breaking the generation. So they don't really know. As far as I understand, there is no expectation from the infra team. I will keep monitoring this one unless someone is OK to take it. So yes, I'm adding it to "not directly actionable by infra team". As a reminder, we might need to rename that milestone, but I won't get into too much negotiation with too many people, I understand. So I'm putting that one in the backlog rather than infra-team-sync-next, but I'm assigning myself so I will receive notifications, and if I see any action required from us, I will move it back to the current milestone. Looks good to you? Cool. Unable to get the email for the password: the user wants to update the password of their Jenkins account. OK, that's the one where I already put the message. If next week we don't have any answer back from the user, we close it as not planned. It could be our fault, due to the spam system, if the user tried before the fix of the account app deployment; it could be something else, I don't know, we need more information. But yeah, we don't
delete or reset passwords just with a simple command, and we don't even know which account it is. So let's move it to the next milestone and close it next time if there's no information. Finally, the last one I have in the work in progress: someone is having issues with a piece of software called Red Hat Satellite, which seems to be a Red Hat proprietary system to manage large-scale infrastructure, and they seem to have missing metadata when using the Jenkins RPM repository. They were using archives.jenkins.io; I told them to use pkg.jenkins.io. They gave us information which I don't understand at all, so some work is required there. I will try to see if we can install Red Hat Satellite, but yeah, I'm sorry for the user who raised that issue: I don't understand at all what the problem is, and it seems to be making them mad. So yes, we're on it. I will start by checking the latest version and what repodata and metadata are. I remember there were some metadata associated with the RPM repo, and I'm sure we have some, otherwise it wouldn't work, so we'll try. But maybe it's something new in the Red Hat, CentOS, Fedora mess of operating systems, so we'll see. If anyone has that knowledge, please help us. I will update the notes after that. That's all for the work in progress. So first of all, did we receive new issues? We have Jenkins mirror requests, so we have to answer them: people want to host mirrors in Australia and Asia. Let's add this to the next milestone. Stéphane, are you OK to work on that with me? Yes, I will have to learn that. Cool, I'm adding you as a co-assignee. The first step will be retrieving the information. We have an issue about documenting that, and as far as I remember we have a runbook with some information, so we can make it public. If I remember correctly, we need to estimate the bandwidth usage, because it depends on the area of the world and the people using it. Exactly, but we can give them an order of magnitude by sharing what we have for archives.jenkins.io, if it's OK for you. Okay, so
that one is the one: Kubernetes 1.23. Let me update... no more triage. I'm looking, I've lost the issue. Kubernetes 1.23: what's the status on this one, folks? I did the update of kubectl, and we were reviewing the full changelog before getting started with DOKS. I started the PR for DOKS, I don't know how you say that in English, and then I moved back, because we need to finish the changelog and I'm not able to do it by myself. OK, so let's see how we can plan the updates. I propose that we start with DOKS and plan AKS for next week. Sounds good to you? In fact, the upgrade itself is already prepared, but we need to finish the whole changelog. It's more about a target date: when do we want to update? Depending on how long the changelog takes, we say we do it on a given day, and then we take the time based on priority. I think we will have time tomorrow, because we cannot do much because of the security release. Good point. I think that's already a lot. Let me check the latest issues. Do you have other issues in mind that are important to treat? OK, so these are artifacts. We have the carrups labs repo; sorry for this one, we won't have more time. The rest are set-output ones. I don't see other new things, so I propose we close. If I missed anything, please open a helpdesk issue or mention it, and I will take care not to forget it for next week's milestone. I will upload the recording of this meeting once it's available. All the others have been done. Hervé, just in case, you can delete all the HackMD notes except today's once you are finished with them. Yes. Do you have other questions before we close? No? OK, so let me stop the recording. See you next week.