Hello everyone, welcome to the Jenkins Infrastructure Weekly team meeting. Today is 30 May 2023. Around the table we have your servitor, Damien Duportal; Hervé Le Meur is off today; Mark Waite; Stefan; and Bruno, taking care of the collaborative notes. Perfect. Let's get started with announcements. The weekly release 2.407 has been released successfully, at least for the packaging part, including the Docker image. So that means, Stefan, you are ready whenever you want: you can update weekly.ci.jenkins.io on the infra side. It's currently happening: container images are verified, the changelog is published, and there are a few more items on the checklist to do, but they'll probably be done before the end of this meeting. Cool. Mark, just to double check, because we mentioned it one or two weeks ago: it looks like creating the tag on the jenkinsci/docker repository was effective, and about ten minutes after that, only the new release version was pushed to Docker Hub successfully. Is that correct? Yes. I created the 2.407 tag in my local repository and then did a "git push --tags". Unfortunately, that pushed four tags, not one, because I apparently had some latent tags sitting in my private working repository that were not on the remote. I promptly deleted three of the four tags because they were junk tags that didn't belong there, but I was worried that I might have damaged the build process. Damien, you said it had safety checks so that it would not attempt to push a tag that was too old, and those safety checks worked. Exactly. The current setup of the job on trusted.ci discovers all existing tags; if a tag is removed, it is removed immediately from the build history; and it doesn't trigger a build for tags older than three days. Which means that if, by error, someone re-pushes a tag from the past (the situation you were in), it won't rebuild and override the existing image. The downside of that setting is that if you want to rebuild a tag for a security reason, you have to trigger the build manually to force the system to do it; otherwise it won't pick it up by itself. Thank you. Now that leads to a future topic: we need to append a build number suffix to the end of our tags, because 2.407, while correct, does not actually give us the ability to rebuild something if we were to need to. Exactly, but that will be a topic for the platform SIG; of course that's the next step. For the infrastructure part today, we have demonstrated that we are not overriding existing tags unless we specifically trigger a build, and that allows contributors to have their pull requests merged way faster. In fact, that was not a mistake from you; that was a test of that feature actually working. Well done. The difference between a mistake and a test is very slim in that case. Got it. For infrastructure, the next step will be, I propose, to wait for tomorrow's LTS, and next week we discuss again the topic of automating the release, the part that builds the Docker container from the release. That could be as simple, for a first step, as release.ci creating the tag on the repository by itself once the packaging is finished (or maybe it can start earlier); that would automatically trigger the build on trusted.ci and avoid someone having to do it manually as part of the release process. A rough sketch of both the tag-age safety check and that automated push follows below.
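To make those two mechanics concrete, here is a minimal sketch in Python (this is not the actual trusted.ci job configuration or release pipeline; the helper names and the commented version number are illustrative only): skipping tags older than three days, and pushing a single explicit tag instead of "--tags".

```python
import subprocess
from datetime import datetime, timedelta, timezone

MAX_TAG_AGE = timedelta(days=3)  # safety window: older tags are discovered but never rebuilt


def _git(*args: str) -> str:
    return subprocess.run(["git", *args], capture_output=True, text=True, check=True).stdout.strip()


def tags_to_build() -> list[str]:
    """Discover all tags but keep only recent ones, so a re-pushed old tag never overrides an image."""
    now = datetime.now(timezone.utc)
    recent = []
    for tag in _git("tag", "--list").split():
        created = datetime.fromisoformat(_git("log", "-1", "--format=%cI", tag))
        if now - created <= MAX_TAG_AGE:
            recent.append(tag)
    return recent


def push_release_tag(version: str) -> None:
    """What an automated release step could do: create and push one explicit tag, not --tags."""
    _git("tag", version)
    # Pushing a single ref avoids the "four stale local tags" surprise described above.
    _git("push", "origin", f"refs/tags/{version}")


if __name__ == "__main__":
    print(tags_to_build())
    # push_release_tag("2.407")  # example; a build-number suffix like 2.407-1 could come later
```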
So I propose we wait until next week to discuss this one. Good for you? Let me take a note of that element. Do you have other announcements? Yes: 2.407 is the first release that will warn users of our CentOS 7 container image that the operating system that container image is delivering will not be supported after November 16, 2023. So you may see noise in online forums, you may see noise in various places, saying we're not going to support CentOS 7 anymore, and the correct answer is yes, that is accurate: beginning mid-November 2023, the Jenkins project will no longer support CentOS 7. Yeah, Red Hat Enterprise Linux 7 is the base, but the vast majority of users are probably actually using CentOS, not Red Hat, not Oracle Linux and not Scientific Linux. Nice. And there will be a blog post. Oh, it is today, isn't it Bruno? So I'll publish the blog post: we've got approval and today was the targeted day for it. I was thinking the day was tomorrow, but I'll publish that blog post right now. Thanks, I'm excited; that will help a lot in a lot of cases. We should add the link to the blog post in that comment. Here? Yes. Please help us, collaborative notes. I'll put it in there, because the blog post will be published within the next few minutes. Upcoming calendar: the next weekly, next week, is 2.408. We have an LTS tomorrow if I'm not mistaken, 2.401.1, so please don't break the infrastructure tomorrow. I haven't seen any security advisory announced, so we should not expect public advisories on the mailing list. I don't know about other major events. So, is there any other announcement or calendar item, or can we proceed to the operational tasks? OK. So, the tasks that we were able to finish: we invited a student from the GSoC project to the plugin-health-scoring repository and the associated team. We weren't sure about the initial need since they had to work on different areas, but it looks good for Adrien, so I've closed the issue. If there is something missing he will reopen it, give details, and we'll fix it. A user lost their password, so as usual that is worth checking the Google indexation of the accounts application. We get more and more users that just send that kind of request, and it surged quite quickly; I guess that's correlated with the rise of ChatGPT, and I'm afraid that maybe ChatGPT is sending people our way to open those account requests. I'm not sure how to deal with that, and I'm not really interested in doing things with ChatGPT, so anyone with a good idea, knowledge or skills on that part is welcome. Maybe my theory is wrong, but if that's the case, influencing ChatGPT so it does not redirect users to open issues there would be a great thing. Maybe we should ask ChatGPT what to do to not have ChatGPT. You can; I'm not creating an account there. Thanks Mark for handling that. Thanks Stefan for working on ensuring that the Popper.js 2 and Bootstrap 4 leftovers are removed from all of our controllers, with all the involved chaos we can have between the Docker images, the Puppet-managed parts and the manually managed parts everywhere. Anything to add on that topic, Stefan? But that's not the aim of your question, I imagine.
No, it was just that I forgot something. I don't think there is a task, a feedback item or a post-mortem needed for that one. Cool, thanks for that. I see that you closed the ARM64 virtual machine issue; that means we have effectively stopped using AWS for infra.ci.jenkins.io, including credential and plugin removal. Is that correct? I think I did everything. Cool, thanks. The next step for me is to check the impact on the AWS billing: it will probably be visible later this week or early next; I think it will be a small jump compared to the impact of the adaptation cost, but I'm sure it will be visible this week. Another closed issue: autolink references for core. I have no idea what it is; I assume it's something that was done between Alex and the team. Is that correct? Yes, that's correct, and it works well. The 15 or more plugins that I maintain are now using autolinks, and I confirmed that they work. So for those plugins that use the Jenkins Jira as their bug tracker, it makes things a little easier. Most of our work in Jenkins infra is not tracked in the Jenkins Jira, we tend to use GitHub issues, so it doesn't help us much, but plugin maintainers like me benefit if they choose to enable it themselves. That's cool, thanks for the explanation; I had forgotten what it was, and it's clearer now. Clean up, import and manage the Datadog monitors in Terraform: I'm going to be Hervé's voice here, so thanks Hervé, even if you're not around. There were several ways Datadog monitors had been created. Monitors are objects in the Datadog API and UI that let you define conditions and thresholds, and alert when those thresholds are crossed or those conditions are met. For example, when a virtual machine has a hard drive that reaches 80% disk space usage, that creates a warning alert, and if we set a 90% threshold, that becomes a paging alert. We recently started to use monitors properly and we realised that some were managed as code, some were created manually, some were created magically, so we did a big cleanup on that part. Everything is now managed as code from our Terraform repositories. Thanks for that big cleanup, Hervé. We removed false positives, and there was a hidden task behind it: based on feedback from both Hervé and Stefan, when we received a "no disk space left" alert (I don't remember on which service) while we still had a lot of free disk space, we all realised the device was actually full on inodes. So now, thanks to your actions folks, we have a monitor that not only watches the free disk space but also the free inodes per device, which will alert us in the future and avoid a service being blocked like that again. Nice iteration folks, I'm really proud. A small illustration of checking both dimensions follows below.
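As a purely illustrative check in Python (not the actual Datadog monitor definitions, which live as Terraform resources; the thresholds simply mirror the 80% and 90% values mentioned above), a filesystem is effectively full when either bytes or inodes run out, so both ratios matter:

```python
import os

WARNING, PAGING = 0.80, 0.90  # thresholds mirrored from the monitors discussed above


def usage(path: str = "/") -> dict[str, float]:
    """Return the used fraction of disk space and of inodes for the filesystem holding `path`."""
    st = os.statvfs(path)
    return {
        "disk": 1 - st.f_bavail / st.f_blocks,   # bytes: what the old monitors watched
        "inodes": 1 - st.f_favail / st.f_files,  # inodes: the dimension that was missing
    }


for metric, value in usage("/").items():
    if value >= PAGING:
        print(f"PAGE: {metric} usage at {value:.0%}")
    elif value >= WARNING:
        print(f"WARN: {metric} usage at {value:.0%}")
    else:
        print(f"OK:   {metric} usage at {value:.0%}")
```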
And one last issue, account name: I need to spend more time on this one. Now let's jump to the work in progress. As usual, for each of these items we will see if we continue working on it in the next milestone, or if we need to put it back on the backlog, or close it, because that can happen. First one: migration to the publick8s cluster from the former cluster. The goal was to fight against a big error message on AKS saying "hey, your network has an IP overlap", which can create a bunch of troubles, and it did. So we are migrating services to the proper cluster. The status is that we migrated a few services during the past milestones, including plugin-health-scoring, which is now running properly on the new cluster; here you can see the list of what has already been migrated. Keycloak was migrated last week without any problem. The incremental publisher service, which is basically a webhook receiver that receives messages from the pipeline library on ci.jenkins.io telling it to get the archived artifacts and publish them to Artifactory in the incrementals repository, is not highly available, so we had to be careful with it; it was migrated with about 10 minutes of downtime after letting everyone know. By the way, we should think about making it highly available. It's a stateless service, so we were able to migrate it. For plugin-health-scoring, I'm just checking the last messages... yes, plugin health score was done, with the help of Adrien; it was way more efficient to have the developer of the application there to help us with the probes, so thanks Adrien for that. We were then able to remove the deleted resources so they stop consuming anything for nothing. Just one point: we had a former DNS record for javadoc, no, it was for wiki; one of the two websites still had an old DNS record pointing to an IP address instead of being a CNAME to the shared jenkins.io name, so that broke the service after the resources were deleted (a quick way to check for that kind of record is sketched at the end of this topic). It is fixed and managed as code now, so we shouldn't have that issue anymore. The next step is to migrate what is still in progress, which will be done later today: rating.jenkins.io, which is used on the changelog pages of jenkins.io, where you have those icons that you click to vote and rate the releases, and uplink.jenkins.io, which is used for telemetry sent from Jenkins controllers all over the world so we can get some statistics. Both of these services have the same topology: a PostgreSQL database and two pod replicas running on the cluster, so we are going to migrate both at the same time in one or two hours. I expect to start working on the critical ones, LDAP and get.jenkins.io, after that, but we will decide that with Hervé when he is back, because we need to be full time for migrating these two critical services. So, almost there. Any question? OK, so we continue working on this, obviously, until the new cluster hosts all the workloads and we can clean up all the former resources.
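A hypothetical check for that DNS situation (not part of the actual migration runbook; the two hostnames are simply the ones mentioned above, and dnspython is assumed to be available) could distinguish a stale A record from the expected CNAME before old resources are deleted:

```python
# pip install dnspython
import dns.resolver


def record_kind(name: str) -> str:
    """Report whether a hostname is served by a CNAME (expected here) or by bare A records (suspicious)."""
    try:
        target = dns.resolver.resolve(name, "CNAME")[0].target
        return f"CNAME -> {target}"
    except dns.resolver.NoAnswer:
        ips = [r.address for r in dns.resolver.resolve(name, "A")]
        return f"A record(s) -> {ips} (check these are not hardcoded IPs of the old cluster)"


for host in ("javadoc.jenkins.io", "wiki.jenkins.io"):
    print(host, record_kind(host))
```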
Next one: migrate trusted.ci.jenkins.io from AWS to Azure. Stefan and I both worked on this one; Stefan was my rubber duck. The workload has been migrated to the new virtual machine successfully, so the next step is to run the effective migration of trusted.ci and see what happens. The proposed date is Thursday, the first of June; we might start tomorrow, but as a good and really wise tip from Stefan, better not to do it on an LTS day, or even a few hours after, even though I'm sure the LTS will be done quickly. That's why the proposal is Thursday. Until then, the next step will be to taint the virtual machines, which means destroying and recreating the machines from scratch to clean up every temporary test trace that we might have left. That won't trash the data already migrated, which we want to keep, because the data disks won't be tainted, only the virtual machines. Then we will see what the next steps are; there might be some fine tuning afterwards, especially on the security groups, but we have now reached the same quality level as what we had on AWS, so we should be able to proceed with the next steps. That means that during the operation, the update center index and the repository permissions updater jobs might not be able to run as expected, so as a safety measure I will run both jobs on the current machine beforehand to have something fresh and updated, particularly for the RPU. Once that's built, we stop the controller, do the migration, and then try to run them separately; the most critical one is the update center, but that should be quick to build and republish, and the second one will be the RPU. Most of the issues we should see after that migration will be IP openings on the different firewalls, because it used to be a whole machine, so the new IPs used by its agents and its controller have of course changed; for instance, when we want to push to the update center virtual machine we might need to update the configuration. So if you see weird issues in the publication of plugins or update centers, or people complaining that their plugin is not up to date on plugins.jenkins.io starting from Thursday, that will be the main reason. Any question? No. Thanks for the alert. And I assume that there will be some notice to people, or will there be any notice to people? Yes, the plan includes it: most probably it will be sent tomorrow, one day before the operation. status.jenkins.io will be updated, and an email will be sent to the Jenkins developers mailing list, because that one has quite an impact, so better to over-communicate; I prefer letting people know. Good point, thanks for the reminder, that's important: send the email on the 31st of May, one day prior. All the DNS records involved have been set with a TTL of 60 seconds, both for the cluster migration operation and for trusted.ci. Next task: install and configure the Datadog plugin on ci.jenkins.io. I worked a bit on how to get the Datadog plugin installed inside the Jenkins container so it can communicate over UDP with the Datadog agent running on the host machine behind it. It's mostly a question of setting up the agent to listen on the proper network interface so the container can reach it on the host: by default the agent only listens on localhost, and the localhost of the host machine is not reachable from a container. So it's only a matter of finding the proper Puppet setup for the agent, so that its agent configuration file is updated to listen on the proper network interface. Hervé told me he will be able to continue working on this in the next milestone; he did some successful manual tests on the machine, which have been overridden since then by the Puppet agent, so it's going in a good direction and we should be able to get way more information. Hervé was pretty excited about this topic because that could help a lot with the BOM build slowness. A tiny illustration of that UDP path follows below.
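Purely to illustrate what "communicating in UDP with the agent" means here (a minimal standard-library sketch, not the Datadog plugin itself; the host address is a placeholder for however the container reaches its Docker host, and 8125 is the usual DogStatsD default port):

```python
import socket

# Placeholder address: the gateway the container uses to reach the Docker host.
AGENT_ADDR = ("172.17.0.1", 8125)


def send_gauge(name: str, value: float, tags: list[str]) -> None:
    """Emit one gauge in the DogStatsD wire format over UDP, fire-and-forget, like the plugin does."""
    payload = f"{name}:{value}|g|#{','.join(tags)}".encode()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, AGENT_ADDR)


# If the agent is still bound to the host's localhost only, this datagram is silently dropped,
# which is exactly the symptom being fixed by the Puppet change described above.
send_gauge("jenkins.example.heartbeat", 1, ["controller:ci.jenkins.io"])
```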
We also had a question about adding a JitPack repository so a contributor is able to build their plugin. I haven't had time to look at this one; most probably we will act in the short term by adding an exception on the ACP (our artifact caching proxy), so that instead of the ACP trying to fetch it from our JFrog repository, the build directly bypasses the ACP for that specific repository. We need to help the user as soon as possible, so if it's OK for everyone I will keep that issue in the upcoming milestone. Upgrade to Kubernetes 1.25: I've started reading the changelog; we might have issues with the PSP (PodSecurityPolicy) removal and a few deprecations, so that needs to be reviewed carefully, and I'm still trying to evaluate the impact on our clusters. The idea is to plan the Kubernetes cluster upgrade for next Monday for doks and cik8s, the two clusters used by ci.jenkins.io, because they don't have anything related to high availability, load balancers or persistent volumes; since they don't have those three features, it should be easier to start with them. "ACP is unreliable": we didn't work on this one, so that was a time management mistake. Right now the next step will be being able to have inbound agents for the Azure VM agents on ci.jenkins.io; I'm keeping this one because I'm sure Stefan and I can work on it later this week. Ubuntu 22.04 upgrade campaign: I spent some time cleaning up two of our machines that are ready to be upgraded. These are two of the three OSU OSL machines, namely lettuce and edamame. The proposal is to migrate these as soon as possible, because they don't host any service, so better to upgrade them first. The idea is that if the upgrade goes properly, we can immediately move on to the third machine, which is also a virtual machine hosted at OSU OSL. Why are these three machines special? Are they virtual machines or bare metal? I'm not sure; I think they are virtual machines hosted by the OSU OSL organization, the Oregon State University Open Source Lab. The question is: if we upgrade the distribution in place and reboot, will the kernel still work with the virtual machine? If we lose edamame or lettuce, that's not a problem; we will ask them, and we will know what is required for the upgrade before breaking the Jenkins Puppet host, which would be way more annoying. Next one: supporting Linux containers when running the Windows virtual machines. Oh, I forgot this one; the next step is to install Docker via Docker Desktop instead of the standalone Docker engine on the Windows Packer images, that should be a few lines. And Stefan, finally, the ARM64 node pool on publick8s, to start using ARM64 pods for the production website workloads: can you give a heads up on this one? Yes, I started a PR to define the new node pool. The main problem, naturally, was to find the correct machine size to use as an ARM node; we worked a lot on disk size, memory and CPU to define the correct one. I think we found the good one, so it should go ahead now, and maybe we will have to work a little on a new node pool for the Intel one, just to rename the node pools and have something more coherent. Yep, homogeneous. Is it OK for you to prioritize it? Once the node pool is created successfully (because, you know, with Terraform and Azure we know that sometimes the plan says it should create these resources and, when it's time to create them, it fails with whatever error and then you have to iterate), the proposal is that you then start working on how we could tell Kubernetes to use the ARM images for the javadoc and NGINX workloads. OK. Again, the goal of that issue is to execute some of our workloads, websites, mostly static websites, on ARM64 machines instead of Intel so we can decrease the cost per request, or the absolute cost, of these workloads; a quick way to verify the pools from the cluster side is sketched below.
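As an aside (an illustrative sketch only, assuming kubectl access to the cluster and that it is AKS, where the pool name is exposed through the "agentpool" label; the real change is a Terraform node pool plus nodeSelector settings on the Deployments), checking which architectures the pools actually offer could look like:

```python
# pip install kubernetes
from collections import Counter

from kubernetes import client, config

config.load_kube_config()  # uses the current kubectl context, e.g. the publick8s cluster

arch_per_pool: Counter = Counter()
for node in client.CoreV1Api().list_node().items:
    labels = node.metadata.labels or {}
    # "kubernetes.io/arch" is a standard node label; "agentpool" is the AKS pool label (assumption).
    arch_per_pool[(labels.get("agentpool", "?"), labels.get("kubernetes.io/arch", "?"))] += 1

for (pool, arch), count in sorted(arch_per_pool.items()):
    print(f"pool={pool} arch={arch} nodes={count}")

# Once an arm64 pool shows up here, the website Deployments can be pinned to it with a
# nodeSelector such as {"kubernetes.io/arch": "arm64"}, provided their images are multi-arch.
```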
Now I'm removing it from the backlog. Let's cover the triage and the new issues. Mark, I think you can start, because I saw you opened an issue about BOM problems earlier today, is that correct? Yes. And I think you mentioned something that may help in your earlier comments, I'm not sure. What we see is that attempts to release the Jenkins plugin Bill of Materials failed over the weekend on three different attempts, each attempt taking from 1.5 up to nine hours. Previous releases that had been successful with this configuration took six hours, so there may be some change in the last seven or eight days that has caused things to become slower. We will take some actions on the Bill of Materials side: right now we're supporting four release lines, 2.361, 2.375, 2.387 and 2.401, and we will very soon drop 2.361, so that should reduce our runtime somewhat right there, immediately. But there may have been other changes that are worth further consideration in the infra team; the problem here may go away just with the changes we'll make on the Bill of Materials side, but if it doesn't, we will then need help. OK, so we still need to look at the logs and see what happened, because the spot instances eviction rate could have grown on those instances, so we can check this. And also, yeah, I fear that we will still be stuck with the contention for resources on ci.jenkins.io when there is a BOM build: when steps start to take absolutely unexpected times, simple steps that should take seconds taking minutes, I'm sure we are in that kind of lock. So for that one we need to focus on the new ci.jenkins.io instance as the first step, and the work that we are doing with the Datadog thing; we might have Datadog helping us monitor what is going wrong in that controller. Great. Yeah, so there are plans already on the BOM side to take some actions; Tim Jacomb even replied to that one saying hey, we don't even need to wait for the release of 2.401.1, and I think it's a valid statement, because 2.361 has no known security vulnerabilities and users should be running 2.375 or newer; actually by now they should be running 2.387. To decrease the time we need to investigate, and the Datadog plugin could help. OK, so I've added this one to the upcoming milestone. Now we have a request by Gavin about the Matomo Docker repositories. Yes, he had created a repository and we removed it because we hadn't heard any news for months on that topic, even after asking him, so he recreated everything. We have to sort this out, and I will just scan the configuration state of that repository and everything. I assume it's part of replacing Google Analytics with Matomo, which is back in the pipeline. That should help us stop depending on it, because it looks like even Olivier doesn't have access to Google Analytics: he doesn't have enough permissions to grant me admin rights to migrate some of the properties (these are objects inside the Google Analytics API, as I understand), so we would have to wait until July for the automatic migration by Google Analytics themselves. But there was a discussion, and Hervé was willing to help us have our own Matomo service. I understand that Olivier Vernin has run Matomo for the past two years for updatecli, and Gavin also runs it on his own, so that would be a service to host on our cluster. The next step will be, and I will explicitly ask Gavin here, what we need for running it on the cluster, because there is no reason for hosting and building a Docker image if we don't run it somewhere, and we need to know the requirements for running Matomo in production.
So by default I'm adding it to the next milestone, given that Hervé expressed some interest one or two weeks ago when we discussed that topic; if I'm mistaken we will remove it and let Gavin work on it, or just say no, or move it to the backlog. It's not top priority, but it is important, because Gavin is able to spend some time with us now for the bootstrap, so better to use that precious time. Mark, you opened an issue about the Artifactory bandwidth assessment. Oh, I forgot to work on this today, oh crap. So, the idea is that after a meeting with JFrog last week we have two brownout sessions to do. A brownout will be us changing a major setting on the repositories and observing the effects, both on the infrastructure builds and on the builds from outside contributors; a brownout is between a blackout and, I don't know, does whiteout exist? So, between nominal conditions and everything broken. The goal is, for one hour, and we will have let everyone know a few days before that on that day, during one hour, we will change that setting, which might have that impact and will most probably break your builds, because we want to see how it breaks. The first one, once validated one time with the jgit repository, will be to see if we can disable the repo1 Maven remote by making it private, visible only inside Artifactory. That repository is used in the public virtual repository that everyone should use; we are not absolutely sure, but it looks like some users are calling that repository directly on repo.jenkins-ci.org instead of going through public. There can be two kinds of usage behind a direct call to that repository: the first one is a misconfiguration of a pom.xml or settings.xml, and if that breaks, it's OK, because if it's a valid and expected use case it just needs fixing by using public; the second one is an abusive use case that costs us bandwidth somehow. So we need to check whether we can disable repo1: we will do a test on jgit and then a brownout of this one to see if it breaks things. Is that a good summary, Mark, for the first step? Yes, it is. And then the second brownout will be removing repo1 even from the public virtual repository entirely and seeing the impact. That one will need more details, because one thing is for sure: if the abusive use case switches away from the direct repo1 and realizes it can use public, that will just shift the problem from one repository to the other; we also need to understand why we have a mirror of repo1 today. That one might need some fine tuning of the ACP too, because if the ACP cannot find an artifact, it will need either to fail abruptly, and then we fix the pom.xml dependency, or eventually to download the artifact directly from the upstream repo1 instead of our JFrog repository and cache everything, to keep the caching on our infrastructure. What the brownout looks like from a client's point of view is sketched just below.
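As a hedged illustration of the client-side effect (the artifact path and the repository key for the repo1 mirror are assumptions for the example; only the "public" virtual repository at repo.jenkins-ci.org is the entry point builds are supposed to use), one can probe whether a given artifact still resolves through each path during the brownout:

```python
# pip install requests
import requests

BASE = "https://repo.jenkins-ci.org"
# Example Maven path (group/artifact/version); any artifact normally served through "public" works.
ARTIFACT = "org/jenkins-ci/main/jenkins-core/2.407/jenkins-core-2.407.pom"


def reachable(repo_key: str) -> bool:
    """HEAD the artifact through one Artifactory repository key and report whether it resolves."""
    response = requests.head(f"{BASE}/{repo_key}/{ARTIFACT}", allow_redirects=True, timeout=30)
    return response.status_code == 200


# "public" is the virtual repository builds should use; the key of the repo1 mirror below is
# a guess for illustration only, check the actual repository name in Artifactory.
for key in ("public", "repo1-cache"):
    status = "OK" if reachable(key) else "unavailable (expected for the direct repo during a brownout)"
    print(key, status)
```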
So we have to move this one to this week's milestone, because the goal is to do the first brownouts this week. Is it still OK for you, Mark, if we target jgit on Thursday or Friday? Yes, yeah, I think so; I would have a preference for Friday if it's OK for you. Friday sounds great. So, Damien, for the Friday brownout, is it still OK if we only do the jgit one? We think that's relatively low impact on the Jenkins project overall. Yes. So, jgit on Friday, and based on the feedback we plan to do the brownout of repo1, the direct one, next week; is that OK for you? Yeah, that should be fine. Monday or Tuesday next week for the repo1 one, timing proposal to be discussed. Well, it doesn't even have to be that early in the week, because we won't have an LTS that week, we could do it Wednesday. Oh yes, but I would prefer doing the brownout as soon as possible so we can give feedback to JFrog as soon as possible. OK, great. So, jgit brownout on Friday? Yep, Friday. And the repo1 brownout on the 5th or the 6th, is that OK? Yes, yes. It's the same thing for repo1: we don't have administrative control on the upstream repo1, so clearly we would need to know more; the "-cache" suffix implies it's a cache of the original repo1 external mirror. That's the same wording; I used the "abusive" wording when mentioning that mirror repository. So we don't need more triage. What do we have? "Find a way to monitor jobs run from private controllers": yes, that one doesn't need triage and it's not the priority. There was one that was raised in chat just minutes ago by Gavin Mogan, the 3602, did it already get covered? It's Matomo. Yes. OK, I hadn't seen it in the notes, so I added it to the notes. OK, real-time check, perfect, thanks. Yeah, I added it to the milestone but I forgot to cherry-pick it to the notes; let me also add the Artifactory assessment. So, "find a way to monitor jobs run from private controllers" is kind of the next logical step after connecting ci.jenkins.io to the Datadog plugin: that should give way more information to Datadog about the internals of Jenkins, such as the number of failing jobs, which could allow us to monitor critical jobs. For instance, when the BOM takes more than 10 hours, we could have a monitor on Datadog letting us know; that's a practical example. But for some private and sensitive controllers, such as trusted.ci and release.ci, as infrastructure officer I refuse to enable the Datadog plugin at that level of detail. We can have Datadog sending virtual machine metrics, "oh, that sensitive virtual machine is using a lot of CPU", that information is OK; but sending the internals of the Jenkins controller could, depending on what is accidentally set up by someone, lead to an unexpected backup of credentials in Datadog, which is a scenario that could happen and that we don't want. Specifically for the update center, we don't want an unexpected backup of the update center certificates, right? So we need to find another way. There used to be a proposal by Daniel, it might have been an issue or a private conversation, I can't remember, and I've shared it with Hervé. The idea is that each of these sensitive jobs that we want to monitor gets a post-build step that just writes a few selected pieces of information: the date when it ran, whatever information is not sensitive. I think we have a public bucket with JSON files used for the reports, so we can write there, without any risk for the safety of those controllers, the status of the latest update center or RPU builds, and then build the Datadog monitor that says: if the last successful update center build is older than 15 minutes, send an alert. We can build that kind of two-step process; a tiny sketch of that idea appears at the end of this triage review. It's not top priority, but it would be really useful for us to track these jobs and help developers, because we could fix things before errors happen. That's why I propose we don't start working on it; we don't have time this milestone. I'm removing the triage label, but that one is interesting to track. JitPack should not have the triage label anymore; it concerns ci.jenkins.io and it's on the new milestone. OK, I don't see any other new issues. "Remove pull credentials from the Kubernetes cluster": for that one I'm removing the triage label but not adding it to the milestone, because we won't have time for it.
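A rough sketch of that two-step idea (the file name, field names and the 15-minute threshold are placeholders; in practice the JSON would live in the public bucket and the alert would be a Datadog monitor over it rather than this script):

```python
import json
import time
from pathlib import Path

STATUS_FILE = Path("update-center-status.json")  # stand-in for the object pushed to the public bucket
MAX_AGE_SECONDS = 15 * 60                        # threshold discussed above


def publish_status(job: str, result: str) -> None:
    """Post-build step on the private controller: record only non-sensitive facts."""
    STATUS_FILE.write_text(json.dumps({"job": job, "result": result, "timestamp": int(time.time())}))


def is_stale() -> bool:
    """Monitoring side: alert if the last successful run is missing, failed, or too old."""
    try:
        status = json.loads(STATUS_FILE.read_text())
    except FileNotFoundError:
        return True
    return status["result"] != "SUCCESS" or time.time() - status["timestamp"] > MAX_AGE_SECONDS


publish_status("update-center", "SUCCESS")
print("ALERT" if is_stale() else "OK")
```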
OK. So, do we have new issues or new elements you want to talk about and add to the milestone or to triage? I've got one last item that needs to be tracked in a new issue; I will take care of opening it and adding it. We received a pull request from Alex, and I will sync with Alex on the implementation to see whether we have to do it. Alex and the team are heavily working on weekly.ci.jenkins.io, which is a public instance and a public demonstrator of the new Jenkins design, UX/UI and so on, mainly the design language, and they want to let anyone have the system read permission so we can show the UI of the system administration, the new UI, which is a valid and legit use case. The thing is, that could risk, and would risk, people being able to access some encrypted credentials. Even if the credentials are encrypted, that permission grants some specific access; we are not completely sure, but I'm not really willing to try, because we are in a sensitive area. Giving read access to the system configuration should give you access to the JCasC export, as far as I can tell, which contains an export of the encrypted credentials, and our credentials could appear in some fields. I don't know exactly how the permission works, but as a matter of safety my proposal is this: I don't want to block that new thing, but first I prefer to stop using LDAP authentication for that instance and switch to the local Jenkins user database, so we would have an admin account with a shared password for the administration. That way, there would be no chance to expose the LDAP binding password. The second credential that could be at risk is the GitHub App token, but as Tim said it's fine to accept that risk there, because it's a really fine-grained GitHub App. So my proposal is to change the configuration of weekly.ci so it doesn't use LDAP anymore and doesn't hold any credential at the top level, unless they are public credentials for demonstration; then there would be no risk of anyone accessing sensitive credentials there. So the real concern, the remaining concern, seems to be that LDAP binding credential, and the only way to stop using it is to stop using LDAP as the authentication system? Exactly, because on paper that permission shouldn't expose credentials, but if the credential were ever exposed, that would be annoying, because it's a public instance; better safe than sorry. I also want to suggest to Tim and Alex that we could create a dummy agent, meaning an SSH agent machine running outside Kubernetes, which could be a tiny virtual machine somewhere, if they need to demonstrate the UI of the nodes pages; but that's not for now. So yes, that's all for me. Is there anything else you want to add? No? OK, so then I'm going to stop sharing my screen, I'm going to stop recording, and for the people watching this recording, see you next week. Bye bye.