Hello everyone, welcome to the Jenkins Infrastructure Weekly Team Meeting. Today is the 13th of February 2024. Around the virtual table we have myself, Damien Duportal, Stéphane Merle, Bruno Verachten, Kevin Martens, and we have a new attendee, Akash Mishra. Please note that Hervé Le Meur and Mark Waite are most probably unavailable today, so we're going to start without waiting for them. Let's get started.

It's been two weeks, we cancelled last week due to FOSDEM, so a lot of things. First of all, weekly releases. We had the 2.444 and 2.445 releases. There were issues last week with 2.444; I can't remember which one though, let me check the history. Do you want a little reminder? Yes, I broke something, sure. Yes, you deleted something that was needed, you know, the... Oh yeah, the file storage. True that. So it failed due to... let me search the issue. It was on get.jenkins.io; I believe we closed it right after. Which milestone was it? Oh, it wasn't in any milestone? Oh no. So let's add it to the closed milestone, the milestone we are going to close. Okay, let me take notes: due to unforeseen consequences, a file storage was removed during the migration to premium storage.

So what happened is that the issue showed we were using a non-premium file storage, which is a shared file storage backed by object storage behind, managed on Azure, that can provide Samba or NFS multi-client access as a persistent volume to our systems. In that case, for get.jenkins.io, the mirror redirector. That storage was costing us around 2 to 3K per month, which was so much that we initially thought: let's just move to a virtual machine, or use a simple SSD, which would clearly be cheaper. And we realized along the way that in fact we were billed by usage, as you can see on that one. We were using "transaction optimized", and transaction optimized says: hey, you are billed per snapshot, per transaction. While in the case of the new one, everything is included, particularly metadata. Which means we were able to remove this 3K, because the storage cost, even though a bit higher (0.20 per gigabyte instead of 0.05, so four times the price for storage), still comes to a total of around 200 per month instead of 2K per month. So it's a tenfold decrease.

As part of this, we had to create a brand new premium storage and migrate the data, and there were two file storages. One was marked, superficially at least, as unused for years, so I deleted it and never migrated it; the data within is already present on archives.jenkins.io and pkg.jenkins.io. And to the question "hey, what purpose does it serve?": the purpose is that that share was used during the Jenkins core release, which broke last week's release. Fixed by recreating the missing file storage, documenting it, and filling it with data from the pkg.origin.jenkins.io VM. Kudos to Hervé, because he pointed that out during the emergency. So good job; thanks to Hervé and Mark for handling this release. So thanks everyone.
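To put rough numbers on the storage-tier change described above, here is a back-of-the-envelope sketch. All figures are illustrative assumptions taken from the conversation, not actual Azure pricing; the share size and the transaction bill are invented for the example:

```python
# Back-of-the-envelope comparison of the two Azure Files billing models
# discussed above. Every number here is an assumption for illustration.

SIZE_GB = 1000  # assumed share size, invented for the example

# Transaction-optimized tier: cheap per-GB, but billed per transaction
# (metadata operations included), which is what dominated our bill.
txn_price_per_gb = 0.05            # assumed $/GB/month
txn_transactions_cost = 2450.00    # assumed, to match the ~2-3K/month bill

# Premium tier: roughly four times the per-GB price, but transactions
# (metadata included) are part of the flat price.
premium_price_per_gb = 0.20        # assumed $/GB/month

txn_total = SIZE_GB * txn_price_per_gb + txn_transactions_cost
premium_total = SIZE_GB * premium_price_per_gb

print(f"transaction-optimized: ~${txn_total:,.0f}/month")    # ~ $2,500
print(f"premium:               ~${premium_total:,.0f}/month") # ~ $200, a ~10x drop
```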
So then everything went fine and we didn't have that issue today. We had another issue though. So, today's release: is the release out? Let me see what the status is. The packaging build failed due to an agent being deleted; I believe there was a network issue. Let's have a look at the checks. Okay, it's currently synchronizing with the mirrors, so in a few minutes that should be okay. Let's go back to the notes: packaging job failed due to a transient network error. It restarted, almost finished, and the Docker image is out. The tag has been created, the release has been created, so we are ready to update the Docker-based images for the weekly release. Kevin, I don't know if there is something remaining on the changelog for this week? I merged it and it's live now, so that should be all set. Oh, cool, thanks. So by the end of this meeting, everything should be there. Is there any question on the weekly releases, the problems we had last week and this week, or something to add? No? Okay. Do you have something else on the announcements, folks? I don't either, so let's continue.

The next weekly release will happen next week as usual, Tuesday, 20 February 2024; that will be 2.446. And, as usual, the LTS release, 2.440.1. So next week is an LTS week: be careful next week folks, don't break production. Do we have an announced Jenkins security advisory? None as far as I can tell, so that's a good thing. And next major event: SCaLE in March. That's an opportunity for people living on the West Coast to meet some of the Jenkins community members who will be there; at least Mark, I believe, Mark and Alisa most probably. Is there anything else on the calendar or announcements? No? Okay, let's roll.

"Add deprecated topic". So let's start with the jobs we were able to do during the past milestone, which was an exceptional milestone due to the cancellation of last week's meeting; it's a two-week milestone, so that's a lot of elements we were able to fix. So, a GitHub topic was added to one of the repositories. Just to mention: these are managed by the jenkinsci GitHub organization administrators, which we are not, but we use the Jenkins infra help desk for that.

We had 404 random errors on ci.jenkins.io. Let me regroup the Datadog issues. We have had an issue with version 6.0.0 of the Datadog plugin, which corrupted some build pages and build data. So we had to update to 6.0.1, which stopped corrupting data but quite often showed a StackOverflowError requiring a controller restart. Finally, the 6.0.2 version of Datadog was able to stop this StackOverflowError. So now we are safe, but there have been multiple consequences. So: "Datadog plugin failures". Let me rewrite that; if you have any question while I'm ordering the elements, don't hesitate. So: 6.0.0 made the controller corrupt build data, fixed by 6.0.1; StackOverflowError, fixed by 6.0.2, applied everywhere we use it. Okay. I'm searching for the line... okay, this one.

The first funny consequence of this was that all builds on ci.jenkins.io were showing the epoch as the date, because the build data was corrupted: the default value when Jenkins cannot load the build data is the 1st of January 1970, the epoch. So that one has been fixed, thanks to the work of Hervé and Mark. They upgraded the plugin, contacted the Datadog maintainers and exchanged with them. And another of the consequences is that we had the random 404 pages, because some builds had their build data corrupted. After we restarted the controller and applied the changes, all of these error pages were gone.
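As a side note on why the corrupted builds all showed the same date: a wiped or zeroed build timestamp renders as the Unix epoch. A minimal illustration of the mechanism (this is not the Jenkins code itself):

```python
# A missing or zeroed build-date field effectively becomes timestamp 0,
# which formats as the Unix epoch: the 1st of January 1970.
from datetime import datetime, timezone

corrupted_timestamp_ms = 0
build_date = datetime.fromtimestamp(corrupted_timestamp_ms / 1000, tz=timezone.utc)
print(build_date.isoformat())  # 1970-01-01T00:00:00+00:00
```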
Finally, we realized with Hervé that the controller logs are not collected on ci.jenkins.io. We willingly disabled that like a year and a half ago and never went back to that area. But since Datadog sponsors us with unlimited data and requests, that was a good opportunity, now that we have removed everything sensitive from ci.jenkins.io, to start collecting these logs so we can correlate them with the logs sent by the Datadog plugin itself: controller versus build versus agent logs. I'm looking at all the other consequences... I think it's okay. Agent spawning on infra.ci, okay.

There was another unforeseen consequence of that. The incrementals publisher application was answering HTTP 400 client errors when builds on pull requests for different plugins reached success and tried to tell the system to deploy the generated incremental builds to the specific repository. The reason is that this one uses ci.jenkins.io as a source of truth, and since the source of truth was an archived artifact of a build which was corrupted by the Datadog problem, that was a domino effect that killed the service. So that one is on the RPU archived artifacts, which were corrupted by Datadog. Fixed once RPU ran again with success. Next step: let's start using reports.jenkins.io instead. So there is an issue for me. The idea is to do like infra-reports and other services: the RPU (Repository Permission Updater) is a job that runs on trusted.ci, the trusted and private controller, which is able to publish to reports.jenkins.io as flat files. These files are public, their content is public; there is nothing sensitive, except that we want these files not to be served by ci.jenkins.io. Instead, reports.jenkins.io is a simple static web server which we can rely on as highly available. We started the work, and the report is now generated and served, so we now have to update the incrementals publisher. For the next release, I will have to write an issue. Is there any question on these topics, things to add, things unclear? Okay.

A note about the incrementals publisher: thanks to Hervé and the team for the huge amount of work put into it; it has been released at least seven times in the past four days for the incremental service upgrades. A lot of features, bug fixes and improvements. Any question? Nope, okay, thanks.

Alex and Hervé were able to create a new category on the help desk. So when you open an issue related to community.jenkins.io, such as "oh, I've been banned" or "I have an issue with my Discourse account", it can be a great help for us to categorize and triage these kinds of issues. So thanks, Alex and Hervé.

Next one: community.jenkins.io page views exceeded the OSS plan. Alex Brandes, one of the Jenkins board members, was told that we were crossing some thresholds in the number of users and accesses. Looks like it's fixed. I haven't checked in detail, but it looks like he contacted Discourse and they were able to increase our plan, because they fully sponsor us. So thanks Discourse, and thanks Alex for taking care of this. Same kind of issue on GitHub: the jenkinsci GitHub organization was consuming more user seats than allowed by the sponsored plan. In fact we are sponsored in an unlimited way, so either it was cosmetic or it was just a matter of contacting them so they could virtually increase the number of seats. That has been closed, so thanks again to the administrators for taking care of this one. Is there any question? Nope, okay.
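Back on the incrementals-publisher change mentioned above: the idea of reading the RPU output from reports.jenkins.io instead of a ci.jenkins.io archived artifact could look roughly like the sketch below. The URL, file name and JSON shape are hypothetical, for illustration only:

```python
# Sketch: fetch the RPU permissions report from the static reports.jenkins.io
# server instead of an archived artifact on ci.jenkins.io (which is what the
# Datadog corruption broke). Path and JSON shape below are hypothetical.
import json
import urllib.request

REPORT_URL = "https://reports.jenkins.io/rpu-permissions.json"  # hypothetical

def allowed_paths_for(repo: str) -> list:
    with urllib.request.urlopen(REPORT_URL, timeout=10) as resp:
        report = json.load(resp)
    # hypothetical shape: {"jenkinsci/some-plugin": ["org/jenkins-ci/plugins/..."]}
    return report.get(repo, [])

# The publisher would then check an incremental artifact's path against this
# list before deploying it to the repository.
```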
In the top-level topics, I'm just adding back the file storage issue; just a word on get.jenkins.io, because it created some errors. Hop, here we are. I believe I've added it, because that's what spun off that error. So, that issue is what I mentioned about last week's failed release: some users, for an hour, twelve hours or even more, got a 404 when trying to download the latest Jenkins, the weekly release, which was broken due to the missing file storage; see the announcements section. As we discussed, that has been taken care of by Hervé, many thanks for that, and Mark. It was fixed during the same day and now we are back to something working. Is there any question on this topic?

Next issue: some users were blacklisted from a download mirror. It relates to another issue; let me triage. That person is one of the maintainers of the university mirror that sponsors us by providing a file server that serves as one of the mirrors: if you are close to Germany, network-wise or geographically, most of the time your downloads of Jenkins WAR, HPI or JPI files come from that file server. They saw, let's say, unpleasant usage on just a few files, old versions of old plugins, which looks like someone had misconfigured their whole system. So they enabled a fail2ban system, which tends to blacklist or rate-limit downloads when someone is abusing. As a consequence, some users who had misconfigured their systems were downloading a lot of plugins and were falsely considered abusers. They have since adjusted the rules and everything is back to normal.

But that was an opportunity to check the Jenkins operator, which is one of the numerous ways to install Jenkins on Kubernetes. That operator does not allow providing your own custom-built image with your plugins, which means that on each Jenkins restart, like with the default setting of the Helm chart, Jenkins tries to re-download all the plugins. If the plugins are already there, of course, it only checks that there is no new version, but that's still a lot of downloads. In that case, their user had 25 controllers using the operator, and they all restarted at the same time, which made, yes, a little peak of downloads. It's not catastrophic, but fail2ban still considered it abuse, so it had to be fine-tuned. The good thing is that it clearly allowed us to see: oh, there is an issue with the operator. The maintainer of the Jenkins operator is working on that feature. And yeah: if you run Jenkins on Kubernetes, please disable the plugin download feature and build your own image; that will be easier for you. Is there any question? Nope, okay.

Next issue. I believe that one is not related to us, and I think we need to change the status to "not planned", sorry. The user had a DNS resolution issue: they were redirected to a mirror, and their system wasn't able to get an IP from that name. It's not something we can control; it's their system. So I'm adding it to the "closed as not planned" pile, because there is nothing related to our work and there is nothing we can do for that user. Any question? Nope, okay, let's continue.

As a follow-up of the get.jenkins.io storage change to premium, which broke the weekly release: that was also an opportunity for us to automate two elements, using updatecli and Datadog. First, since we define the infrastructure on one system with Terraform, which defines the size of that file storage, and then we reference it in the Kubernetes management, we have to keep some elements, such as the size, in sync. We want Kubernetes to always be in sync with the size of the real file storage and have the same amount.
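Concretely, the kind of drift check we automated looks conceptually like this. File names, keys and sizes are hypothetical; the real automation is an updatecli manifest, not this script:

```python
# Conceptual sketch: the share size declared in Terraform must match the
# capacity declared on the Kubernetes PersistentVolume. Hypothetical names.
import json

def terraform_share_size_gb(tfvars_path="shares.auto.tfvars.json"):
    with open(tfvars_path) as f:
        return json.load(f)["file_share_size_gb"]  # hypothetical variable name

def kubernetes_pv_size_gb(manifest: dict) -> int:
    capacity = manifest["spec"]["capacity"]["storage"]  # e.g. "200Gi"
    return int(capacity.removesuffix("Gi"))

def in_sync(tf_size_gb: int, pv_size_gb: int) -> bool:
    # updatecli's job, roughly: detect this drift and open a PR to align values
    return tf_size_gb == pv_size_gb
```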
So that has been automated by Hervé or Stefan, I don't remember exactly. And we also started to monitor the amount of data we use, because that volume is never cleaned up, which is what we want: unlimited persistence, but we need to be sure that we only pay for what we consume. So there is a balance to find. We need to always stay below 80% of disk storage usage so we can get the best performance from the network and storage system, but we also need to increase the storage when we cross that 80% threshold. As such, we have added a Datadog monitor that alerts us when we reach it. Is there any question or things to add? Nope, okay.

Next issue: we had a problem with infra.ci, our private controller, which wasn't able to spin up agents. Thanks Stefan for taking care of this one; may I let you explain the outcome in one sentence? Okay, the problem was with the autoscaler on the node pool, specifically the ARM node pool, ARM64. We did set the autoscaler from zero to 10, I think, and the problem was with the zero: Microsoft Azure pushes people to use one and not zero, especially when you're using spot instances, which take the risk of getting killed. When that happens, it does not spawn up again the one that was killed and you stay at zero; but if you manually spawn one, then it works and triggers everything fine. We opened an issue with Microsoft, we had some calls and exchanges, and they stuck to the fact that we need to start at one. We did some math and the budget is okay; with small spot instances, in that case, that was something we could afford. So we did set it to start at one. But before they closed the ticket, I told them that we solved the problem by starting at one, but that it would be smart if we could start at zero, and if the autoscaler were able to spawn up a node when there is none. Nice job. Which means we now have a much more responsive system for ARM. And I will add that, thanks to the work we are going to mention a bit later on migrating everything to ARM64 all-in-one agents, that choice made even more sense: if all of our jobs are running on the same node pool, the probability of having a pod running at any moment is close to 100%, and it's okay because we don't consume other node pools for that. Agreed. Thanks for that work. Is there any question on this one? No, okay.

Next one: we now have a nice API to expose the mirror list to our end users. So if you want to allowlist these IPs or domain names, you can consume it programmatically, because we provide a static JSON file, and it will be amended if we change these mirrors, add more or remove some; it's automatically generated. Click on the link at the bottom to show them. Cool. I was going to say: hey Stefan, do you want to guide me? Where's the link? Hervé did put it up here. Yeah, yeah. So it's JSON of course, as you can see. We have one mirror, which is the fallback. Uh-oh. Oops. We are in the middle of a release, so there might be something going wrong; I'm not sure what the reference is. Oh, there is a bug in the system. You know, two weeks ago I was saying everything was fine with the release and in fact it was not, so... Yep. Okay. So, re-open the issue. We can re-open it and add it to the new milestone; that's what I did two weeks ago. Live demo effect, always good, seen during a team meeting. The list of mirrors only shows the fallback. Here we are. But you get the idea.
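For reference, consuming that static mirror-list JSON programmatically could be as simple as the sketch below. The endpoint and the JSON shape are assumptions for illustration, since the real file is the one linked in the notes:

```python
# Sketch: read the generated mirror-list JSON to build an allowlist.
# URL and field names are hypothetical.
import json
import urllib.request

MIRRORS_URL = "https://get.jenkins.io/mirrors.json"  # hypothetical endpoint

def mirror_hosts() -> list:
    with urllib.request.urlopen(MIRRORS_URL, timeout=10) as resp:
        data = json.load(resp)
    # hypothetical shape: {"mirrors": [{"name": "...", "url": "https://..."}]}
    return [m["url"] for m in data.get("mirrors", [])]

print(mirror_hosts())  # today this would only show the fallback, per the demo
```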
That's a nice service. I'm sorry Hervé, I'm sorry: each time we open that, we have to re-open the issue. Okay, so let's amend the notes and add it to work in progress: deploy the mirrors metadata, take two. What is the issue? Ah, let me get the link. Okay. But there is still an unexpected bug: the list is empty except for the fallback. Okay. Is there any other question on this one? I would say it's still a great job, because we now have an automation system we will be able to use as a foundation to increase the amount of public information we provide as a meta API, like GitHub provides their own public API. So we'll be able to have our public outbound API, et cetera. That would be awesome one day. One day, one day; step by step.

One last item, just something we did a few months ago. Thank you, Bruno. Since we changed the fallback of these mirrors to archives.jenkins.io, that solved the long-running issue of a user who was using a tool in a Red Hat environment named Red Hat Satellite. That system was scanning everything and trying to download all the packages to create their own private mirror. The problem is that the former fallback system, the OSUOSL servers, did not have all the data: they had a garbage-collection system that was deleting older artifacts that were more than three or four years old. That's why, since we were able to move archives.jenkins.io to DigitalOcean and we have the bandwidth, we define archives as the place where we keep every artifact we ever had, and as the fallback for downloads. So if you are trying to get an old file, the mirror system will redirect you to that fallback, which effectively solves that user's issue. So I closed it. Any question? I'm sorry, just... I'm not sure, maybe I should shut up: the fallback is not up to date all the time; we still need to solve that problem, no? It's up to date after 30 minutes maximum. Yes, we've still got that gap, okay, good. Exactly, but since the problem here was with old files, those have been up to date for years. Old files are good; new files, a 30-minute gap, yes. Exactly, got it. But good point.

Okay, I think that's all for the closed issues. We had a few issues closed as not planned. Most of the time it's a user-facing issue: a user opening a help desk ticket for their own Jenkins installation, which is not our job to manage. Most of the time they have to be redirected, politely, to community.jenkins.io, because there is nothing we can do to help them; the community is far more efficient than us at answering their issues. Finally, we have a couple of issues that were never answered, as usual, so we closed them. Any question on the closed-as-not-planned ones? Okay, so now, work in progress.

We've seen the download mirrors, so there is an issue to be fixed, or at least analyzed. Stefan, I may help if you want. I mean, we can all take this one; we'll see later how we dispatch the work once Hervé is back. Is that okay for you for this one? Yes, with pleasure. Then thanks, Alex and Mark, for taking care of the Jira updates. That consists of opening an issue with the Linux Foundation, as they manage the Jira instance for us. Like we did last year (every year we have an LTS upgrade to do), because the 9.4.x baseline will be end-of-life in October or November this year, and the change is to the latest LTS on the 9 line. So nothing to say here; waiting for the LF to give us a date for the operation. Nothing for us to do here, and they will manage it. Any question?
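To make the fallback behaviour discussed a moment ago concrete, here is the decision in miniature. This is an illustration of the idea only, not mirrorbits code; the selection logic in the real redirector is far richer:

```python
# Sketch: if no regular mirror still carries the requested (old) file, the
# redirector sends the client to archives.jenkins.io, which keeps everything.
FALLBACK = "https://archives.jenkins.io"

def redirect_url(path: str, mirrors_with_file: list) -> str:
    if mirrors_with_file:
        # the real system picks by network/geographic proximity
        return mirrors_with_file[0] + path
    # garbage-collected or not-yet-synced files (30-minute gap) end up here
    return FALLBACK + path
```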
Then: unexpected delays building a small plugin on Linux agents. That's a long-running issue. Some agents, when spawned on DigitalOcean, had trouble downloading from our caching proxy. We disabled the usage of DigitalOcean, and right now we consider that we need to start upgrading to Kubernetes 1.27. Usually that kind of trouble happens in the trimester before DigitalOcean drops a given Kubernetes line, and the end of February marks the end of support for Kubernetes 1.26 on DigitalOcean. That's why we think migrating to 1.27 will clean up the virtual machines and the cluster, and will most probably solve the performance issue. If not, then we will have to diagnose and eventually change the way we use DigitalOcean. Requires Kubernetes 1.27, see below in the new items. I planned this during this milestone; we will add it to the next milestone after covering the current work in progress. Is that okay for you? Yes. DigitalOcean Kubernetes agents are still disabled on ci.jenkins.io.

Not an important one: I tried to update the JNLP arguments, as pointed out by Basil, because we are using deprecated forms of some arguments (deprecated but still functional). Failed on Azure VMs; need to investigate. I've seen different possibilities and errors, but we need to check: the virtual machines are created, but the agent fails to connect back to the controller, in both cases, containers and virtual machines, both Windows and Linux; Azure containers, both Linux and Windows. So I need to investigate in an isolated scenario to not break production. Unless someone wants to play with this, I'm continuing my work here. No question.

OpenVPN revocation: nothing done here. Let's check if we can diagnose this during the week, or we will drop it. That's about revoking OpenVPN certificates. Most probably there is documentation written by Olivier Vernin, but someone needs to check. A first superficial check wasn't able to find anything obvious, which means we might be missing things here around revoking former certificates. We can revoke one person's certificate if we want to remove that person's VPN access, but revoking whole sets of certificates can still be a problem. Let's check and try with a local installation. Most probably we need to use a CRL. It is a question of scope: we can revoke a user, but can we easily revoke the former certificates of a still-valid user? No question? Nope, okay. Thanks to the people reading my notes; that helps a lot.

Versioning the docs: checked with Hervé yesterday; we need to continue working on this one. With FOSDEM, that was not the top priority. I know some tiny things were done, but some work still needs to be applied, and Hervé was missing time due to FOSDEM. Is there anything else to add on your side, Kevin, on that one? No, nothing to add there; just working on some issues and stuff that we found. Cool, thanks.

Uplink: we still have corrupted database records. We have slow-running requests to find them all, but the dichotomy takes 20 to 40 hours, so that takes time. One request a day keeps the corruption away; continuing to work on this one. Each time, we find just 10 to 20 corrupted records compared to the billions of records in that database; we remove those 10 to 20 records, that solves the issue, and then the seek-and-find continues. So now we're able to download 3 megabytes of logs (initially we were blocked at 200K), and it's only on the 24th and 25th of December of last year. Christmas gift, yeah. If you have data to find on other days, that's fine; just these two days are corrupted. So, as discussed with Daniel, maybe that should be okay, but I would want to remove the corruption. We are still not able to understand why these records were corrupted; no logs are available on the Microsoft PostgreSQL side. No question? Okay.
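The "dichotomy" mentioned for Uplink is a plain bisection over the time range: since one full check takes 20 to 40 hours, halving the window each round is the only practical way to corner 10 to 20 bad rows. A sketch of the strategy, with `window_is_corrupted` standing in for the real slow SQL check, and assuming a single corrupted window:

```python
# Bisect a time range to locate a small window containing corrupted records.
from datetime import datetime, timedelta

def find_corrupted_window(start: datetime, end: datetime,
                          window_is_corrupted,
                          min_span=timedelta(hours=6)):
    """Narrow [start, end) down to a window of at most min_span."""
    while end - start > min_span:
        mid = start + (end - start) / 2
        if window_is_corrupted(start, mid):  # the slow query, on half the data
            end = mid        # corruption is in the first half
        else:
            start = mid      # otherwise it must be in the second half
    return start, end        # e.g. converging on 24-25 December
```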
Next item: intermittent out-of-memory on Java 21 builds. Nothing done; it still needs to be investigated. Not sure why: maybe it was fixed, maybe it's in development, maybe it's on the infrastructure side. We need to investigate that one. Most probably we'll see it come back with more priority when we start the switch to JDK 21 for controllers and agents; but right now it's JDK 21 for builds, so it's a different concern and a different usage.

Next one is on me: migration leftovers from publick8s to ARM64. The next step is LDAP: we need to plan the migration of the persistent data, as the ARM64 VMs are in a different availability zone (AZ) than the x86 ones. That means we need to migrate the data volume to a new one which will be zone-replicated. The cost overhead is close to nothing, but we need a migration to a zone-redundant persistent volume, and then migrate LDAP to ARM64. The image is already working on ARM64, so the data is the problem here. The next candidate after that is Keycloak.

And still no news from the mirrorbits maintainer. I tried to catch Jean-Baptiste Kempf at FOSDEM, but I wasn't successful in getting his attention; there were too many people around him. So, no news. We will ping them, because one of the solutions for us would be to fork and maintain our own copy of mirrorbits: given that no commits land there, it would be more efficient to maintain our own copy of the system. That also includes challenging the usage of mirrorbits in the future. There are other solutions in the open source world, and open source is a good thing. Let's see; we already have enough work right now.

Stefan, can you explain the status of the migration to ARM64? Yes, I'm fighting with, I think, the last image, which is bundler... sorry, it's not bundler, I forgot the name of the image, but there is Ruby inside, because it's installing bundler. And right now that's a nightmare, but I think Hervé pinpointed my problem in another issue: the problem was related to the PATH for asdf. In fact we were using variables, and those variables were not interpolated.
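To illustrate the interpolation problem Stefan is describing: a literal `$HOME` (or similar) written into the image configuration is just text; nothing expands it on its own, so the resulting PATH entry never matches a real directory. A minimal demonstration, assuming a path of that shape (the actual fix in our images went through a regex rewrite, as described next):

```python
# An uninterpolated variable in a path is not a real path.
import os

raw_path = "$HOME/.asdf/shims"        # what ended up in the image config
print(raw_path)                       # -> $HOME/.asdf/shims (literal text)
print(os.path.expandvars(raw_path))   # -> e.g. /home/jenkins/.asdf/shims
```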
So we had to fight, the whole team had to fight, with a regex instead, to manage a way for the PATH to be updated correctly. It's still a work in progress: the last build worked almost everywhere except on Windows, so I launched another build, because most of the time with Windows that's just a flaky issue, and it's not related anyway, because we don't install that on Windows. So for that part at least, I think it's going the right way. Nice. I believe the next candidates after infra-reports will be the other consumers of the Docker builder, which is the one you are trying to migrate. Exactly. I think Puppet will be first in line soon enough. But yeah, cool; Puppet, we need to update it too.

Reminder, long term: spin up a new AKS cluster only for infra.ci agents, on the Azure sponsored subscription. The goal will be to separate controllers and agents, like we do for the ci.jenkins.io agents, and we will pay for the agents on the new system. Since we won't have spot instances available (I haven't had any answer; I need to re-request a third time), we will at least be able to scale to zero. Any question on this one? As a reminder (I still have to check the costs in detail with the CDF for last week), that work on ARM64 has been proven to be a gain of 5 to 7% on the last two months of builds on Azure. So thanks.

An old issue, which was consecutive to this, so let me get back to get.jenkins.io, here we are: still that file storage migration to the premium system. Related to it: now that we use a premium file storage, we see improved times when loading the list of previous releases. When you go to jenkins.io, go to downloads, and want to see the list of all the former weekly downloads, that request is a slow one on the file storage, because it's using Samba on a system backed by object storage, and listing the content of a file container inside that kind of storage is not an efficient operation unless you do some tricks. There were two solutions. The first one I tried: NFS version 4, premium only (because right now it's NFS 3.1 if we want to use NFS, which is the case, instead of CIFS in the AKS CSI driver). But it failed on the first two attempts; there was a permission issue, I'm not sure which one or how, so I focused on migrating to the premium file storage. That could still be an improvement, because NFS v4 with the proper client-side tuning on the CSI driver might help in that area. But I cannot be sure, because the backend systems are still not direct disks; you have an intermediate storage system.

However, Hervé raised a long-running question, also raised by Mark two years ago: why not generate these HTML pages on each core release? I mean, that's a static page that lists the existing releases. So when we do a weekly or an LTS release (a core release in general), we regenerate that page, copy that file onto the mirror system, and we're done. That could also be redirected to jenkins.io; there are multiple solutions here. So Hervé had that issue, with that proposal raised by Mark a few years ago. Kevin, there is no expectation here, but I'm mentioning this because that could be something we can do on jenkins.io, and updatecli could help automate it; there are at least four different ways of doing it. It used to be done by Olivier four or five years ago (I think Tim was there and worked with him on that element if needed; that's really ancient history): during the generation, they were spinning up a Docker container with Apache inside serving the files, they did the slow request one time, and curl exported the HTML file, which they then deployed to production.
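The historical approach is easy to sketch: pay the slow listing request once per core release, export the resulting HTML, and ship the flat file. The same idea in miniature, with hypothetical URLs and paths (the original used an Apache container and curl):

```python
# Sketch: generate the releases page once, statically, instead of doing the
# slow file-storage listing on every visitor request. Names are hypothetical.
import urllib.request

LISTING_URL = "http://localhost:8080/releases/"  # hypothetical internal listing
OUTPUT_FILE = "index.html"                       # static page to publish

def export_releases_page():
    # The one slow request, paid once per core release.
    with urllib.request.urlopen(LISTING_URL, timeout=300) as resp:
        html = resp.read()
    with open(OUTPUT_FILE, "wb") as f:
        f.write(html)  # then copy this file onto the mirror system
```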
So that could be as simple as that. I believe Hervé would be really interested in driving this, but I don't want to speak on his behalf, so I will let him confirm by next week. I'm going to keep that issue on the milestone, and if we don't have time, or if Hervé is not interested, it will go back to somewhere lost in our backlog. Any question? Okay.

Next one, which is the most important and most blocking one: replace blobxfer by azcopy. That one is blocking the update center migration, and blobxfer was creating a lot of additional network requests to the file storage. Hopefully we don't have the transaction billing problem anymore; still, that will improve the performance and the security of how we manage the credentials used for writing to this file storage. Good news: Hervé was able to find a way to generate short-lived SAS tokens using an Azure service principal. That's a lot of big words to say that, with an account whose credentials don't need to be renewed every month, when we need to deploy files we can generate a short-lived token that allows the build, for one hour, to copy or read files on these systems. Which is really useful, because we don't have to manage that credential; and that credential, if exposed, will automatically become invalid within one hour at most. So that protects us without having to renew the tokens every month or every three months. That's really interesting. And in the future we could use an Azure managed identity, which is the same kind of thing as a service principal, except that it doesn't store any credential at all inside Jenkins; instead, it's authenticated by the Azure cloud based on the instance (either container or virtual machine), if it has the permissions. So we have numerous ways here.

Next steps: right now, work in progress on contributors.jenkins.io; next, updates.jenkins.io, which is the new update center system, and the others if it works. Long term: the pkg.origin.jenkins.io virtual machine; that will come after the next steps. That machine is using blobxfer every five minutes to synchronize the metadata of the update center, so we'll need to use azcopy there instead. Any question?
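For the curious, the short-lived token mechanics Hervé set up look roughly like this with the Azure SDK for Python. This is a hedged sketch: the real pipeline drives azcopy, this assumes a blob-style storage account, and all names and IDs are placeholders:

```python
# Sketch: a service principal requests a user-delegation key, from which a
# one-hour SAS token is derived. No account key is stored anywhere, and a
# leaked token self-invalidates within the hour.
from datetime import datetime, timedelta, timezone

from azure.identity import ClientSecretCredential
from azure.storage.blob import (BlobServiceClient, ContainerSasPermissions,
                                generate_container_sas)

credential = ClientSecretCredential(tenant_id="...", client_id="...",
                                    client_secret="...")  # the service principal
service = BlobServiceClient("https://account.blob.core.windows.net",
                            credential=credential)

now = datetime.now(timezone.utc)
delegation_key = service.get_user_delegation_key(now, now + timedelta(hours=1))

sas = generate_container_sas(
    account_name="account",
    container_name="updates",                # placeholder container
    user_delegation_key=delegation_key,
    permission=ContainerSasPermissions(read=True, write=True, list=True),
    expiry=now + timedelta(hours=1),
)
# azcopy (or any client) can then use:
#   https://account.blob.core.windows.net/updates?<sas>
```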
So, immediately, that one blocks the update center evaluation: we have asked the Jenkins security team for a review, but we need to migrate updates.jenkins.io to a new premium storage with that new token system. So right now it's blocked by finishing the blobxfer-to-azcopy replacement. Okay for everyone?

Finally, we've brought back an old issue. We have a lot of projects where we want two pipelines: one pipeline for the project itself, and a second pipeline for updatecli, like we used to do on GitHub Actions. The goal is to have two separate jobs. We want to split them so that if updatecli fails, it doesn't block the main pipeline; and it's also easier to use the graph view, because otherwise (you know, the Terraform jobs) it breaks the graph view, and that's really hard for us to use day to day. So separating the jobs for a given project solves the issue that was making usage of the graph view hard for us. Terraform, Packer, Puppet: it's written, and so are the Packer jobs on infra.ci. It's a long-running task: we have a lot of jobs to cover and we are missing automation, so one or two jobs max per week, and we do this over the long term. The WIP is currently on Azure, and a Packer Azure job has been created for updatecli. So: run every task and split the pipelines. Any question on this one? Okay, perfect.

Now, new issues. Stefan, you can start with Kubernetes. Yes, I plan to start at least with the DigitalOcean upgrades, as there is not really any risk: we disabled both of them. We have two clusters, one which is public and one for the agents, and both of them are not used, so I will do the upgrades during this milestone. And that's good, because they are the first ones to reach the end of this old 1.26 version. The DigitalOcean clusters are not used; the risk is zero. And then I discovered that the version used for the client on the all-in-one agent was not correct, so that was a double effect of the upgrade: the client version on the all-in-one image was incorrect, found thanks to this. Nice. And the next candidates: the AWS EKS clusters. Yeah, we'll leave this one to Kevin. Cool. Oh, and we have the Azure production clusters. Azure: you just click on the button and you wait for everything to be ready, and then you blame the network. Always blame the network. No, usually you blame the DNS or the regex. Yeah, the regex is also a good candidate for blame. Okay, so we remove the triage label (I will add the proper label later). Do we have other triage issues?
Yes: Golang. We need to improve the way we manage it. Right now we track the Golang version, but we track based on the latest available. Right, here we are. So Stefan, I'll let you continue explaining what this issue is about. I need to remember... we need to match the same version of Golang between the one in the all-in-one image and the places where we use it, so the shared pipeline library and the shared tools. We cannot have the latest one on one side and an old one on the other, so we need to provide the same one at the same time. Okay, that's all for me. Otherwise we break the shared tools and OpenVPN when upgrading; or we stop upgrading, and that's fine, yeah. For that one, I don't mind helping; I will let you drive, since I see you have assigned yourself, Stefan. Maybe I didn't assign myself; maybe you did. No? Nope, five days ago. I was sick or something. Don't lie!

As a reminder, we already do that kind of trick for tracking the Jenkins tool version of Maven for the installer: we have an updatecli manifest with two sources. The first source gets the version of the all-in-one image we have in production today, which is this one; that's literally the production. From this, you go to the packer-images repository source for that specific tagged version, which is the one in production, and you read the tools version YAML file with a query on, in that case, the Maven version. So in your case, with a query on the Golang version, you will be able to get which Golang version is currently in production. I think we have the same kind of updatecli manifest for the version of the agents within the JCasC definition. Absolutely. So that's the same pattern, which means that once we have upgraded Golang on the all-in-one image, all the projects using and updating Golang will use that kind of trick to say: oh, I see a new version, let's try to update the Golang dependencies based on what we have in production. So the Packer one is the source of truth. Absolutely. So, am I adding this to the next milestone?

I think I saw one last triage issue, about the new private Kubernetes cluster for the new agents. Oh, that one is for Stefan; he already did the trick, and Stefan, you already wrote the issue for the new cluster for infra.ci. Yes, I forgot; I do work, sometimes. What do you mean? So let me add the reference here. Okay, I've added it to the next milestone; is that okay for you? Why not. And finally, we have one last triage issue: "need to diagnose". Okay, need to diagnose, and we have added the new cluster. I'm adding myself on the new cluster, if that's okay for you, Stefan? No, it's okay. I'm assigning myself, and I will ping you and say: hey, come work with me. Is that okay for you? Yes. Can I do the same with Kevin, if that's okay? Yes, I don't mind; I'll work with you, Stefan, no worries there. Cool. There is a problem with that indentation. Yep, no worries, it works.

I think that's all for the issues; I don't see new ones. Let's remove the triage label from this one and add it to the milestone. Good. Are there new issues, or things you want to discuss, concerns or whatever? I don't see any. Okay, is there any question or topic you want to bring to this meeting? Do you have any question before we close? Thank you. Then we'll see each other next week. Once the recordings are available, I will publish them on YouTube, along with the one from the previous meeting. I wish everyone a nice end of week. Take care, bye bye.