Hello everyone, welcome to the Jenkins infrastructure weekly meeting. We are the 16th of May 2023. Today around the table we have myself, Damien Duportal, along with Hervé Le Meur, Mark Waite, Stéphane Merle, Bruno Verachten and Kevin Isner. Let's get started with the announcements.

The weekly version of Jenkins, 2.405, has been released successfully. I saw the changelog was published — Mark, is that correct? Correct. And the container image: you published it and I verified it, so I'm very pleased with that. The new process of creating a tag worked as expected; the build picks it up within minutes. I only made a minor mistake on the Linux builds, which has been fixed — thanks for the review — so the next weekly should work properly.

Next steps with the new system: since we have updated the release documentation, the next step for us infrastructure people is to communicate with the security team so they can update their documentation the same way and know the prerequisites for the next time they have to issue a security advisory on a core version. We also need to start the discussion about whether we should add the release GitHub team as a maintainer of the Jenkins controller Docker image repository, so they are able to create tags. That is the question; I don't know the answer, but we have to discuss the pros and cons and find a solution, because it's obvious the team needs write access to the repository, or at least a way to issue the tag when the release is finished — that could be through automation, there are numerous options. But we've done the first step.

Why am I mentioning this during the infrastructure SIG meeting? Because on the infrastructure side we kept having issues with the image being rebuilt and republished, changing the checksum, and that impacts our own usage for ci.jenkins.io.
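To make that checksum-drift problem concrete, here is a hedged Python sketch — the image names, bytes and digests are illustrative, not real registry data — of why a consumer that pins a checksum breaks as soon as a tag is rebuilt and republished with different bytes:

```python
import hashlib

# Sketch of why rebuilt-in-place tags hurt: a consumer that pins a digest
# fails verification as soon as the same tag is republished with new bytes.
def digest(blob: bytes) -> str:
    return "sha256:" + hashlib.sha256(blob).hexdigest()

def verify(pinned: str, blob: bytes) -> bool:
    return digest(blob) == pinned

original = b"jenkins-controller-image-2.405"
pinned = digest(original)           # recorded at release time
assert verify(pinned, original)     # same bytes: verification passes

rebuilt = b"jenkins-controller-image-2.405-rebuilt"
assert not verify(pinned, rebuilt)  # republished tag: checksum drift
```

Building once from an immutable tag is what makes the recorded digest stable enough to verify later.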
That rebuild also impacts trusted.ci and release.ci — most of our five or six controllers — and it impacts us in terms of supply-chain security as well. Now that we build from a tag, we can start doing smart things: I remember Hervé asking for SBOM capability, publishing the SBOM of a given build. The ground is now set for working on that one for real; it's ready to go, so anyone interested in contributing that to the Jenkins controller image can get started.

Is there anything specific with that weekly release, Mark? No, we've still got a few items on the checklist that need to be run, but it looks like it's all going to be just fine. Okay, so that means we are ready to deploy that weekly core release to our systems, for weekly.ci.jenkins.io and for ci.jenkins.io.

Second announcement: today we had a security advisory. That one went fine, a usual advisory, plugins only. We had some plugins that were subject to the advisory, so they have been updated on ci.jenkins.io, the public instance, and I already did trusted.ci. Third, CI updates: I saw the pull request merged on our docker-jenkins-weekly image — that was done for that specific reason; I did the same during the past advisory two weeks ago. The last one remaining to deploy will be jenkins-lts, with the plugins advisory. Specifically for the docker-jenkins-weekly image, that will be both the weekly core release and the plugins update for the advisory. So ideally we should try to deploy it later today. Is that okay, Stéphane — will you be able to do it after the meeting? Yes, yes. Sorry, I was checking on the upgrade we need on the Kubernetes management side, but that's not related to ci. Okay, update done.

That's all for the security advisory. Do you have any other announcements, questions or things to add on the security advisory or announcement topics? Cool. Let's have a look at the upcoming calendar.
The next weekly, 2.406, will happen on 23 May, as usual. I don't remember the next LTS. 2.401.1: release candidate tomorrow, release in two weeks, which I think is 31 May. Cool, thanks Mark. And Alexander Brandes is the release lead. Perfect.

Security advisory: we had one today, announced yesterday. Next major events: do you have conferences, events, major items in the upcoming weeks? Not in May that I recall. We missed the one in Brussels we had received tickets for, carried over from last year to this year; they only told us today, so it's too late for us, but there's no charge anyway.

So let's go through the work we were able to do during the past milestone. We had those minor fixups on weekly.ci.jenkins.io — thanks Alex for taking care of that. Just a word: weekly.ci.jenkins.io is a public demonstrator based on the weekly core release, showing some of the newest elements and features. And it appears some of its setup was done manually in the past, which makes no sense with the work currently being done by Hervé on migrating instances.

Yes, thanks Alex — you saw that and were able to contribute to the JCasC configuration so that it's properly set up. Is my assessment correct, Hervé, or do you have other elements to add on this one? What I don't understand: the HTML had not been rendered correctly for the past year, ever since we connected it to LDAP, and I don't understand why. Marc, maybe it's a security feature, or an accidental one. We had forgotten to open the firewall rule with the new IP for the new instance, so there was an error when connecting to LDAP. And meanwhile the top banner, a custom message written in HTML, was rendered as plain text. Once we fixed the LDAP port, after the first login the HTML was real, visible, and rendered as HTML.
The markup formatter configuration had been disrupted by the LDAP initialization. That's interesting. It's strange that the HTML was not rendered during the past year, and I don't understand why. Okay, I am one of the causes of the manual configuration: when I saw that things were broken, I went to the configuration page and switched the markup formatter from whatever format it was to one validating HTML. And even then the HTML was still not rendered, as it had been for the past year... Yes, and I don't understand it either. I looked into it; it would be complex to understand. I don't think it's very important, but yes, it was a mix of manual and as-code configuration. One could think it's related to the markup formatter, but maybe not; we don't have any manual configuration anymore. Yes, but we don't know whether it can happen again or not. If someone can figure it out, it would be Marc, but I don't think it's worth it. If I see a blocker, I'll fix it. Exactly — we're all on the same page. Okay, thanks, that was the word on it. It's fixed now, and Alex confirms the weekly instance is in the state we expect, so it should be okay.

Closed issue about huge cloud cost due to outbound bandwidth: it's been three weeks now, and the metrics on Azure show that adding the S3 artifact manager on ci.jenkins.io clearly decreased the outbound bandwidth. We still have a bit, but it's way lower, and now it's sustainable.

Closed issue: we had someone blocked by the spam-account detection. More and more users say they are blocked by it, and every time, when you look in the Datadog logs for the account app, you see the reason is a cookie: these persons tried to create an account in the past five or fifteen minutes with the same web browser and never logged out.
So when they try to create a second account, there is already a cookie from the account app, which is detected by the backend and refused, to prevent someone batch-creating accounts on the same session. Each time we see that, it's a user who tried something, got an error, or made a typo in the email — that kind of thing. I wanted to share that piece of knowledge with all of you folks.

A long-term solution would be getting rid of the account checks inside our own application and using something built for that — Keycloak or whatever application can do it — but that's the long-term future; we have plenty of things to do until then. In the short term, we could also remove this spam check from the account app and only log when there is a cookie detection, without blocking. I don't see a problem with keeping it; the current situation is fine. For me it's an indicator of people not reading the instructions properly. You cannot fight against that, except by using an application with a user path that is way easier for them. In that case I don't see a reason to engineer a solution for people who mistyped their email in a field. Okay, but maybe I'm wrong; it's just a proposal. If you feel it's blocking a lot of accounts, then yeah, maybe we could start thinking about it, but here, most of the time, they just retry 24 hours later.

Yeah — I saw one user, discussed one-on-one, with a web browser and tons of tabs, who never quit their browser, which never ended the session. So even retrying 24 hours later fails, since there is no session cleanup. I thought the browser applied a TTL, so maybe our cookie doesn't have a time-to-live; I don't know how it works on the browser side. That could be a solution: fixing the cookie detection by adding a time-to-live of one or two hours for the session.

Build aborting when the cluster scales down: that was a user error that has been fixed.
The nugget of knowledge here: when the buildPlugin pipeline-library method is used, it has a fail-fast parameter enabled by default, which means any failure on one of the parallel branches immediately stops all the other branches. Users can tune it — the parameter is exposed through the function — but it can lead to surprising behavior like this: here we had a pipeline specialist, a Jenkins expert, and even that expert was caught by that trap. So just be aware: buildPlugin fails fast, and a failure on one branch kills the whole build. That makes sense from an infrastructure point of view, of course, because we don't want to consume too much machine time.

Hervé, you finished the Launchable task on the agents and you were able to update the pipeline library — is that correct, and are there other tasks or feedback on that topic? We have to open an issue on their repository to tell them I had issues when using the nanoserver image, but nothing otherwise. Okay, cool. So does that mean everything has been handed over to Basile explicitly, and Basile is now taking it from there — is that correct? I haven't seen confirmation back, but I told him it was ready on your side. Cool, thanks.

Account issue — I'm passing on it. Temporary name resolution failures on the plugin BOM build: I took it on me to close that one, because the work you folks did last week removed the error; I haven't seen it on the BOM build anymore. We haven't fixed the root cause, but it's not present anymore. The root cause is the CoreDNS component: the local DNS server inside the CI Kubernetes cluster on AWS either had too many issues or had temporary failures resolving outside domain names. We were able to track that the Datadog agents on that cluster were — I don't know the English word — sending a lot of requests to CoreDNS, because our Datadog agents were probing all of our
services as part of the Datadog probes — that's the historical system. So everyone worked on this part and we were able to disable them. As a team we discussed the topic and decided that the ci.jenkins.io clusters should not run Datadog probes for our services, particularly since, in the past two years, Datadog introduced something named Synthetics that lets Datadog run their own probes on their own systems — you can select regions and cloud providers in different locations around the world. So there are fewer good reasons for us to run our own probes and consume CPU and network. By removing the Datadog agent custom probes we decreased the load on CoreDNS, and that's why the issue is not there anymore. If it happens again, then we will do a deep dive on the CoreDNS component; but the error did not appear every time. We had previously been installing the Datadog agent at the machine level — not in every container we were starting inside the Kubernetes cluster as a ci.jenkins.io agent, but on each machine. And that set of all machines was enough: in the case of the BOM that could be 100 or more machines, so it could be very large, and they were enough to overwhelm the DNS. Interesting.

Cool. Another closed issue: disk space for the system pool. That one was tricky. As pointed out by Stéphane a few weeks ago, after the initial cleanup work on Datadog we had some monitors alerting us that we had passed the 80% threshold of disk usage. That threshold is important on Linux because beyond it performance decreases; even with SSDs, most of the time you need at least 20% free disk. For the system pool on that cluster it was awkward, because it's the default system pool, and when using Terraform on Azure, if you change the default system pool it wants to recreate the cluster, because the lifecycle is tied. However, we
were able to find a workaround: creating a secondary system pool, draining everything, then removing the old one. Everything was done manually, either with the Azure command line or the UI, and at the end, if you change only the naming in Terraform, Terraform is tricked into thinking the current system pool — the first one in the list of system pools — is the default one. Just like that, it goes. So we were able to increase the disk space for the system pool, and after a deeper look, we don't pay more for these disks, because we have set up ephemeral machines: we now use all the available ephemeral storage for the OS disk. In the context of Kubernetes the OS disk does not survive a virtual machine restart or reschedule, but we don't care, because that's Kubernetes with autoscaling. That's why we use it.

Jenkins CI failing for a Jenkins plugin after changes in the Jenkinsfile: I don't remember this one, but it's closed — I think the user had issues. Okay: the user was opening a pull request while committing with a Git username different from the GitHub username of one of the maintainers. So of course, when they tried to change the Jenkinsfile, our security policy on ci.jenkins.io did not take the pull request's modified Jenkinsfile into account, because the user was seen as untrusted: they opened the pull request with their GitHub account name, but the commits still carried their former Git username. Jenkins does a deep inspection at the commit level for the authors, so that's good — it means the mechanism still works as expected. A positive sign.

Another account issue, with a wrong email, around the apital plugin: we have a plugin maintainer for whom we did a bit more than expected for the infra scope, but we helped that user get access to be able to release their plugin. The trick here is that it's an old plugin that hasn't been updated in years, and the associated technical account was affected by — there was a security issue, Marc, in 2020, before I joined, and a lot of LDAP accounts that
were marked as unused for months or years were disabled on the Keycloak and LDAP side, and the consequences of that are still around; it created a set of minor issues. Also, the user wanted to do a manual release of the plugin — not the best idea; they had issues configuring their Maven, and, alas for them, they were instrumental in discovering that the Artifactory UI changed. Thanks Bruno, thanks Marc for taking care of that. Our documentation on jenkins.io for developers needs to be updated as soon as possible, at least removing the UI steps for now, and we need to discuss with JFrog, because you cannot get the Maven configuration settings file through the UI anymore: you must use the curl command line. That's the takeaway. So Marc, we have to make an appointment with JFrog to discuss that with them, as already discussed together — because, and I leave that thought here, if JFrog asks us for authenticated access for anyone using Artifactory, that will be a nightmare for us in terms of support. I had missed that, but I think you're correct: we need to change the page on jenkins.io to acknowledge that the UI instructions as listed no longer work and that you must use the other steps. It's a valid point — the goal is to avoid confusion — and then we can iterate. Drop a note like "we used to do it through the UI; it's currently broken on Artifactory; a fix might come soon", so everyone knows they have to use the command line.

Thanks Hervé — looks like you didn't have any issue migrating the bot applications from the system pool to the Linux pool? Nothing to report on that one. Okay. The goal is to move our application workloads onto non-system pools, because we want as few things as possible on the system pool.

We had an issue with the credentials for Artifactory Maven. And I'm asking for a code review on the repository-permissions-updater: I opened a pull request a few months ago; there isn't a clear consensus, but my proposal is to do at least one retry when the job fails, because the job runs every 4 or 6 hours,
so if the job fails and nobody notices and triggers a new one, it means the automatically-generated CD tokens for Artifactory reach their end of life and stay expired for one to three hours, leading to that kind of issue. We get one from time to time; most of the time it's okay, but my proposal is to do only one retry. Daniel did not really object, but said it would be better to have this built into the RPU system itself, because most of the time it fails due to GitHub being "on holiday", like last week. I just don't have the bandwidth or the knowledge to build it inside the RPU Java application, so for now I propose we add the retry, with a comment in the Jenkinsfile, to spare us that kind of support. If it fails twice in a row, then it means there is a real issue, and that becomes a monitoring matter. If anyone is interested in contributing this natively inside the application, please do help us.

And finally, that one was tricky: I thought that with Maven 3, specifying a local file-path repository for artifacts was forbidden or deprecated. It appears I was mistaken. So all this time, when we built and specified the ACP system, I did not even consider that case, but it appears we have it: with Maven 3 you can officially specify a local file-path URL as a repository, it's added to the list, and of course it works like every other artifact repository — meaning that by default it was proxied by ACP, so those users saw their builds failing. The idea was to define a technical ID — thanks for the help on that part. We should now document it: if you need a local repository with files, you need to use the specific ID that is excluded from ACP on our systems. Any questions or points on that one?

We had one issue closed as not planned: someone asked for a password reset and then never answered back. That happens; usually people only care about their own Jenkins.

Now back to work in progress. Install and configure the Datadog plugin on ci.jenkins.io: just to note, we didn't discuss this one last week;
it appeared during the week, but it's a really nice idea, because it will allow us to monitor a lot of things on ci.jenkins.io. We are not using Datadog on ci.jenkins.io for the Jenkins-specific metrics; we used to have the Prometheus plugin metrics, but that plugin was removed when we removed the Prometheus platform. And now we saw on infra.ci that the data in Datadog, when sent by the native plugin, is really useful and provides additional observability of Jenkins for us. That could help on numerous topics, including the BOM slowness — the fact that we cannot have 300 BOM parallel steps at the same time, otherwise ci.jenkins.io just dies for one hour — among others. That's why we added it to the milestone: it's essential for us to be able to observe that.

What's the status of that task? It's not working yet. The Datadog agent is running on the host while the controller is running inside a Docker container, so we have some networking to do to let them communicate — the controller-to-Datadog-agent network connection. If you need help on that part, please ask through our usual channels. But yeah, it's going in the right direction, because you already set up the whole Jenkins Configuration as Code: it's opt-in, only enabled for ci.jenkins.io. We don't enable it for trusted.ci — we don't want to send data from within the trusted controllers — but we do want it for ci.jenkins.io. So that's nice work; let's continue on this one. Is it okay for you to keep working on it in the upcoming milestone, Hervé? Cool. I might provide help and ideas on the network or socket part — you did the hard work; it's now a matter of finding the right network path, so there's not a lot of config left, and there is no blocker here from my point of view. Nice work, let's continue.

ci.jenkins.io using a new VM instance type: the new virtual machine and its environment were created earlier today; I'm currently working on running the
Puppet agent. The goal is to have a generation 2 virtual machine that costs a bit less than the current one, with better CPU performance and better system disk performance. We'll see if that helps in the BOM area — there might be issues due to the HDD system and the OS disk not being healthy on the current ci.jenkins.io machine. Under the hood it's also a way to get ci.jenkins.io migrated to Ubuntu 22.04 and onto the same network as the ACP and the new cluster we're working on — closer network, less latency, a lot of improvements. I expect to continue working on this and to be able to plan the migration either in the upcoming milestone or in two weeks at most; I'm quite optimistic on that part. It will require an interruption of ci.jenkins.io to do the whole migration, but that should be one hour, no more, no less.

Stéphane, you handed that task over to me for the upcoming milestone, but may I ask you for a status report, because I haven't done much on it since the handover, so you are the most up-to-date person. trusted.ci to Azure: you should be as up to date as I am, because I did the handover and told you — but I will update everyone. We are at the point where we have the three VMs, we have the networks with the subnets, we have the security rules between groups and the opening of the ports. So everything should be ready for the Puppet installation of the software. We should be at the point of installing all the tools with Puppet, and then going through the migration itself with the data: run the Puppet agent, validate, and start the data migration. I'm sorry I handed you the baby because I'm going on holidays. On my side, I validated almost all the security group rules, and the Puppet configuration is ready to roll; after adding the three machines to the Puppet list of machines, we should be able, later today or tomorrow, to start the first initial Puppet run. If it works, then I'm waiting to finish the security rules before starting the initial migration
of the data: the Jenkins home and the permanent agent. The tricky part will be finding a way to test the update center generation. I might do it manually on the new machines, because I don't want the update center to be uploaded from the new machine until the last minute, to avoid having both instances publishing; but on the other hand that will require a service interruption on trusted.ci to be sure we fully generate all the elements, because we might need to open firewall rules or change outbound IPs for the machines behind things such as the update center. So for sure there will be some hiccups after the final migration, because of the holidays. That one is a priority, like ci.jenkins.io. Thanks Stéphane for the work and the updates. You're welcome, and thank you for taking over while I'm away.

Excessive consumption on east-us: these are the virtual machine agents. We changed the kind of instance and the setup — thanks folks for helping me and reviewing in that area. I checked yesterday: the billing seems to have decreased a lot. We are using spot instances of a kind that is ten times cheaper in spot, so clearly it's worth the effort. There might be hiccups, especially for Jenkins core or long-running builds, if they get too many agent disconnections. I saw some, but I cannot evaluate yet whether it's a lot, whether it's blocking or annoying or slowing things down, or whether it's working as expected. We should wait one or two weeks before seeing the real impact on the billing. We have solutions if the spot instances break things, whether acceptance tests or the other kinds: we could have different sets of virtual machines, and the cost in any case is cheaper than what we removed.

There is an improvement path, though: with Tim Jacomb we should be able to quickly use the inbound agent mode for these machines instead of SSH. That would clearly increase stability with spot instances, because the time to detection by Jenkins is clearly shorter when an inbound agent drops the connection — over SSH, Jenkins isn't detecting the loss of connection, whereas the inbound
agent somehow does notify on loss of connection or death of the machine. It does in both cases, but in the case of SSH the inbound protocol is wrapped inside the SSH protocol: Jenkins communicates with the agent through a native Java SSH client, and the TCP and SSH keep-alive setup of that implementation is much longer, so it takes much more time to detect the classic "agent disconnected and did not reconnect properly". Also, I don't want to work too much on the network there right now, because we need to migrate ci.jenkins.io and its agents to the new network and assess the network performance once that is done and stable — right now we are operating these elements inside a network that has IP overlaps, which is a nightmare to maintain.

Stéphane, could you give us an update on the Azure ARM64 virtual machines, please? Yes — on the infra part it's running quite well. I switched the default VM used by the Packer building process, and it's now using an ARM64 machine, so we should save some money there because it's cheaper. I also set up an ARM64 Azure VM agent on ci.jenkins.io with the same image and everything, so we are ready to try to use it as much as possible to save some cash. I am still working on one tiny thing with updatecli: checking that the image is within the Azure gallery before offering to bump the gallery version number — for Windows. Sorry, not Windows for this one: ARM64. I'm working on it. Okay — that issue is scoped to infra.ci, am I correct? Yes. Given the title, is it closeable from your point of view? Not yet, because of that little bug at the top of the updatecli task; I'm trying to finish it today, sorry it's taking more time than I thought. No problem. Is there another issue for tracking ARM64 on ci.jenkins.io?
Not quite sure; I don't think so. Okay, I will take care of it, because the final step for ci.jenkins.io, if we had such an issue, would be to definitively remove the AWS ARM64 virtual machines; we should even be able to clean up any EC2 configuration from ci.jenkins.io. I'm wondering if it's not already done. I'm sorry, I'm not quite sure. Okay — is it okay for you if I either create or update the existing issue and do the check, since you will be on holidays? It's perfect, thank you so much; my memory is not quite sure either. Okay. One last mile on updatecli before closure, and I'll update the board to check the issue and refresh its status. Okay, cool.

Next major task: migrating our Kubernetes public workloads to the new cluster in the new network, publick8s. Takeaway 1: we will pay less money with the new cluster. Takeaway 2: we will have a non-overlapping network with better performance. Takeaway 3: the cluster is IPv6 compliant, so we should be able to publish the services running on it over IPv6, for our Indian friends among others. Hervé, what's the status, and did I miss something? The status: it's in progress, and you didn't miss anything. Currently working on the stateless services. It's also a great way to remember how the services are running, what the needs are, and the differences — like PostgreSQL, etc. Okay. It will also be a way to do some cleanup and import as code the Jenkins-related Azure DNS records, most of which are not as code yet. Okay, let me know when you want to focus on the PostgreSQL part, but I assume you'll do the stateless ones first. About DNS records, just a reminder: don't forget to set, if not already done, the DNS TTL to the minimum possible before the migrations. The goal: if we need a rollback, we don't want to wait 10 minutes for the rollback to be effective for everyone. Ideally do it 24 hours before a DNS record change. Is that okay for you, Hervé?
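That TTL-before-migration reminder boils down to a small timing rule: lower the TTL at least one old-TTL ahead of the cutover so every resolver cache has expired, and afterwards a rollback is visible within the new, low TTL. A hedged sketch of that arithmetic (illustrative helper names, not real tooling):

```python
# Rule of thumb for lowering a DNS TTL before a migration, as discussed
# above. Times are in seconds on an arbitrary clock.
def lower_ttl_deadline(migration_time: float, old_ttl_seconds: int) -> float:
    """Latest moment to lower the TTL so no resolver cache still holds
    the old, long TTL at cutover time."""
    return migration_time - old_ttl_seconds

def rollback_visible_within(new_ttl_seconds: int) -> int:
    """Worst-case delay before a rollback is seen by all resolvers."""
    return new_ttl_seconds

# Example: a record at a 1-hour TTL, migration planned at t = 24 h.
migration = 24 * 3600.0
assert lower_ttl_deadline(migration, 3600) == 23 * 3600.0  # lower by t = 23 h
assert rollback_visible_within(60) == 60  # with a 60 s TTL, rollback in 1 min
```

Doing it 24 hours ahead, as suggested in the meeting, comfortably covers even day-long TTLs.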
A word on the PostgreSQL side: right now we have, let's say, three services that are using a PostgreSQL database. We tried a lot of things, but it appears we need to create a new instance on the new network, and for the migration of these three we will have to stop the service, dump the database, import it on the new instance, and start the service from the new cluster. That's what Stéphane worked on with reports.jenkins.io a few months ago, and the same for Keycloak, when we migrated them from AWS to Azure. So we will have services stopped, but that's okay if we just do the announcement properly. We can start with Keycloak, because the only people impacted by Keycloak will be us — so Hervé, you should be able to do it without further announcements; just let us know at least one hour before, for internal synchronisation. Also, you discovered that some services are using separate databases that will need to be migrated in any case, because they were using the legacy PostgreSQL managed service in Azure. So we might need to create the new instance on the new network first and migrate these instances as well, so we can clean up four more services.

Another question we have is the upgrade of Azure storage accounts from v1 to v2. I'm not sure yet of the pros, cons and costs related to that, but it was proposed in the Azure portal when looking at their settings. May I ask you to open an issue describing this? We might have a few buckets for which the question stands. It's not mandatory for the migration as far as I can tell, but I might be wrong — I just wanted to be sure, because I might have missed an element. Is it okay for you to open an issue describing it, so we can have the discussion on the cost and the understanding? We might add the team on the discussion thread; you might have insight on that part. It might have an impact — I don't know if it's positive, negative or neutral — on the way the Kubernetes clusters are mounting the PVCs using
version 1 versus version 2. I'm almost sure that, for instance, if you want to use NFS instead of the default Samba mount, you need version 2: version 1 does not support presenting the blob storage to virtual machines as an NFS mount. That's one of the impacts — I'm 100% sure I saw that in the NFS status, it's a requirement, you need version 2. Let me write this down: version 2 versus version 1 for the bucket storage to be created. Good catch, thanks Hervé.

Stéphane, you opened an issue about a leftover disk to clean up on DigitalOcean, whose name starts with pvc-something. Yes, I just opened the issue, but I don't know what that PVC was used for; it could be a leftover from an experiment by either Hervé or me for the ACP doks-public cluster, or from someone else. I propose, Hervé, if you're okay, that we check together during the upcoming days: we take a coffee, we look at the state, and if we don't know, we can just delete it, because there isn't any sensitive data in DigitalOcean yet. Good for you? Thanks Stéphane.

DigitalOcean virtual machine agents instead of container agents — Stéphane, Hervé, what is that task about, and what is the status, please? I will start, because I did the first step: I struggled with Packer to build an image for DigitalOcean, to be used as the template image for the VMs we would spawn as agents from the controllers. I did manage to get a nice image on Intel/AMD, I pushed my code online and built one image, so now I'm giving the baby back to Hervé to play around with the Jenkins plugin to spawn the actual VMs. Anything to add? From my side, okay: as explained on the issue, the idea is to stop using Kubernetes agents on DigitalOcean, because the cluster cannot autoscale to zero and we cannot control the outbound security groups at a low level — we cannot forbid outbound SSH. Both of these are major reasons for us to switch to virtual machines. This is what we had asked Morgan to do at the beginning of the partnership, almost two years ago, so we
are back to that initial assessment that might be better at least for the billing so that's why we started these elements the limitation is that we only have Linux Intel AMD machine we don't have IRM Linux we don't have Windows machines on Digital Ocean but still that could help especially with the spot non spot issue we mentioned earlier about the virtual machine agent on CI Jenkins Any question or anything to add on that part Cool, thanks R.V.Ohai could you describe the next issue about clean up and import and manage data monitoring in Terraform created in 2016 which later has been duplicated as code in the Terraform repository and there was also some leftover like a confluence slowness monitor and two update center jobs on CI Jenkins.io monitors so I've removed the duplicated monitors and for the for the update center jobs we have to come up with an alternative solution as we don't want to install to open trusted CI Jenkins.io where these jobs are running but instead adding a creation of a file somewhere in a public place from these jobs so we can monitor and observe these jobs without opening access interested to be found or monitoring trusted.ci job update center as we do not want to use the debug inside the controller itself cool that was a lot of clean up and work thanks thanks thanks some groups running on the also declared in the pipette repository but the probes we don't need the data docs image is the same probes that were defined in the data docs are also defined in the pipette repository yes for good reasons it's because we need both the probes need to be on all the infrastructure on each machine virtual machines are managed by pipettes so that's why you have to define them on pipette and some machines are Kubernetes nodes on the node pool so for that you need a demand set with the elm chart data doc that's why you have a duplication of the probes so that's a good reason and the question is more do we need custom probes that's more the real question 
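Coming back to the PostgreSQL migration mentioned earlier, the steps (stop the service, dump the database, import on the new instance, restart) could be sketched roughly as below. This is a minimal sketch, not our actual runbook: the hostnames, the `keycloak` deployment name, and the database/user names are all hypothetical placeholders, and `DRY_RUN=1` (the default here) only prints the commands instead of running them.

```shell
#!/bin/bash
# Minimal sketch of the dump/stop/import/start migration, with placeholder
# names. Set DRY_RUN=0 only after adjusting every value to the real instances.
DRY_RUN="${DRY_RUN:-1}"
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

OLD_HOST="old-pg.example.com"   # legacy managed PostgreSQL (placeholder)
NEW_HOST="new-pg.example.com"   # new instance on the new network (placeholder)
DB="keycloak"                   # placeholder database name

# 1. Stop the service so no writes land during the copy.
run kubectl scale deployment/keycloak --replicas=0

# 2. Dump from the old instance and restore into the new one in one pipe.
run sh -c "pg_dump -h $OLD_HOST -U admin -Fc $DB | pg_restore -h $NEW_HOST -U admin -d $DB --no-owner"

# 3. Restart the service, now pointing at the new cluster.
run kubectl scale deployment/keycloak --replicas=1
```

The custom-format dump (`-Fc`) piped straight into `pg_restore` avoids writing an intermediate file, which keeps the downtime window shorter.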
Most of these probes could be replaced by Synthetics as of today, which would mean fully delegating all of that to Datadog. But the choice of having them on Kubernetes, on virtual machines, or both is not really an interesting question: if we need them, then we install them everywhere. The goal is to have multiple data points; that's statistics. Thanks.

Artifact caching proxy is unreliable: I haven't had time. No, I worked a bit on the inbound agent part. The goal is to move agents to a network closer to the ACP on Azure; that would solve the issue, because DNS is now served there and DigitalOcean is not used by the BOM anymore, so we should be able to leverage the limits of the ACP behavior. So the next step is checking that issue again once the agents are on a closer network. I intend to work on that in the upcoming weeks; now that we have validated that inbound agents are working as expected, all of these issues can be worked on. That's really a lot.

I just want to quickly cover a few new issues that are marked as triage, is that okay for everyone? I propose we keep them as triage: we just read the title and see if they are mandatory, or if we can postpone the triage to next week. Is that okay for you?

So, ARM64 node pool on publick8s, to start using ARM64 pods: we defer that, because Stéphane is going to be on holidays, and this targets the stateless services that Hervé mentioned are going to migrate to the new cluster. So that means, Stéphane, you weren't fast enough, so now you have to wait for the migration to be complete. I'm sorry, I'm slow, I know. So I will postpone this one and leave it as triage. In any case you can start creating the node pool, but you won't be able to migrate the services until the migration is fully finished for these services. Just a reminder: the idea is that some of our services, such as Javadoc, are just a web server serving files from the file system. So there is an opportunity here to run these services on machines that are based on ARM64 instead of Intel, because we use NGINX or Apache, which exist on both CPU architectures and have good performance on both, but the cost is clearly cheaper when using ARM64: it consumes less energy and costs less. So that's a good thing for numerous reasons, hence the proposal to use ARM64 node pools. The interest for us is to start managing an ARM64 node pool on Kubernetes; that could be useful for our builds as well in the future.

Upgrade to Kubernetes 1.25: that one is required to finish the Ubuntu 22.04 campaign, because that's the only way to migrate to Ubuntu 22.04 for the Kubernetes nodes on Azure. I would like to drive this upgrade. The previous upgrades were driven by either Hervé or Stéphane or both, but I'm interested in driving this one, just because it's something I want; there is nothing hidden there. That will be our gift to you. That upgrade campaign is part of getting us eventually off of Ubuntu 18, right? Because we've got one or two that are still running 18. Exactly; we don't have a lot of choice on Azure: if you have Kubernetes up to 1.24, you will have Ubuntu 18 below. It's a custom kernel with custom enforcement, so the security issues are backported by Azure, but I would prefer having Ubuntu 22, especially for the control groups behavior, because the new Ubuntu features a new control groups major version. So my proposal is to add this to the upcoming milestone; I want to start reading the changelog, sharing on that area, and preparing the issue.

Next one is the missing "adopt this plugin" topic on maintainers. That one doesn't have any triage label or milestone; it has been opened for discussion by Adrien, and there's already a question from Daniel. Oh cool, one year already, right. There's a blocking issue that has to be resolved before we could proceed with that, and Daniel's got the approach. So, any objection to keeping that issue there? It's a convenient way to discuss, but there is no expectation from the infrastructure side, so maybe it should move somewhere else. I'm not really sure what we're talking about: either we move it to the repository mentioned earlier, or we can leave it there without any milestone. Good idea. It's related to CRP on the plugin site too, yes, so maybe having the wrapper issue here is still good, because it's about two different repositories. We don't have a convenient way to use the equivalent of what an epic is on Jira; that issue would be an epic. GitHub Projects does not allow that in an easy way, because it adds an additional full component on top. So that's why it's okay that it's not on a milestone: we don't have any actionable here, and still it's in the discussion area.

DigitalOcean leftover disks to clean up: that one is part of a milestone, so that's okay, we can remove the triage label after this meeting.

Remove pull credentials for the Kubernetes clusters: the work from Hervé on Datadog showed that Datadog was the last component requiring credentials for pulling images on the Kubernetes clusters, and it doesn't anymore, because they defaulted to the gcr.io registry. They still provide an image on Docker Hub and you can switch to it, but that's not the default. That means we should be able to clean up some code. I don't have the willingness to work on that task for the upcoming milestone, so I propose to keep it as triage, unless someone feels it's important or is really interested in working on it. I don't mind either.

DigitalOcean virtual machine agents: okay. Pod garbage collector for the Jenkins Kubernetes clusters: that one is an answer to the concern from our friends at CloudBees, who are paying for our AWS accounts. They were concerned that if we increase the maximum limits of available resources for the BOM builds and we don't take care of cleaning up pods, some could still be left running; that's what happened with the virtual machines in March, so we cannot say it will never happen for the pods as well. That's why that one is a safety measure. I proposed different solutions on the issue, but the idea is a process that will delete, at least once a day, all the remaining pods, because we don't have any kind of usage on AWS that should keep a pod up and running for more than 6 to 8 hours. So yeah, if we remove all the pods that are detected as more than one day old, then it's okay and we should never have any problem. I think we can postpone this one by two milestones, is that okay for all of you? Because it will only be mandatory once we start going back to the BOM builds using bigger machines.

And I think that's all. Do you have other issues that I could have missed, that should go on the upcoming milestone, or that you want to raise now? None for me. Stéphane, Hervé, everything good for you? Everything is perfect. Cool, so I'm stopping the screen share, and I'm gonna stop the recording. See you next week, for people watching us!
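The pod garbage collector discussed above (delete any leftover agent pod older than about a day) could be sketched as follows. This is only an illustration of the idea, not the solution chosen on the issue: the `jenkins-agents` namespace is a hypothetical placeholder, and the actual `kubectl` wiring is left as a comment since it needs a live cluster.

```shell
#!/bin/bash
# Sketch of a "delete pods older than a day" safety net.
# Requires GNU date; the namespace below is a placeholder.
MAX_AGE_HOURS=24
cutoff=$(date -u -d "-${MAX_AGE_HOURS} hours" +%s)

# is_stale ISO8601_TIMESTAMP -> succeeds when the timestamp is older than the cutoff
is_stale() {
  created=$(date -u -d "$1" +%s)
  [ "$created" -lt "$cutoff" ]
}

# Real wiring (not executed here) would list agent pods with their creation
# timestamps and delete the stale ones, e.g. from a daily CronJob:
#   kubectl get pods -n jenkins-agents \
#     -o jsonpath='{range .items[*]}{.metadata.name} {.metadata.creationTimestamp}{"\n"}{end}' \
#   | while read -r name ts; do
#       is_stale "$ts" && kubectl delete pod -n jenkins-agents "$name"
#     done
```

Since no legitimate build on that cluster runs longer than 6 to 8 hours, a 24-hour cutoff leaves a comfortable safety margin before anything gets reaped.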