Hello everyone, welcome to the Jenkins Infrastructure Weekly Team Meeting. We are the 9th of January 2024. Today we are four at the virtual table: Kevin Martens, Stéphane Merle, Hervé Le Meur, and myself, Damien Duportal. Mark Waite will join us later. First of all, a word about the new weekly, 2.440. The build failed after 2 hours and 30 minutes, due to GitHub issues, which means the release is currently delayed. I've asked someone to trigger the build again, and we will do it at the end of the meeting if it's not done. Build to be retried. Looks like today wasn't a good day for Microsoft: not much showed up on their status page until the last minute, but we had network issues yesterday and particularly this morning. Same on the GitHub data center, so I don't know what is at stake, but something is wrong — someone cut the fiber. Right now they sent a tweet saying it has been resolved. Yep, that's what the GitHub status page says, but we had a lot of network hiccups in that same data center. Anyway, that means the release is delayed by a few hours.

Do we have other announcements? I don't. Do you folks? Yes — sorry, it's recording. If you don't have an announcement, a word on the upcoming calendar. The next weekly meeting is next week, of course — the 16th, is that correct? Oh no, I do have an announcement, sorry: I won't be able to run the meeting next week. In two weeks I will be able to, and in three weeks I won't. So I need two people to cover for me, one next week and one in three weeks. I will say the first one Hervé and the second one Le Meur. Good try. Hervé, you need to defend yourself. Next week, and in three weeks — Hervé, are you around? Three weeks from now will be the 30th, and we need someone to run the meeting. Don't worry, we will manage. Yeah, okay. So next week is the 16th. Tomorrow there is something — I think it's the release candidate, is that correct, for the next LTS? Next baseline, yes. Yes, the release candidate, tomorrow: 2.426.3. The LTS release itself happens in two weeks, on the 24th of January. I think everyone is around that day? It's a Wednesday. Yes. I'll be there as well, just to be sure.

Is there any security announcement? No, there isn't any. And of course, major events: FOSDEM, the 3rd and 4th of February. I saw SCaLE mentioned, but I forgot when it is — I know that Mark and Alyssa are trying to go. SCaLE, March 15th. Thanks. March. And an infrastructure project member will most probably be there too, so you will see people if you are on the West Coast. Any announcement or calendar question?

So let's start with this week's milestone — the tasks we were able to finish. We helped users who had problems with their account creation because they appeared to be still logged in. Since the antispam measures were handled, the solution for them was to clear their browser cache, or to use another browser or a private session. They confirmed it was okay.

We fixed a problem with the artifact caching proxy (ACP) for repo.do.jenkins.io, caused by me forgetting to clean the ACP cache two or three weeks earlier, when we finished the operation on repo.do.jenkins.io: we removed repositories from the ACP so it would use Maven Central for dependencies. By doing that, we changed the internal representation of the Maven dependencies, which resulted in the ACP serving cached content whose checksums differed from what Maven Central serves. That caused build problems. The issue was resolved quickly by doing a full cleanup of all the ACP instances. Any questions?
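To make the checksum mismatch concrete, here is a minimal sketch that downloads the same artifact from Maven Central and from a caching proxy and compares both against the published SHA-1. The proxy URL is a hypothetical placeholder, not the real ACP endpoint:

```python
# Minimal sketch: compare an artifact's SHA-1 as served by a caching proxy
# against Maven Central. The proxy hostname below is a placeholder assumption.
import hashlib
import urllib.request

GROUP_PATH = "org/apache/commons/commons-lang3/3.14.0"
JAR = "commons-lang3-3.14.0.jar"
CENTRAL = f"https://repo.maven.apache.org/maven2/{GROUP_PATH}/{JAR}"
PROXY = f"https://artifact-caching-proxy.example.org/maven2/{GROUP_PATH}/{JAR}"  # hypothetical

def sha1_of(url: str) -> str:
    """Download the file at `url` and return its SHA-1 hex digest."""
    digest = hashlib.sha1()
    with urllib.request.urlopen(url) as response:
        for chunk in iter(lambda: response.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Maven Central publishes the expected checksum next to every artifact (".sha1" file).
expected = urllib.request.urlopen(CENTRAL + ".sha1").read().decode().split()[0]
for name, url in (("central", CENTRAL), ("proxy", PROXY)):
    actual = sha1_of(url)
    print(f"{name}: {'OK' if actual == expected else 'MISMATCH'} ({actual})")
```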
We had an issue with the jenkins-infra contributor spotlight project, which was created by Chris and built from several different repositories. While changing its coverage, we also removed a plugin from our Jenkins instances that we should not be using. That plugin is unmaintained — no release for about four years. It's called GitHub Autostatus. It makes sure that when you have a pipeline running through GitHub Branch Source, it automatically sends a GitHub check for each pipeline stage, and there is no way to opt out. So we had checks on our pull requests where every pipeline stage sent a new check, and we had branch protection rules that required some of those checks. By removing the plugin, our jobs started hanging: without anything explicit in the pipeline, jobs were implicitly and automatically blocked on per-stage checks that would never arrive again, leaving pull requests blocked. It was fixed because an alternative was already present on our instances, the GitHub Checks plugin, and that one can be configured through our Helm chart. So we did that, at least on the CI jobs and the contributor spotlight jobs, and we could update the branch protection rules. If you see any other of our jobs in jenkins-infra stuck with a pending check that never resolves — while all the other checks are green, meaning the pipeline started and sent checks, just not the one that was expected — that means we still have to fix that job. So reopen the issue, comment with the jobs showing the problem, and we can use that to fix them. What else? Okay.
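As a quick illustration of how those stuck contexts can be diagnosed, here is a minimal sketch listing the check-runs attached to a commit through the GitHub REST API; the repository and ref are placeholders:

```python
# Minimal sketch: list the check-runs attached to a commit, to spot checks
# stuck in "queued"/"in_progress" after the plugin that emitted them is gone.
# Repository and ref are illustrative placeholders.
import os
import requests

OWNER, REPO, SHA = "jenkins-infra", "kubernetes-management", "HEAD"  # placeholders
url = f"https://api.github.com/repos/{OWNER}/{REPO}/commits/{SHA}/check-runs"
headers = {
    "Accept": "application/vnd.github+json",
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
}
for run in requests.get(url, headers=headers, timeout=30).json()["check_runs"]:
    # A run whose status never reaches "completed" keeps a required
    # branch-protection context pending forever.
    print(f"{run['name']}: status={run['status']} conclusion={run['conclusion']}")
```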
Thanks, Hervé, for the Blue Ocean work on the Jenkins weekly and weekly.ci setups. Can you remind us of the stakes of this task, please? The stake is being able to differentiate the plugins installed on each instance. On weekly.ci.jenkins.io, which is mainly a demo instance — so, it's not my fault — Alex wanted to remove Blue Ocean and its dependencies from the weekly setup, which blocked us because we wanted to keep it, for now, on weekly.ci.jenkins.io. So I reworked the representation so that we can selectively enable plugins depending on the instance we want. Now the weekly variant with Blue Ocean remains available, through a dedicated suffix on the tags. Any questions? It's clear enough for me, thanks. Related to that, I had forgotten a cleanup when that setup changed: there was a GitHub repository to be archived; that's done and in the past now. Another part of it concerned the Blue Ocean variant itself: everything was declared, a new test was run, and we saw the variant change as intended.

Another change is also in effect. There was an issue spanning several pieces of the JCenter remote repository, which we wanted to remove, and which we could filter by groupId/artifactId down to just a few artifacts. Because it duplicated everything, it was removed, and proper filtering was added under another JCenter remote name holding just a few megabytes of artifacts. The goal is to upgrade, over time, every plugin we use, so that this JCenter remote should also be retired in a few months. Thanks, Basil, for that.

A new GitHub App was installed, dedicated solely to the repository-permissions-updater (RPU). It allows only the specific operations needed by that integration chain. We used to go through the IRC integration; now we use proper, modern API calls to operate on the RPU. It lets the administrators access and merge the RPU changes and pull requests, and it's scoped specifically to that application.

Also, Alex asked to be an administrator on ci.jenkins.io. So we added him to the group in which we define admin permissions just for ci.jenkins.io — not a broader group. The goal was to allow him to replay builds for Jenkins core and the plugins; that helps him a lot. ci.jenkins.io only, to allow replaying builds. We also helped a contributor by removing the Cobertura plugin, which had been installed at some point, from ci.jenkins.io. I don't know why the dependency was there, but what is sure is that this dependency chain was not installed on infra.ci.

And ci.jenkins.io now offers build badges: a contributor asked for the embeddable-build-status plugin, as we discussed last week. After checking, there are no outstanding security issues on the badge plugin, and Mark and Darin Pope now maintain it, so there was no reason not to do it. It has been modernized, so it was okay to install, and we have a happy contributor with this change. Have we seen any problems since installing it — a dramatic increase in load or something like that? Any exceptions? I haven't checked, to be honest; I only looked at the metrics, not the logs. The most important thing to check would be the web load of ci.jenkins.io, but I don't think it adds much volume, because these are just tiny images. I think I remember doing an evaluation when we discussed adding the plugin, and it was really very low, so it looks fine. But yes, if a lot of bandwidth shows up on that endpoint, it would be the first suspect, and we should check the logs eventually. Thanks.
We fixed the crawler job; it had a lot of internal problems. First of all, we had expired credentials: an Azure Storage account credential, but also a Cloudflare API token and a Cloudflare R2 one. It was a good opportunity to, first, renew the tokens — renewing the credentials was of course the core of the issue — and also to restrict which origins can access those buckets, on both Cloudflare and Azure. For R2 it was easy, almost. For Azure it was a bit more complicated; it's worth explaining both.

First, on the new subscription we have two networks per controller-and-agents setup. We had to create an additional NAT gateway with a fixed IP. The reason is that network peering on Azure does not, with our setup, route outbound Internet requests from one network through the other. It's technically possible, but it requires a lot of configuration for nothing compared to the cost of creating a NAT gateway and a public IP on the new subscription. So now, for instance, we have two public IPs: one for outbound requests originating from the permanent agents and the ephemeral agents, and one for the controller. We have the same on trusted.ci. We had to do that to enforce an IP restriction on Cloudflare R2, because otherwise the outbound IP kept changing.

In the Azure case, the public IP restriction doesn't work at all, and here's the trick: when you reach a storage account from within Azure, traffic does not leave through your NAT gateway and its public IP — it is routed internally by Microsoft, so the public IP you expect is never seen. The Microsoft documentation says that if you need to reach a storage account privately from a virtual network, you have to enable what is called a service endpoint. It's a per-subnet setting: once enabled, requests to a name like whatever.file.core.windows.net are routed through Microsoft's internal system with the subnet as the identity. Then you can restrict access on the storage account by specifying the allowed subnets or virtual networks, or by using Active Directory. In our case, we use the subnet: we just had to enable the service endpoint on the agents' subnets and specify, on the storage account, the list of subnets requests may come from. It was also needed on the private subnet, because the agents manipulate the storage via Terraform. And I realized I had forgotten the public subnets, which our containers need to reach the storage account — warning: missing public subnet for updates.jenkins.io. I forgot this, but I caught it.
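For illustration, here is a minimal sketch of that service-endpoint setup using the Azure CLI from Python; resource names are hypothetical, and the real configuration lives in Terraform, not in imperative commands:

```python
# Minimal sketch of the service-endpoint approach described above, driven by
# the Azure CLI. All resource names are hypothetical placeholders.
import subprocess

RG, VNET, SUBNET, ACCOUNT = "infra-rg", "agents-vnet", "agents-subnet", "updatesstorage"

def az(*args: str) -> None:
    """Run an az CLI command, raising on failure."""
    subprocess.run(["az", *args], check=True)

# 1. Enable the Microsoft.Storage service endpoint on the agents' subnet, so
#    traffic to the storage account keeps a stable, restrictable identity
#    instead of an unpredictable public IP.
az("network", "vnet", "subnet", "update",
   "--resource-group", RG, "--vnet-name", VNET, "--name", SUBNET,
   "--service-endpoints", "Microsoft.Storage")

# 2. Allow that subnet on the storage account, then deny everything else.
az("storage", "account", "network-rule", "add",
   "--resource-group", RG, "--account-name", ACCOUNT,
   "--vnet-name", VNET, "--subnet", SUBNET)
az("storage", "account", "update",
   "--resource-group", RG, "--name", ACCOUNT, "--default-action", "Deny")
```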
And finally, Stéphane started the work of using updatecli to detect credential expiration: detect that a credential will expire within 10 days and open a pull request proposing to change the expiration date. Those pull requests then require an action from us, which means rotating the credential: let the others review the pull request, then manually generate the new credential and put it in the location where it belongs. So at least we get something actionable on the expiration date, which is more than just an alert — it's a pull request prepared for action, and then we can proceed. Is that a good summary, Stéphane, or is there anything to add about the work you're doing on this part? I think it's perfect; you said it well. Cool. Any questions?
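The detection side boils down to a date comparison; here is a minimal sketch of the idea in Python, where the inventory file and its format are assumptions for illustration (updatecli itself is driven by its own manifests):

```python
# Minimal sketch of the idea behind the updatecli job: read each credential's
# expiry date and fail (so CI can open a rotation pull request) when it falls
# within 10 days. The metadata file and its format are assumptions.
import json
import sys
from datetime import datetime, timedelta, timezone

with open("credentials-metadata.json") as fp:  # hypothetical inventory file
    metadata = json.load(fp)

soon = datetime.now(timezone.utc) + timedelta(days=10)
expiring = [
    name
    for name, entry in metadata.items()
    # "expires_at" assumed stored as ISO-8601 with offset, e.g. "2024-04-09T00:00:00+00:00"
    if datetime.fromisoformat(entry["expires_at"]) <= soon
]
if expiring:
    print(f"Rotation needed for: {', '.join(expiring)}")
    sys.exit(1)  # non-zero exit lets the pipeline open the rotation PR
```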
Okay, so the next task is "Declarative Pipeline Migration Assistant plugin no longer compiles". This is another consequence of the JFrog operation we did last year: the plugin had dependencies that only existed on the JCenter remote, used to pin older versions. The maintainers, including Basil and Mark, worked on upgrading the dependencies, so we no longer need anything that exists only on JCenter, and the dependencies now use modern versions from Maven Central. That allows us to remove a few more of those JCenter-only entries. In short: another consequence of the JFrog operation, and updating the dependencies removed the problem. Any questions?

Okay, moving to the next one: the DNS domain. Nothing to say — it was renewed as expected and the issue closed. Thanks, Tyler, for sponsoring the Jenkins project.

Next one: get.jenkins.io is now using the mirrorbits-parent chart, the brand-new chart built by Hervé in the context of the new update center proof of concept. It allows us to control the two or three components separately and to avoid repeating some critical elements, specifically the persistent volume used and served by both mirrorbits and Apache. As a side consequence, httpd now runs on ARM64, so less workload for the Intel nodes. We had two downtimes while doing that operation. Today's was anticipated; yesterday's should not have been a downtime, but I made a dumb mistake: I forgot one of the TLS setups, so for two minutes get.jenkins.io was answering "hey, I don't have a proper certificate for TLS". Okay, but you detected it within minutes. Yes, the total unavailability was two minutes and 15 seconds. Excellent. Still too much for a typo like this. Yes, but two minutes is better than two hours. Thank you. And prove that you're not a robot. I like it. I'll be back.

Today there was another consequence, due to the way Helm works. Part of the plan was to rename and rotate: remove the former mirrorbits chart, which was an all-in-one, and replace it with mirrorbits-parent. But one of the subcharts was named mirrorbits-lite, and that one had to be renamed to mirrorbits, to anticipate a possible future official mirrorbits chart. When you rename a chart like that, Helm is a pain: it says "I will reinstall everything, because I won't reuse the existing resources." Your alternative is to try to edit a big encoded blob in the release Secret on Kubernetes — gzipped JSON plus YAML, a mix of both — that holds the whole history of all the resources and changes, and to make sure every annotation maps to each element of the file, one by one. That would have been... So you mean you're a robot, in fact? Okay. No — I told Helm to uninstall and reinstall the whole thing. It took five minutes instead of two because Azure was not really in good shape this morning. It's not better this afternoon; I had to relaunch the job three times. So yeah: five minutes, due to an Azure network issue.
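For the curious, the "big encoded blob" is real: Helm 3 stores each release revision in a Secret named sh.helm.release.v1.NAME.vREVISION, whose payload is base64-wrapped gzipped JSON. A minimal sketch to decode one (release name and namespace are placeholders):

```python
# Minimal sketch: decode what Helm actually stores for a release, i.e. the
# gzipped JSON inside a Kubernetes Secret mentioned above.
import base64
import gzip
import json
import subprocess

SECRET, NAMESPACE = "sh.helm.release.v1.mirrorbits.v1", "get-jenkins-io"  # placeholders

raw = subprocess.run(
    ["kubectl", "get", "secret", SECRET, "-n", NAMESPACE,
     "-o", "jsonpath={.data.release}"],
    check=True, capture_output=True, text=True,
).stdout

# Secret data is base64; Helm's payload inside it is base64 of gzipped JSON,
# hence the double decode before decompressing.
payload = json.loads(gzip.decompress(base64.b64decode(base64.b64decode(raw))))
print(payload["name"], payload["version"], payload["info"]["status"])
```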
But now, thanks to this change, httpd is running read-only, so a compromise of httpd cannot write files to that directory. The next one to harden will be mirrorbits. And from now on, any change to the mirrorbits-parent chart affects both services, which means — and that's important for Stéphane, Hervé, and me — that when we change mirrorbits-parent we can no longer merge as quickly as we used to. Before, it only backed a proof of concept and we didn't care if we broke it (and by the way, it is broken today). Now we have to be careful, because we can break get.jenkins.io when we deliver a new version of that chart. All right. Any questions? No. Okay.

Tuning the node pool size. The goal was to start spending less money, thanks to the ARM64 services already migrated on the public cluster. In order to do that, we had to pack the remaining Intel services, such as mirrorbits itself or the ACP. Some will be migrated easily, such as the ACP; some will require additional effort, such as the LDAP; and some have undefined deadlines, such as mirrorbits, because despite the new maintainer we haven't seen a new release yet. Right now we were able to gain 25% on the theoretical monthly cost — to be checked against this month's bill — because Stéphane worked on shrinking the size of the Intel nodes: twice the size, twice the price; half the size, half the price. Better that way. What we realized is that some of our services were requesting far too much memory: their 90th-percentile usage was clearly below what they requested. So we were able to shrink the requests while keeping the limits as the kill threshold for those services in case something goes wrong. By doing that, we went down from five tiny nodes to three, which is effectively three quarters of the two big nodes we used to have. The next step is to continue and finish what we can on ARM64: the ACP, the LDAP, and I think there is a fourth one, but I don't remember which. Any questions?
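A minimal sketch of how such over-requesting can be spotted, comparing memory requests against live usage from the metrics API (assumes metrics-server is installed and a kubeconfig is available; the namespace is a placeholder):

```python
# Minimal sketch: flag pods whose measured memory usage sits well below their
# request, the situation described above. Assumes metrics-server + kubeconfig.
from kubernetes import client, config

def to_mib(quantity: str) -> float:
    """Parse the common Kubernetes quantity suffixes into MiB."""
    units = {"Ki": 1 / 1024, "Mi": 1, "Gi": 1024}
    for suffix, factor in units.items():
        if quantity.endswith(suffix):
            return float(quantity[: -len(suffix)]) * factor
    return float(quantity) / (1024 * 1024)  # plain bytes

config.load_kube_config()
namespace = "default"  # placeholder

# Live usage per pod, from the metrics.k8s.io API.
usage = {
    item["metadata"]["name"]: sum(
        to_mib(c["usage"]["memory"]) for c in item["containers"])
    for item in client.CustomObjectsApi().list_namespaced_custom_object(
        "metrics.k8s.io", "v1beta1", namespace, "pods")["items"]
}

for pod in client.CoreV1Api().list_namespaced_pod(namespace).items:
    requested = sum(
        to_mib(c.resources.requests["memory"])
        for c in pod.spec.containers
        if c.resources.requests and "memory" in c.resources.requests)
    used = usage.get(pod.metadata.name, 0.0)
    if requested and used < 0.5 * requested:
        print(f"{pod.metadata.name}: uses {used:.0f}Mi of {requested:.0f}Mi requested")
```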
And finally, congratulations, Hervé: DigitalOcean gave us 20K for this year as a sponsor. That means we don't have to move archives.jenkins.io again, and it allows us to use archives.jenkins.io as a fallback for get.jenkins.io, which will solve a lot of our issues — the sponsorship covers most of the bandwidth bill, though we still have to pay for some. So thanks, Hervé. I believe DigitalOcean asked us on the private channel whether we could publish a blog post and/or present something. Hervé, we discussed that yesterday — is it still okay for you to take care of, or at least lead, that part with our help? Cool. And DigitalOcean asked us to make sure we list them with attribution on a sponsors page. Basil Crow has an action item from our governance board meeting to rework and create a dedicated sponsors page and put DigitalOcean in the correct location there. Right now they're listed, but only with a single hyperlink. Their sponsorship is probably enough to get them into the second tier from the very top. In the top tier, there's likely only one sponsor, CloudBees, that's that big; so they would be on par with GitHub, JFrog, and others. Yeah. Thanks for the help, then. Any other question on DigitalOcean and such? Okay. Next one then. Okay, no — that was the last one. We had two issues closed as not planned; I don't want to spend time on those two, they were invalid issues.

And now, work in progress. For each of these issues we have to evaluate whether we keep them for the next milestone, and what the status of each is. Alex opened an issue because, since the 5th of January, some tests on the Jenkins core pull requests were reaching a timeout, making all the builds fail. Alex's initial assumption was that it started the day we deployed a new Linux agent version, 1.45. We tried a newer version, 1.46 — still the timeout. Then we tried to roll back yesterday — still the timeout. And today, now that everything is back to normal with the latest available version, it looks like it's working. I have no idea. It's only on JDK 17 and 21. Well, JDK 17 and 21 are the main ones we test right now; we're not testing Java 11 in general, but we were seeing timeouts — we were seeing timeouts until the artifact caching proxy reset, on the bill of materials. So my theory was the ACP, and that flushing and resetting the cache did what we needed. Except for Java 21 — for that Java 21 thing we have a separate issue. Absolutely. But yeah, JDK 11 — oh, JDK 11 is still tested there. Oh, my mistake. Good. Okay. And Windows JDK 17 was working. JDK 11 looked like it sometimes worked and sometimes didn't, and 17 and 21 were always failing. So the ACP could be the cause, since we flushed it. But that was hard to pin down, because it's hard to deep dive into the test report: Jenkins reports there is a failure on a test, but which test, what does it do, what is the exact output of the test? It was hard to find. Right. The good thing is that I've now replayed the pull request and the test is no longer failing on any JDK, so that should be good. I will let Alex close the issue once it's confirmed. Any question on this issue? Issue to be closed, then, pending Alex's confirmation.

Next one. I've volunteered to drive this topic. We had a request from this GSoC project about having a docs.jenkins.io. That's one of the outcomes of this year, and the goal is to have versioned documentation for Jenkins. That requires some infrastructure action, from the domain name to preparing the production hosting, defining the builds, deployment, etc. It's a bit the same idea as the current jenkins.io and the work being done for the contributor spotlight, except it won't be the same architecture: docs.jenkins.io will be hosted by us, not by Netlify, and we will need Fastly in front of the project. So there might be subtle differences. Is it still okay for you to take care of that topic as well? Yes. Cool, so we keep it for the next milestone. What are the leads? DNS, architecture, etc. Any question on this topic? A question from me: the intent is to use Fastly for this, even at this early stage? It doesn't cost anything. Oh, good — I don't think so. Very good. Okay, I wasn't aware, so that's great, that's wonderful. I was assuming we couldn't do it with Fastly; if you're saying it's doable, then by all means do it. Great, thank you.

Something that needs to be discussed, because we forgot to do it for the contributor spotlight: we need a runbook describing the service and all its architectural elements. That might even be mandatory before deploying the service itself. I don't mind helping, for the contributor spotlight or for this one. But we need these runbooks to document what we do, because when the service goes down and we have to analyze it, searching for the issue across thousands of commits won't be possible at all. That's why we need a runbook to represent the service: just the main pointers, links to code — some kind of glue between the bricks. The contributor spotlight is easy, it's a tiny website, but docs.jenkins.io will become a major thing once we've migrated. That's why we need this. Is that okay for you, Hervé? Yeah, sure. No questions? Not for me. Okay.

The next issue: Daniel reported a problem with uplink. It's not top priority, but it was a download failure when querying data from 2023, around the Christmas days. What happened is that data in the table is corrupted. So I'm currently working with a few PostgreSQL internals — re-indexation, vacuum, and searching for the corrupted segments — and we'll see if we can fix it. There's a broader point: the events table has 300 billion records. Maybe we will have to start thinking about either moving part of the old data to cold storage or changing the application, because it won't scale indefinitely. Right now a SELECT COUNT(*) on the table, even with the index, takes roughly 30 minutes, and re-indexing the whole table takes six hours. PostgreSQL works — there is no problem — but it starts to be complicated to manage. How many rows? 300 billion. Okay. Yes, that's a lot of events. I mean, that's huge. Not from my point of view, after 10 years using PostgreSQL; the problem is not whether we have a lot of records. The problem is that it's a single table with a single model, and that's hard to operate. At some point we need — I already forgot the wording — to split it up, such as moving the old data to cold storage. Anyway, it's currently running work. I'm logging all the documentation and commands, by the way; it's a screen session running in the background on the private VPN machine, because that machine has a network interface on the public DB network. It's easy to run, and anyone can operate it if I'm gone. Trying to write this down takes time, but it's not a priority, as we mentioned last week. If you're searching for ideas, yes, working on the uplink modernization or its stability could be good. Any questions? Nope. Okay.
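The standard PostgreSQL answer to that "single huge table" problem is declarative partitioning (likely the forgotten word above); here is a minimal sketch of the idea, with table and column names that are assumptions about the uplink schema rather than the real one:

```python
# Minimal sketch of the partitioning idea: range-partition an append-only
# events table by month so old partitions can be detached and archived to
# cold storage. Table/column names are assumptions, not the uplink schema.
import psycopg2

DDL = """
CREATE TABLE events_partitioned (
    id         bigint      NOT NULL,
    created_at timestamptz NOT NULL,
    payload    jsonb
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2023_12 PARTITION OF events_partitioned
    FOR VALUES FROM ('2023-12-01') TO ('2024-01-01');
"""

# Detaching a partition is a quick metadata-only operation; the detached table
# can then be dumped to cold storage and dropped, instead of deleting billions
# of rows from one huge table.
DETACH = "ALTER TABLE events_partitioned DETACH PARTITION events_2023_12;"

with psycopg2.connect("dbname=uplink") as conn, conn.cursor() as cur:
    cur.execute(DDL)
    cur.execute(DETACH)
```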
Java 21 intermittent out-of-memory error. I feel like this one will be fun. Basically, some builds on Jenkins core fail — sometimes, but not always — with an out-of-memory issue. I can't explain why we see an out-of-memory but the builds still continue. At least, closing the gap: an OOM kill should have the agent killed, no retry should happen, and Maven should not be able to finish the process. So there is something I don't understand here. The OOM seems to be reported by the JVM process itself, so maybe it's a child process of the JVM that was OOM-killed and is reporting the kill — but we're not really sure; we have to check carefully. We have the name of the agent, so we can check the logs and metrics on Datadog. Basil also gave us some hints. One of the pointers is that we are only using four- or eight-gigabyte memory pods, I don't remember which. Trying those builds on pull requests using virtual machines instead of container agents could also help, because our virtual machines have way more memory — it might be that just a bit more memory is required. I believe we need help here, because this can be time-consuming for people not at ease with Java. However, since it's infrastructure-related, yeah — there's a fine balance to find. Something I wasn't aware of that's important to share: there is a jvm.config in the .mvn directory, so Maven reads those JVM settings in addition to the usual settings.xml. So maybe that file can be changed, or we could increase the memory size of the agents. So, back to the logic: it's not that the pod is being killed, but the Java process executing inside the pod — one of the Java processes, that's the thing. Right. And if it's a Java process, does it take the whole pod down, or is the agent just reporting the OOM kill? That needs to be investigated. Okay. But yeah, we have to work on this one. It's not a blocker — as Mark mentioned last week, builds are retried — but it's quite the annoyance, and full JDK 21 support is important. I've read recently — I will share the article — that the new garbage collector in JDK 21 has a weird behavior, which is multi-mapping memory. It's not OOM-killed and it's not directly related, but you can see much higher apparent memory use without being OOM-killed, because the RSS of the allocated memory is reported three times: you see the summed-up value, while physically you only use a third of it. I will share the article with you folks; it can be interesting when you look at JDK 21 memory dumps or memory metrics on Datadog — which might be the case here.
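One cheap diagnostic, sketched below: ask the JVM inside the agent container what maximum heap it derives ergonomically, to check whether the pod's memory and the forked JVMs' expectations line up. The flags are standard HotSpot options; where this gets run (container versus VM agent) is up to the investigation:

```python
# Minimal sketch: print the max heap the local JVM derives from its
# environment, a quick check of whether a 4 GiB pod leaves Maven's forked
# JVMs enough room. Both flags are standard HotSpot options.
import re
import subprocess

out = subprocess.run(
    ["java", "-XX:+PrintFlagsFinal", "-version"],
    check=True, capture_output=True, text=True,
).stdout
for flag in ("MaxHeapSize", "MaxRAMPercentage"):
    match = re.search(rf"{flag}\s*=\s*(\S+)", out)
    if match:
        print(f"{flag} = {match.group(1)}")
```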
Next issue: still the JGit cloning not converting line endings on Windows. Last week another contributor confirmed the same thing James saw: when using a Windows container agent for the JDK it works, but not when using our Windows virtual machine agents. So, is it related to the way we configure agents? There are two different plugins that spawn the agent process in two different ways, so it could be the agent process starting with different setups and having a direct impact on JGit. Or is it something in the environment of the container versus the virtual machine that differs and that JGit reads from the Git installation? Or is it something else entirely? I still don't know. Mark, you said you wanted to take that issue — I don't know if you have clues or pointers here, in your area. I think we've got enough here; let's keep it with me. There's some difference between the virtual machine and non-virtual-machine setups that needs to be explored further. I think it's very fair for me to be the one who investigates it: I'm certainly well connected to the git plugin, and I have plenty of shame about some of the code in it. Okay, thanks. So let's keep it open and on the next milestone. It's not a blocker: if you see a contributor with that problem, just ask them to convert, once, all the files in their repository to Unix line endings, and their problem will be fixed. Yeah. And I have no objections if we drop it from the next milestone and say Mark will get to it when he gets to it — I'm okay with that as well; either is fine. Okay, because in terms of priority, I'm much more concerned about the intermittent memory failures on the Java 21 builds, since those are affecting Jenkins core. Right — for this one, James Nord's workaround is working just fine, so while it's a real problem, it's not nearly as interesting to the community as the Java 21 builds are. Okay.
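The "convert once to Unix line endings" workaround can be scripted; a minimal sketch, where the set of extensions treated as text is an assumption to adjust per repository:

```python
# Minimal sketch of the workaround: rewrite every tracked text file in a
# checkout from CRLF to LF. The extension list is an assumption.
from pathlib import Path

TEXT_SUFFIXES = {".java", ".md", ".xml", ".yml", ".yaml", ".groovy", ".txt"}

for path in Path(".").rglob("*"):
    if ".git" in path.parts:  # never rewrite Git's own metadata
        continue
    if path.suffix in TEXT_SUFFIXES and path.is_file():
        data = path.read_bytes()
        if b"\r\n" in data:
            path.write_bytes(data.replace(b"\r\n", b"\n"))
            print(f"normalized {path}")
```

Pairing that with a .gitattributes rule such as "* text=auto eol=lf" keeps the repository normalized afterwards.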
Okay, I think I'll move that issue — the symbolic link for "latest" on Windows — back to the backlog, because we need... which one is it? It's the blobxfer one: "blobxfer needs azcopy" has its requirements listed below, so nothing to say. Hervé: blobxfer versus azcopy — I remember that you did some work before going on holidays; can you report the status here, just for the sake of sharing? I'm trying to use a service account to generate the SAS token for the file share, instead of always using the same file SAS token, as Tim Jacomb mentioned it was a bit critical to keep using one static SAS token. From my last tests at the end of December, I still need to check whether I can use this generated SAS token with the file share, which is not certain at the moment. I don't know how we will proceed if we can't use them. Based on the work Stéphane and I did last week on the SAS token that had expired, it looks like we should be able to revoke a SAS token by changing its expiry date. If that's the case, we just have to write a runbook, and we can revoke a token if it is accidentally shown in clear text. If it doesn't work, then we have to confirm whether SAS tokens can be revoked through the expiration date. Since we already rotate by changing the expiration date — it's a three-month window in our case, as we decided — that could be the solution: revoke by moving the expiry into the past. Is that okay for you, Hervé? Yes. Is there something else in that area? No, I have to start working on it again.

Migration leftovers from publick8s to ARM64: these are the last survivors on Intel. The artifact caching proxy — and Datadog and Falco we don't care about, because those are DaemonSets: remove the machine, and the DaemonSet pod goes with it. Mirrorbits is still problematic, though; we still have Keycloak and the LDAP; and for updates.jenkins.io, same idea as mirrorbits: httpd, I think, is already on ARM64, but mirrorbits is still running on Intel. Stéphane and I are sharing the work here (I see Stéphane dropped — maybe my network issue). The artifact caching proxy should be easy to do; expect something later today or tomorrow. The LDAP will be a piece of work, because from what I saw this morning we haven't updated the LDAP image for quite some time now — we don't have an automated update of the LDAP system. It's not public-facing, because access is restricted by IP; still, we should start by updating to a recent version, check that we don't break anything, and then move to ARM64. That will be the next step, for the next milestone. Any question on that topic? So next: ACP easy, then the LDAP, which needs updating first. The reason I'm worried about the LDAP is that when we did the ARM64 migration for the VPN, the CPU change, by switching the cryptographic libraries, sometimes produced different behaviors that were blocking and required reconfiguration with a much more modern setup. That's not a problem in itself — it means we will enforce modern cryptographic systems — but it takes its toll on the initial configuration effort. Okay.

Stéphane is not there, so: export the download mirrors list to a textual representation. Hervé, would you be okay pairing with Stéphane on that area? Stéphane built part of the reports, and you and I discussed the reporting of build failures on trusted.ci, such as the crawler. Since Stéphane already worked on how to publish files on reports.jenkins.io, the two of you should be able to pair on, first, providing a real API — it will be a JSON file, but well shaped and well named — publishing all of our outbound and inbound IPs, not only the ones from mirrorbits, so we can give our users something properly actionable. Then you could carry the experience learned from that task into using reports.jenkins.io to report and create Datadog alerts about, say, the crawler failing, or another job failing on trusted.ci. Is that okay for you? Yeah, on the reports. So I'll let the two of you coordinate; you absolutely need to pair on that topic. The next step is an API to add outbound IPs, for instance. Any question on that topic? Okay.
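To make "well shaped and well named" concrete, here is a minimal sketch of what such a published file could look like — the schema is purely a proposal for illustration, not an agreed format, and the IPs below are documentation addresses:

```python
# Minimal sketch of a machine-readable IP report for reports.jenkins.io.
# The schema is a hypothetical proposal; 192.0.2.0/24 is the TEST-NET range.
import json
from datetime import datetime, timezone

report = {
    "generated_at": datetime.now(timezone.utc).isoformat(),
    "services": {
        "mirrorbits": {"outbound_ips": ["192.0.2.10"], "inbound_ips": []},
        "ci.jenkins.io": {"outbound_ips": ["192.0.2.20", "192.0.2.21"],
                          "inbound_ips": ["192.0.2.22"]},
    },
}

with open("jenkins-infra-ips.json", "w") as fp:
    json.dump(report, fp, indent=2)
```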
infra.ci.jenkins.io on ARM64. For this one — it was Stéphane's; I hope nothing is wrong with our friend — here's what was done. Kubernetes management is now using an ARM64 agent with the all-in-one image, so the last mile will be to drop any remnant of the docker-helmfile image: one less image to maintain. Next steps: Stéphane started working earlier today on the Terraform jobs, which also use their own image that we maintain, docker-hashicorp-tools. Here we have the same idea. Stéphane is working on the usage in the pipeline library; he now has to try one of the jobs on a draft pull request, and if it runs properly with all the tools, he can merge it, use it everywhere, and then clean up the image. The goal, as a reminder, is for us to only have the all-in-one image, defined as a Packer template, on infra.ci like we do for ci.jenkins.io. So when there is, for instance, an AWS command-line update — which happens multiple times a week — we only have one image to update, the all-in-one Packer image. And the benefit is that we can use ARM64 for the agent in that case. Side note — this also came up for Stéphane while building that generic template: we detected that some of our jobs run their agent on the same virtual machine as the controllers, for infra.ci and release.ci. That should be removed as soon as possible, and moving to ARM64 is a way for us to do it efficiently, because the ARM64 agents run on a different network and a different node pool, hence different virtual machines. In the near future, Stéphane also proposes to create a new cluster on the new subscription, so we should be able to run agents for infra.ci, at least, on the new subscription and stop paying for them on the current one — but more on this later this month. So Stéphane should keep working on this one, for the cost-reduction project. Is there any question? Okay. So: docker-helmfile done; next step, Terraform.

Goss: no action since last week — Stéphane was focused on ARM64. I think it's almost there. I propose not to put it back in the backlog, and since we did all of the heavy lifting, we'll try to help Stéphane on that part. If he's not available, we'd rather have him work with Hervé on the reports and the other tasks; that one is just a few goss tests, and I think the learning curve is gentle enough for anyone on the team to make changes there. Is that okay for you, Hervé? The hidden question being: unless you want to help Stéphane finish this one yourself, to learn more about the goss thing. I don't mind either way; I'll see what time I get this week. Okay.

We have the Chinese website. I believe that was work in progress, but with the holidays no work was achieved on this one. Realistically, do you think you will have time to spend on it this week, or do you want us to move it to the backlog? No, this week is just too busy, and I'm right now the bottleneck on that one as well. Okay, no work to be done on it this week. Good.

And finally, updates.jenkins.io. I'm going to report this time, since Hervé was off. Before Christmas we were waiting for a review by Daniel and the Jenkins security team, but they were busy with holidays and such. Now, as we mentioned with the crawler — improved token management, exploration, and IP restriction, which is a good thing — the service is failing: the rsync pod fails to start, due to a mount permission error. We need to fix this as soon as possible. I detected it while working on the get.jenkins.io mirrorbits-parent charts. It started a few days ago, so it's a recent change, and it looks related to rsync on updates.jenkins.io. I believe it might be related to the non-root permissions of the rsync process while trying to mount and access the Azure file share — to be checked in detail, because that sits deep inside Kubernetes. It could also be a problem with the updates.jenkins.io IP restrictions; still not sure. So we need to work on this one. Most probably we'll have to retrieve logs — if I'm not the one to do it, the goal is to check the logs of the CSI DaemonSet agent running on the same virtual machine where the failing pod is scheduled, because the CSI driver logs why a mount was forbidden for the process. Then we can fix it and move to the next step — which, by the way, delays the performance tests. Is there any question on that topic? I believe the next two weeks will be full for Daniel and the Jenkins security team — is that correct, Mark? Sorry, ask that question again. I believe the Jenkins security team will be fully busy for the two upcoming weeks, is that correct? Yes, right, they're busy. So most probably we will only do tiny things on the new update center, and that delays things globally. Daniel is aware of the cost reduction at stake here — we discussed it last week — so he will do his best, but right now security is far more important than the cost reduction. Exactly: pressuring this thing to arrive prematurely would be bad for the Jenkins project; let's not do that. Thanks.
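The log-retrieval plan mentioned above, sketched in Python for illustration — pod and namespace names are placeholders, and on AKS the Azure File CSI driver typically runs as the csi-azurefile-node DaemonSet in kube-system:

```python
# Minimal sketch of the debugging plan: find the node hosting the failing pod,
# then read the Azure File CSI driver logs from the DaemonSet pod on that same
# node. Names are placeholders; the label and container name assume the usual
# AKS csi-azurefile-node layout.
import subprocess

FAILING_POD, NAMESPACE = "updates-rsyncd-0", "updates-jenkins-io"  # placeholders

def kubectl(*args: str) -> str:
    return subprocess.run(["kubectl", *args], check=True,
                          capture_output=True, text=True).stdout.strip()

node = kubectl("get", "pod", FAILING_POD, "-n", NAMESPACE,
               "-o", "jsonpath={.spec.nodeName}")
csi_pod = kubectl("get", "pods", "-n", "kube-system",
                  "-l", "app=csi-azurefile-node",
                  "--field-selector", f"spec.nodeName={node}",
                  "-o", "jsonpath={.items[0].metadata.name}")
print(kubectl("logs", csi_pod, "-n", "kube-system", "-c", "azurefile"))
```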
So that's all for the tasks. Let's see if we have issues to triage — we don't have new issues to triage. Oh, eventually the login page one — is that problem another topic you want to add to the backlog or the milestone, or do you want to discuss it as part of this meeting? No, no — sorry. Cool, then thanks for the work, folks. I'm going to stop the screen share and stop the recording. So, see you next week, for the people following us on the recording. Bye bye!