Hello everyone, welcome to the Jenkins Infrastructure Weekly Team Meeting. Today is the 5th of July 2022. So today we have yours truly; Hervé is not here, and Mark might or might not be. Stéphane is here. Mark just arrived. OK, those are the attendees for today, let's get started.

First of all, the weekly release: the release is currently on hold, as far as I know. The reason is that a regression has been detected, and right now there is an attempt to fix it. The regression is mainly related to WebSockets, which seem to be not working, or at least not behaving as expected, with the latest Jetty upgrade. I just realized I forgot to share the link to the notes, so let me do that right now. So it's on hold; Tim and Daniel are aware of it and have blocked the release pipeline, to be sure nothing ships even if it's triggered automatically. It's a great exercise for us to see how we could better handle a security release in the future, when we want to disable the weekly release; that makes a good manual test, let's say. Nothing is expected from the infra team, this was just an FYI. The release team should handle it; if they ask, don't hesitate to help them, but it should be OK. We didn't change anything on the infra side, as far as I remember. No other announcements on my side; I don't know if any of you has any.

OK, so let's start with the tasks we were able to close and finish last week. We had three GitHub permission requests. I handled one, and the others were handled by either Mark or Tim. One password reset, thanks Mark for handling that: a user needed to reset the password on their account, so we tried to gain enough trust in the user, with different proofs, before doing the reset, to be sure there isn't someone taking the account over.
So on that, actually, I've got a technique I've been using that I'm not sure if others want to use, or if I should document it. I've typically found what their GitHub account was by guessing, reverse engineering, etc., and then asked them to fork a copy of a Jenkins repo into that GitHub account so that I can see the fork. It's their way of proving they have control of it, and then they just delete the fork. The idea is: I now know this person sent the email, received my message that said "fork a copy of the client plugin", and forked it. I can see publicly that they did the fork, and now I know they have control of that github.com account without having to ask them further questions. Nice. It doesn't work if we can't associate their Jenkins account with a GitHub account, so it's imperfect. But the "fork this repository" step is something that lets them say: OK, I can prove to you I have control of this account, and I don't have to do anything damaging, just fork a Jenkins repo to prove I have access to this GitHub account. Nice one, it's still useful. Can I ask you to see if you can write this down in the runbook? Even quickly drafted content will help, and it's written in the notes anyway. Excellent; that will be useful if any of us has a question. Any questions on the three GitHub permissions or the password reset, which are usual maintenance? OK.

Then we had "May JVM statistics not available", opened by Basil. It sounds like it was an issue in the statistics calculation. Andrew looked into the issue, it seems to have been fixed, and Basil updated the documentation of the Jenkins status repository. So no action expected from us; thanks Hervé, and thanks Andrew, for the help.

We helped Adrien Lecharpentier set up a pipeline for the plugin-health-scoring repository, a Google Summer of Code project that Adrien is mentoring. Sounds like he's happy.
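The ownership check Mark described can be turned into a repeatable runbook snippet. A minimal sketch, where the repository name is illustrative (any jenkinsci repo works) and the `gh` command is just one possible way to confirm the fork:

```shell
# Print the verification steps for a claimed github.com account.
# "plain-credentials" is an illustrative repo name, not the one Mark used.
fork_proof_steps() {
  user="$1"
  repo="plain-credentials"
  echo "1. Email the requester: fork https://github.com/jenkinsci/${repo}"
  echo "2. Confirm the fork exists at https://github.com/${user}/${repo}"
  echo "   (e.g. 'gh repo view ${user}/${repo}' with the GitHub CLI)"
  echo "3. Tell them they can delete the fork afterwards"
}
```

For example, `fork_proof_steps jdoe` prints the three steps for the account `jdoe`. The point of the design is that forking is public, harmless, and only possible for someone who controls the account.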
His need was quite classical: generating a jar file and running a Docker container with Testcontainers modules directly from Maven. We gave him documentation and technical links, pointed out some technical elements, and Hervé configured the repository successfully on ci.jenkins.io. So yes, we have all the details; it sounds like he's happy and he closed the issue, so nothing to do here. Thanks for the work on that one, folks.

Finally, the first big one we completed: Stéphane and I worked on redirecting the legacy domain, pkg.jenkins-ci.org, to pkg.jenkins.io. We had a lot of issues last Friday, including outages of a few minutes, which we were able to fix. The summary is that when we have a website behind Fastly, we have to ensure that the Host header of the request Fastly sends to the backend is the same as the domain Fastly uses to connect over HTTPS to that backend. In this case, end users were reaching Fastly with pkg.jenkins-ci.org, while the Fastly service was configured to cache the origin pkg.origin.jenkins.io. By keeping the original user's Host header, Apache was cutting the connection and ending in a terrible HTTP 421 error. That's the root cause; after that there was a chain of events and changes we made. The good thing is that now Stéphane and I understand the pkg.jenkins.io file system and Apache configuration much better, and it was a great opportunity to clean up and continue working on the Puppet code.

One of the consequences, it's not written there, is that we successfully enabled Vagrant-based manual acceptance testing for the Puppet role. It was using VirtualBox, which made it impossible for Tim or Stéphane to use on their ARM Macs, and it was also impossible to run VirtualBox machines on the CI. Now it uses Vagrant with Docker, which gives us a lot of new use cases. It's clearly powerful, and it works flawlessly on both Intel and ARM.
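The Host-header fix described above can be sketched in Fastly's VCL. This is a hedged illustration, not the actual service configuration: the origin name is the one mentioned in the discussion, and the same effect can also be achieved with Fastly's "override host" setting on the origin instead of custom VCL.

```vcl
sub vcl_miss {
  # Rewrite the Host header to the origin's own domain so that it matches
  # the SNI / certificate name Fastly uses for the TLS connection to the
  # backend. Without this, Apache sees a host it does not serve over this
  # TLS connection and answers HTTP 421 Misdirected Request.
  set bereq.http.Host = "pkg.origin.jenkins.io";
#FASTLY miss
}
```

The end-user-facing domain (pkg.jenkins.io) stays on the request Fastly receives; only the backend-facing request is rewritten.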
So we can start adding a bunch of infrastructure automation tests. So thanks folks, that's a nice consequence of this one. Congratulations, you did all the work. Well, thanks for the help, because without your help, Stéphane, instead of only one day I would have spent the whole weekend on that. The key takeaways: the Host header and SNI must be the same, and this allowed us to improve the Puppet testing with Vagrant plus Docker. That's the road for cleaning up Puppet, especially with the work Stéphane is doing on updates.jenkins.io. That's all on the fully closed tasks. Any questions?

Now, work in progress. First one: the Kubernetes upgrade. It has been done, operationally speaking; let me move that one to the top. Great work from Stéphane and Hervé, who handled these upgrades on the four clusters during the week. I'm trying to take notes: what is left to do before closing the issue, Stéphane? The post-mortem, kind of; I don't know what you call it when there is no... well, there was an issue. I wrote a draft and we have to check together how we can improve that draft, because my English is not good enough. OK, done: post-mortem. So can you summarize the issue we had, the main issue? Yes, we ran into something. We had to back everything up in case we had a problem, and that took us some time. But the main problem was on AKS, and AKS only, on Azure Kubernetes: persistent volumes based on Azure File. Azure File is like Amazon S3: it's not a block device, it's a file bucket. Azure changed the implementation to CSI, and the thing is that despite the changelog telling us it should be transparent, each time Kubernetes mounts this bucket it generates a temporary token, named a SAS token, signed with a private key. That token changes on each mount and remount. It's a transparent token that no user should ever handle; it's a purely technical item.
And by whatever bug, a combination of bugs in Kubernetes itself and in AKS, that secret was always created in the default namespace if you did not upgrade your persistent volume to the new CSI driver. There was a directive you could change to say "create it in this namespace", where it should have worked, except that a persistent volume definition in YAML, once created on AKS, can never be changed. So we had to delete the persistent volume while keeping the bucket, and create a new persistent volume, with the new hotfix, pointing to the old data. Yeah, the main problem was the fear of losing everything, so we took the time to back everything up to make sure that if something went wrong, we still had a backup. That created a one-hour outage on the LDAP and on the mirrored download system, get.jenkins.io included. So that was quite the impact.

So the post-mortem should capture some improvements to the procedure for next time. We can close that issue once the Kubernetes 1.23 upgrade issues are written by Stéphane and Hervé, based on the previous ones, with the improvements and the changes to the procedure, because we need some enhancements and some things are outdated. So we also need to write the next upgrade issue based on what we learned. And finally, as a consequence of the post-mortem, though I can already say it now: we need to migrate the AKS persistent volumes to the new CSI driver. Hervé was dealing with that, and I think Stéphane has a global idea of it. That was the last big task we need, so great job folks, almost there. We'll close the issue once this is finished. Let's wait for Hervé to be back from holidays; I propose we keep that issue for the next milestone. Is that OK for you? Yes.

Next one: consider removing the embeddable-build-status plugin. That's a plugin that we've been asked to remove.
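Going back to the AKS fix for a moment, the delete-and-recreate step can be sketched as a static PersistentVolume pointing at the existing Azure File share through the CSI driver. Everything here is illustrative (names, sizes, secret, resource group); it is a sketch of the technique, not the actual manifest used:

```yaml
# Illustrative PersistentVolume re-created against the same Azure File share,
# this time through the CSI driver. The old PV object is deleted first, but
# with a Retain policy the share and its data survive.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: ldap-data                         # hypothetical name
spec:
  capacity:
    storage: 10Gi
  accessModes: [ReadWriteMany]
  persistentVolumeReclaimPolicy: Retain   # never delete the underlying share
  csi:
    driver: file.csi.azure.com
    volumeHandle: ldap-data-handle        # must be cluster-unique; illustrative
    volumeAttributes:
      resourceGroup: my-resource-group    # hypothetical
      shareName: ldap-data                # the existing share with the old data
    nodeStageSecretRef:
      name: azure-storage-secret          # hypothetical; note it lives in an
      namespace: jenkins-infra            # explicit namespace, not "default"
```

Pinning the secret reference to an explicit namespace is the part that works around the "secret lands in default" behaviour described above.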
The thing is that it might break the README or documentation of some plugin users. Someone checked, I think Mark: we have 169 plugins using it. So, as a courtesy to these users, we will remove it from ci.jenkins.io, the public-facing instance, only once we have opened a batch of pull requests on these documents. Hervé was working on that part before the long weekend. He found a tool that allows us to batch-open pull requests against a given set of repositories, and he has another tool that does a kind of regex search, but simpler; a sort of super-sed, but easier to use. So now he has to assemble these tools. We had a long discussion; we have a lot of tools that could help in that area. So right now that one should be moved to the next iteration. The goal is to remove the plugin from all of our Jenkins instances; we're almost there on ci.jenkins.io. So I move that, and Hervé will take care of it when he's back. Any questions? No. Folks, I need help taking notes, please. I'll try. At least preparing the next issues will help me a lot, so I can just comment on them. I tried to use Hervé's script but it didn't work on my machine, so I need to ask him to send me his machine. Still ci.jenkins.org? Only ci.jenkins.io is left. Can I ask you to add the links to the next issues, please, while I'm trying to take notes? Hervé is working on the batch pull requests as a courtesy to contributors. But you already moved that to the next milestone. Right now we are covering the issues that are part of the July 5th milestone, taking them one after the other. I have put two bullets under work in progress, and we have one, two, three... nine in the... oh, OK, sorry, I understood, no problem. So I'm taking the third; can I let you start on the fourth, please?

So: enable development integration in Jira. Nothing was done, so we need to work on that.
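Coming back to the batch pull requests for the badge removal: the assembly Hervé is after could look like this dry-run sketch. The `gh pr create` invocation, the titles, and the repo list are illustrative assumptions; emitting the commands instead of running them makes the batch reviewable before anything touches 169 repositories.

```shell
# Sketch: turn a list of plugin repos (one per line on stdin) into the
# gh-CLI commands that would open the same PR against each of them.
# Nothing is executed; the output is meant to be reviewed first.
gen_pr_commands() {
  while read -r repo; do
    echo "gh pr create --repo jenkinsci/${repo}" \
         "--title 'Remove embeddable-build-status badge'" \
         "--body 'ci.jenkins.io will stop serving this badge'"
  done
}
```

Usage would be something like `gen_pr_commands < repos.txt | sh` once the generated commands have been checked by hand.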
Alas, the infra team has a dependency on the administrators of the jenkinsci organization. So I think we will ask Tim and Daniel for help, to see if they can start this with the help of James, because I understand the requirements, but we need a Jira admin and a jenkinsci org admin, which we are not for the second point. Any questions on this one? I'm moving it to the next milestone.

Evaluate retry conditions to improve the stability of builds. That means installing the experimental pipeline plugins on ci.jenkins.io and seeing if it improves the current flakiness; see the issue for details. We've been delaying that one for two weeks. It's OK to install the plugins, but we need to ask Jesse first. We're waiting until after the weekly release this week, so we can proceed either later today or tomorrow, or Thursday if Jesse is out. [In French] The goal is to automatically retry the builds when the agent goes away. Is it related, not to the latter, but to the former issue, the ci.jenkins.io agents being very flaky? Yes, there is a relation.

[In French] About PowerShell on the agents: it is in production for the virtual machines, and Hervé did the same on the Docker images. So: VMs are done, but the Windows container images are still to do. As a reminder, that task is to make sure end users don't have to choose between the two keywords in their pipelines on Windows environments, because one is the old Windows PowerShell and the other is the recent PowerShell. Some machines have both, some machines have only one of the two, so the goal is to provide an alias or a solution on each. Hervé is working on that one.
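The two keywords in question are the `powershell` and `pwsh` pipeline steps. A hedged Jenkinsfile illustration of the choice the task wants to remove (the agent label and commands are hypothetical):

```groovy
pipeline {
  agent { label 'windows' }   // hypothetical label
  stages {
    stage('Build') {
      steps {
        // Old Windows PowerShell; present on most Windows agents:
        powershell 'Get-Host | Select-Object Version'
        // Recent cross-platform PowerShell; fails if pwsh is not installed:
        pwsh 'Get-Host | Select-Object Version'
      }
    }
  }
}
```

Today a pipeline author has to know which generation a given agent image ships; an alias or shim on each image would make either step work everywhere.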
Require Java 11 infrastructure thread. For this one, nothing to say; I will keep moving it forward because we still have some work. It should exist until we have the Jenkins LTS that drops support for JDK 8. So far no one has complained, as far as I know, about last week's release, which is the first one that drops JDK 8 support. Or maybe the people who didn't read the changelog, but there isn't anything we can do about that. No expectation from the team for now; let's keep an eye on this one. Sorry, which one? Require Java 11 infrastructure. OK, so I don't put that on the next milestone, I put it on "infrastructure next". We can remove the milestone, clear it from this milestone. Yeah, I will; if there is anything we have to do related to that topic, we will move it to the current milestone at that time. Otherwise we don't need a milestone, because no action is required from us. Good catch.

Migrate updates.jenkins.io to another cloud. Stéphane, your turn, I'll take notes. Oh, that's a very nice one. I'm working on the Terraform part, and yesterday and today I was bumping my head on the SSH access. I thought it was a firewall rule definition problem; in fact, no, it was a gateway, and a missing rule for the gateway. It's been working for about an hour: launching, spawning the VM, and I can access it over SSH. We still need to clean it up a little and to make sure we get all the right users set up within the VM, but it's going forward. Not easy, but we can scratch that and jump to the next step. So congrats to Stéphane, because, as I told Mark, we now know that what you don't pay in bandwidth with Oracle's infrastructure, you pay in trying to understand the documentation of the API. Exactly. Just a minute, I have an invited guest... sorry, they don't feel confident entering the meeting alone. There. OK. So great job, Stéphane. It sounds really good, so we should be able to start the Puppet part of the production setup. More Puppet work for you. Yes.
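The gateway-versus-firewall pitfall Stéphane hit can be sketched in Terraform against OCI. All names and variables here are illustrative assumptions; the point is that a security rule alone is not enough, the subnet's route table also needs a rule sending traffic through an internet gateway before SSH can reach the VM from outside.

```hcl
# Illustrative sketch: SSH reachability on OCI needs the route, not just
# the security list. Variable and resource names are hypothetical.
resource "oci_core_internet_gateway" "updates" {
  compartment_id = var.compartment_id
  vcn_id         = oci_core_vcn.updates.id
}

resource "oci_core_route_table" "updates" {
  compartment_id = var.compartment_id
  vcn_id         = oci_core_vcn.updates.id

  # Without this rule the VM can exist, have a public IP, and still be
  # unreachable, which looks exactly like a firewall problem.
  route_rules {
    destination       = "0.0.0.0/0"
    destination_type  = "CIDR_BLOCK"
    network_entity_id = oci_core_internet_gateway.updates.id
  }
}
```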
So continuing, next one, still the priority. We might have to play ping-pong on this one: sometimes if you are blocked I will take some part, and when I'm blocked you will. We'll try as much as possible to share knowledge while having you working on that part. The goal was full autonomy on the Terraform infra part, but for the Puppet part we might need to do it between the two of us, because of the priority: we really need to move this forward to unblock other topics. No problem. And I can't expect you to be a Terraform expert and a Puppet expert and a local expert all at the same time. It's only that this one is required because of the cost. No problem. Any questions? OK.

Next topic, Docker rate limiting. I didn't have time to do anything on it, so: next milestone. Yes, next milestone. It's on the portal, nothing done yet.

Replacing... oh, I forgot to ask Mark for a status, so we move it automatically; I pinged Mark after last week's meeting. The CPUZ agent is still not available. It's not really a problem, but we need Mark to either share the SSH key with us so we can take it over, or we need another solution, maybe something Puppet automates. Nothing blocking. So I will copy and paste the item and mention Mark instead of me. OK.

Next: the ci.jenkins.io tests are very flaky, so we need to start working on this one, along with the "502 proxy error when accessing pull requests" issue on ci.jenkins.io. Both issues are on ci.jenkins.io, the public instance, the big one. First, the flakiness of agents. We have a set of different little items; it sounds like most of this flakiness is as simple as that. The thing is that these problems happened around the Kubernetes upgrade: we have a new Kubernetes version, so that might change things on the networking side.
And we also had the updates and other big changes, so that could have been a cause. However, Joseph Petersen opened some issues just yesterday, so it looks like we have agents that cut the connection, and all of these agents are on Kubernetes. [In French] We may need some help in that area, but many of the failures are tied to the BOM builds, which spawn up to 180 executors while we only have about 150 available. So I will try to write up the different elements on this, but right now, yes, it's quite annoying.

So, what solutions do we have right now? First of all, Jenkins is waiting a lot while trying to create virtual machines that have a public IP (which we had set up because of Docker rate limiting), but we reached the limits. So we have to update the configuration to not allocate a public IP for the VM agents that could be used for some of the BOM or acceptance-test builds. That should be an improvement: less time with the orchestrator blocking threads.

Secondly, we need to restart the metrics-and-traces subject on ci.jenkins.io. We need someone in the upcoming weeks to restart the discussion with Elastic, so we can create an instance of their OpenTelemetry SaaS offering, connect ci.jenkins.io to it, and get traces of the builds, so we can see what is happening in terms of timing. Sorry, can't the Datadog plugin tell us that? That's different, that's a different topic; we need OpenTelemetry. Datadog can do OpenTelemetry too, but the thing is that Elastic is proposing to sponsor us, and Datadog already sponsors us. So we would want to use Datadog for the private instances, such as infra.ci or release.ci, so we get traces from our own work, because Datadog is private by default, with only our access. And for the public part, Elastic can provide a public dashboard. So the idea is to split.
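On the Jenkins side, wiring a controller to an OpenTelemetry backend is typically done with the Jenkins OpenTelemetry plugin. A hedged configuration-as-code sketch with an illustrative endpoint; the exact attribute names depend on the plugin version, so treat this as a shape, not a reference:

```yaml
unclassified:
  openTelemetry:
    endpoint: "otel.example.com:4317"   # illustrative OTLP endpoint
    serviceName: "ci.jenkins.io"
```

The same mechanism would work whether the collector behind the endpoint is Elastic's offering for the public instance or Datadog's for the private ones; only the endpoint and credentials change.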
And so with split roles, we have a comparison, and we use both. Perfect. But yes, Datadog also has such a service, and the question was which one to choose. Let's say both; but we need at least one. Then we have what we said earlier, Jesse's plugin work that will allow builds to retry, because the reason Joseph raised his issues initially is that this flakiness forces us to trigger new builds again. There are some discussions also about the BOM builds. And we have the subject of Hervé growing the partnership with DigitalOcean, which would give us clearly way more container capacity there. With all these small elements, that should keep us busy for the next weeks.

Then, the thing is, if we aren't able to solve this flakiness, because it's hard to understand what's going on, then we will have to ask for help from Daniel or Basil or any expert: give them access, someone who is a developer of the core and can help us go in that direction. So let's continue for the upcoming two weeks, because there are two elements on ci.jenkins.io that are really hard to understand from an infrastructure perspective. We really don't have enough observability there, so let me aim, for the next two weeks, at a set of minor changes to understand what is going on, because that request-aborted exception, I don't even know. We need better observability: improved traces, and aggregating pod logs. The thing is that collecting all these logs could be really huge; Datadog might be useful there, or Elastic again. But we need a collection of these logs to see what happened on the agent side. I have no idea how to deal with that thing; I mean, that instance is almost unmanageable. I would strongly suggest creating a ci.jenkins.io from scratch on Kubernetes, because we would benefit from full configuration as code and improved manageability, without needing Puppet and without having to split our configuration systems.
And then we could benefit from all the Kubernetes log collection and all these opportunities, which we don't have today. So let's see. Maybe we could have a BOM-specific controller. Which could be on Kubernetes? Yes, also. Since we have the weekly, that could be a way to move around that. That's a good point, that's a really good idea. So let's see; we'll try to discuss that in the coming weeks. Let's all keep an eye on this, but yeah, ci.jenkins.io might be...

There is also that 502 issue; we haven't been able to work on it yet. When you go, on ci.jenkins.io, to the Jenkins core job and you try to open the pull requests tab, you end up on a 502, because it takes so much time to list the pull requests. I don't know why. We have to check what Daniel checked, but we reach the timeout between the front-end Apache and the backend Jenkins. If you need to see it as an administrator, you can open an SSH tunnel to the CI VM, and then you can see it; you just have to wait longer. So that's one to add to the next upcoming milestone, and we will ask for help. Many thanks. Any questions on these topics? No. So we'll try to start working on that. I would prefer you folks to finish the Kubernetes documentation and the work on updates.jenkins.io first, and then we'll see if we can take on the ci.jenkins.io one.

We still have the infra team sync next. Just one check to see if we did not miss anything. No, nothing missed. So, still the same important but not prioritized tasks. Any questions? We've got a lot of work; that's not a question. Yep, we always have. OK, so in terms of milestones, I can close this week's. Many thanks for the help managing the issues. Oh yeah, sorry, yep, that's that one. So I'm updating the notes. And let's go, keep moving forward. I'm stopping sharing my screen. Do you have any last questions, for the recording or for posterity? No? So see you next week. I'm stopping the recording.