Hi everyone, welcome to the Jenkins Weekly Infrastructure Meeting. Today we are the 15th of November, 2022. Round table: we have myself, Damien Duportal, Hervé Le Meur, Mark Waite, Stéphane Merle. We have Kevin Martins. Kevin, is there one or two E in your name? I'm not sure. Just the one, like that. Yep. Cool. And we have Bruno Verachten. OK, six bullets. Let's get started with announcements. So today's weekly release is currently being finished. I started to check the container build, but I saw on jenkins.io that the packages and the WAR were already available. So I assume we just have a few release items to be checked. I didn't see any issue on that weekly, as usual. But yeah, almost there: packages, WAR and Docker image. OK, last changelog item, release item to be checked later. Reminder that ci.jenkins.io is still offline: there is a security advisory currently being processed. I haven't seen any issue on the infra side as far as I can tell, but we might have some; they just didn't have time to share them with us. So I don't know, did you hear anything from the security team? So I assume everything is going so far so good. Next weekly is next week, right? So that should be the 22nd of November. Next LTS is the 13th of December, if I'm correct. You are correct. Next security release. Now the next major event, I assume, is FOSDEM. Just a reminder that there is a CI/CD room, and if you want to speak about Jenkins or CI/CD or whatever you want, please submit your subjects. I assume that some of us will be able to meet physically there. So do not hesitate, if you have questions or if you want to meet the whole infrastructure team and friends, please come to Brussels the first week of February. I do plan to come. Do you have other announcements or notes before we get started? OK, so let's go. So thanks, Hervé, for reviewing Azure resources periodically.
So can I let you explain why I asked you to mention that in today's meeting, Hervé? So I was looking at the different resource groups in the Azure portal, and I've noticed, for example, the support-confluence one, with two databases we are not using anymore since we don't host Confluence. I'd like to do a session periodically where we collectively review the resource groups and what's in them. I don't know what the period should be. That's a good point. So the goal is to be sure that if we have items that are not used, we don't pay for cloud resources for them. And if these elements need to be archived, that they are archived, such as the Confluence database, for instance: creating a database dump that we encrypt with a password stored in 1Password. And on the procedure, that's absolutely unknown to me: do we have a secure location where we can put backups and archives? I don't think we have such a thing. Because, I mean, if we want to secure things, we need a location that is not easy to reach, and even if it's reached, only by a few persons. Sorry? Yes, but that's the easiest part. Because, I mean, FTP, object storage, whatever: but you need to put strict access control there, to ensure that no one is able to access it except just a few people. And how do you manage, once someone has access, that these people won't still have access if the project says: OK, they don't contribute anymore? Or if the people are leaving the project? That's the hardest part. And that will be the same for the backups. Because if the backup is easy to reach, then someone won't have to search or to attack the platform: they just have to find the backup and get the data from there. I mean, it will clearly be easier to break one location than trying to break all the locations. So I assume that's why we don't have a centralized process. I think it's worth asking the board whether we should have an archive location, a secured one and a public one, I'm not sure.
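As a sketch of the archiving idea discussed above, assuming a PostgreSQL-style dump and `openssl` for symmetric encryption (the actual database tool, file names and passphrase handling via 1Password are all up to the team, and are invented here), the encrypt-then-store flow could look like:

```shell
# Hypothetical sketch: encrypt a database dump with a passphrase that would
# be stored in 1Password. In reality the dump would come from something like
# `pg_dump confluence`; here we fake one so the steps are self-contained.
ARCHIVE_PASSPHRASE="example-passphrase-from-1password"
echo "-- fake confluence dump --" > confluence-archive.sql

# Symmetric AES-256 encryption; -pbkdf2 derives the key from the passphrase
openssl enc -aes-256-cbc -pbkdf2 -in confluence-archive.sql \
  -out confluence-archive.sql.enc -pass pass:"$ARCHIVE_PASSPHRASE"

# Restoring later only needs the passphrase back from 1Password
openssl enc -d -aes-256-cbc -pbkdf2 -in confluence-archive.sql.enc \
  -pass pass:"$ARCHIVE_PASSPHRASE" -out restored.sql
diff confluence-archive.sql restored.sql
```

With this shape, the encrypted file can sit in relatively ordinary object storage, since reading it requires the passphrase, which narrows the access-control problem down to the 1Password vault.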
So the board did just have a discussion in yesterday's board meeting about archiving. And the discussion there was around the meeting notes archive. The point was: hey, the notes document for the governance meeting is too long right now, it's slow to load, we need a more authoritative place, and preferably a place where those notes are stored publicly, in GitHub. So the topic is there. It wasn't a discussion of anything private or of a sensitive nature. So that would need further discussion, but the topic of archiving has at least been briefly brought to the board yesterday. Archiving sensitive data: because I would quote some security people who say it's better to have no backup than a backup with sensitive data open to everyone. Right, right, absolutely. So about the reviewing, there are still things that can be deleted or archived. That's a good idea. I propose that we start monthly. Could I ask you, Hervé, to add two team meetings, one for November and one for December? So we can spend one hour, at least Stéphane, Hervé and I, checking what you discovered that could be deleted. Is it OK for you, Hervé? The monthly period is absolutely arbitrary: if we see that we need to do it more frequently, we will, otherwise we can keep it monthly. Any question? Let's start monthly, and ask Hervé to create the first meetings in the calendar. OK, so unless someone has a question, I'm moving to the closed items. So, which tasks did we finish completely since last meeting? We have one, Hervé. I closed an issue right after you exported the notes, so I will take care of updating them. So that was an issue; I'm taking them in the same order as on GitHub, on the right. So: where are the old project meeting logs and minutes? That one was opened by Daniel.
It seems that for quite some time now, the web service meetings.jenkins-ci.org has been unavailable. It was a web server that used to be hosted on a virtual machine named Edamame: a simple Apache web server serving a set of notes from past governance meetings. These notes were extracted from IRC and from whatever system by a tool named Robobutler, a kind of bot that was able to publish these notes. This was the case until 2019, when the governance meeting notes shifted to Google Docs (reference to what Mark said earlier). But these meeting notes were published publicly, and we had links referencing them from jenkins.io. So we diagnosed that the service had been down for months. I honestly don't know if I disabled it, if it was Olivier, or if it predates Olivier, I'm not sure. We had some leftovers of the service configuration in Puppet that need to be cleaned up even today, but it was absolutely empty on the machine. So we had to retrieve all the notes using the Wayback Machine and Daniel's archives. Sounds like it's OK. So following Daniel's proposal, we created a repository, since it's public knowledge, and that repository is named governance meetings, under the jenkins-infra organization. It has a docs directory, and everything inside that docs directory is served by a GitHub Pages web server. And we have updated the CNAME of meetings.jenkins-ci.org, which used to point to the virtual machine: now it points to the GitHub Pages instance. The result is that, with the help of Hervé, we were able to verify that we now have all the links. OK, there were some missing that have been fixed in the past hours. So all the links to meetings.jenkins-ci.org should now be OK, and even the ones pointing to the root: we have regenerated Apache-style listing pages on GitHub Pages with a tool named apindex. So we should be OK. If there is any missing page from a link or something you know, please open an issue.
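The DNS side of the migration described above boils down to a single record change; a sketch of what the zone entry could look like (the GitHub Pages target host is a hypothetical example, not confirmed from the source):

```
; Before: the old virtual machine
; meetings.jenkins-ci.org.  A      <Edamame VM IP>

; After: delegate to GitHub Pages, which serves the docs directory
meetings.jenkins-ci.org.  CNAME  jenkins-infra.github.io.
```

GitHub Pages then matches the incoming Host header against the repository's configured custom domain, which is why only the CNAME needs to change when the backing VM goes away.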
And of course, if you see weird pages in the contents that we imported, we might have forgotten some elements added by the Wayback Machine, for sure. So if you see anything there, please open an issue or propose a pull request fixing it. The content of that repository should only change to be fixed: there should be no new content there, unless Mark, the governance board, decides to also archive some documents; they can put the documents directly there. The goal is to have this frozen for the future. Yeah, I think that at least Gavin's recommendation for governance was to consider following the same pattern that the UX SIG uses, of creating a repository for the governance board, and it would include notes for governance and other things related to governance. So a single repository each. Any objections from the infra team to that technique, if we were to add a new repository to jenkins-ci for governance? No, that's OK. OK, great. Makes sense. Any question? So thanks for the help, then. Node.js missing on agents. So, moving on: all the frontend-like projects, everything building frontend web, need Node.js and eventually Ruby for building websites on the Jenkins infrastructure: plugins, websites, stories, the Jenkins CI website and many more. And after a change on the agent template, trying to switch to our all-in-one image like Stéphane did for the Java builds a few weeks ago, when I tried to do the same for the Node/Ruby image, I did the first step on one of the latest images, and there were issues with Node, because that new image features asdf, which is a kind of universal package manager that supports multiple versions of each tool: you can have different versions of Node.js or Terraform or Ruby or whatever. The setup of these agents wasn't correct on ci.jenkins.io and we had to fix some elements.
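For context, the asdf tool mentioned above pins one version per tool through a `.tool-versions` file at the root of a project or image; a minimal sketch (these version numbers are illustrative, not the ones actually used on the agents):

```
# .tool-versions (read by `asdf install`, which fetches every listed version)
nodejs 18.12.1
ruby 3.1.2
terraform 1.3.4
```

Running `asdf install` in that directory installs all the listed versions, and asdf's shim scripts then select the right binary per directory, which is how a single all-in-one agent image can serve several toolchains at once.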
The main weird things, just in case: at least for the Kubernetes plugin on Jenkins, when you specify a dollar variable in a path, such as $HOME, it is not interpolated, which makes sense, because you don't want Groovy or Jenkins to interpolate the variable, you only want the shell to do it. However, the plugin literally writes $HOME in the passed value, which is never interpolated. So that one doesn't work. And the weird thing is that the POSIX command, `command -v`, always works with the bash tilde notation, while the `which` command, which is not POSIX, doesn't know how to deal with the tilde notation. So POSIX is better at reading non-POSIX things than the non-POSIX tool. That's absolutely mind-blowing. So if you want both `command -v` and `which` to work to locate a binary in one of the directories, you must use absolute paths, or have a way to update your PATH dynamically, which is not something you can do with a Jenkins agent or a Jenkins pipeline. You cannot rely on that, because it's non-interactive, by default it's sh, not bash, and it can be through SSH or not. So there are too many hacks: don't try it, and specify absolute paths in PATH. That one was tricky, but the build is successful. So that's one step forward before generalizing our new images with asdf. Thanks, Alex, for pointing that out, and Hervé for the help. Stéphane, a word about the LTS update from yesterday: surprise LTS delivery. No, everything works fine. We did the upgrade. We had to restart twice for the CI to handle the startup correctly, but with no real cause: we didn't really understand why, but we did update CI, trusted and shared. Yes. Thanks a lot for taking care of that. We had an issue with Maven 11 agents not spawning on ci.jenkins.io that happened last Thursday. So it appears that we have set up limits on the Kubernetes plugin saying: for the DigitalOcean cloud, no more than 30 pods at the same time, and 120 on Amazon. Summary: don't trust the maximum instance count from any agent plugin on Jenkins.
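To illustrate the `command -v` point above, a small sketch (the `node` lookup is just an example of a tool that may or may not be on an agent):

```shell
#!/bin/sh
# `command -v` is the POSIX shell builtin for locating a tool on the PATH;
# `which` is an external, non-POSIX program with less predictable behavior,
# so on non-interactive sh agents the builtin is the safer choice.
locate_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1 found at $(command -v "$1")"
  else
    echo "$1 not on PATH: use an absolute path in the pipeline instead"
  fi
}

locate_tool sh      # always present on POSIX systems
locate_tool node    # may or may not be installed on the agent
```

The fallback branch reflects the conclusion above: when lookup behavior differs between tools and shells, hard-coding absolute paths in the agent template is the only portable option.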
They cannot be reliably used, because you are in a multi-threaded environment and most of the plugins are bad at updating that shared counter across the multiple threads. So if suddenly you have a big load of "I want a pod agent, please", let's say three or four hundred at a time, the system starts to panic and shows really bad behavior. I would also say that using multiple Kubernetes clouds to spread the load between all of them looks like an edge case for the Kubernetes plugin. So the combination of both made it not work as expected. We had help from Jesse Glick, many thanks Jesse, who pointed out that maybe we could also try to add Kubernetes resource quotas, which Stéphane did: he applied quotas. So quotas are per namespace. They can be done on CPU or memory, hard or soft, but in that case we used the more recent one, which is the maximum number of pods running in a given namespace, which maps exactly to what we tell Jenkins. So by enforcing the limit on both sides, it works as expected now. What happens is that Jenkins still sometimes tries to schedule more pods than it should, but Kubernetes refuses the pod creation. So Jenkins then falls back to another Kubernetes cloud, which is what we want; otherwise it waits in the queue. So the problem was fixed and it looks really good. That was also a good thing. Thanks Hervé, thanks James Nord and thanks Stéphane for pushing on the Datadog subject. Because now we're able to produce really nice graphs such as this one, where you can see the dark blue is DigitalOcean and the light blue is AWS. So you see that it decreases following the operation by Stéphane: Stéphane added the quota and we triggered a bunch of builds. Now, as you can see, we have around 120 pods at the same time for AWS and around 30 for our friends at DigitalOcean. Just to show you what happened before: as you can see, the blue one here was DigitalOcean last week when the problem appeared, and the red is AWS.
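The per-namespace pod cap described above can be expressed as a standard Kubernetes `ResourceQuota` object; a sketch (the namespace and object names are hypothetical, while the count of 30 matches what was said for the DigitalOcean cluster):

```
apiVersion: v1
kind: ResourceQuota
metadata:
  name: jenkins-agents-pod-cap
  namespace: jenkins-agents    # hypothetical namespace name
spec:
  hard:
    pods: "30"                 # hard cap on concurrently running pods
```

Once applied with `kubectl apply -f`, any pod creation beyond the cap is rejected by the API server itself, which is exactly what lets the Kubernetes plugin notice the refusal and fall back to another cloud instead of over-scheduling.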
Now that we have fixed the issue with resource quotas, that's good news: we don't have to think about a more complex solution. I'm happy with the outcome. So thanks everyone involved in that one, it was quite tricky. Any question on this one? So the quota level that you set, just to be sure I've understood: the quota you've set is at the Kubernetes cluster level, where you say, look, don't allocate more than this. And then if Jenkins asks for more than that, the cluster APIs correctly say: no, I refuse. Exactly. OK, and the Kubernetes plugin deals with that. It says: oh, my request has been refused, OK, so I'm going to search for another candidate to schedule on, which is another Kubernetes cloud that might have some capacity left. And once you reach the maximum of all the available clouds, then it waits; it has a wait-and-retry. OK. We still have an issue to open on the Kubernetes plugin mentioning that it doesn't work as expected, because it's a bug, literally. And can we now remove the limit we had before, the Jenkins one in the plugin? I think we left it, no? Jesse and I are in sync on the fact that we should keep both. Oh, so we, yep. We need to make sure to change both of them at the same time, so. Yep. Maybe an updatecli process could execute a script, retrieve the required values on all the repositories, determine the new value and then propose updates. OK, that's a good point. Can I ask you, Stéphane, to open a helpdesk issue mentioning that maintenance task, just so that we think about it and don't forget it, please? Yes. We had a plugin installation request by Jesse to unblock a new feature on the BOM, involving a plugin that hasn't been updated for one year. Yeah, Jesse knows what he's doing, I'm not really sure, let's see. That has been installed. A new environment variable, I'm not sure it's a big risk. It's not a big problem if it hasn't been updated for one year, I don't know.
Yeah, it's certainly not as general purpose as some other plugins that have previously been installed on ci.jenkins.io. Yeah, I mean, pipeline-utility-steps, for instance, should not have any problem. Not to mention any specific plugins that are a little too general purpose, huh? So I just realized, let me reopen: we forgot to add the plugin to the list in Puppet, in the data for ci.jenkins.io. We have a list of plugins that we recently triaged and cleaned up, and we try to add a comment on each plugin to say: hey, that plugin is used because of that. That should help. Next one: can't log in on Artifactory. It was a problem between chair and keyboard. And Keycloak performance was really bad when looking up or modifying users. So, Stéphane, can you explain what we did to fix the issue around Keycloak? That was not too hard: we just migrated the database from Amazon to Azure, if I remember correctly. Yep. And suddenly, the latency went down. Yeah. Thanks a lot, Stéphane. So the new database is managed by Terraform code, so we can track it. It's inside Azure. So, nice work. We can start thinking again about replacing accounts.jenkins.io with Keycloak, now that Keycloak is usable. That means trying to find what the missing steps were. I'd just say we have to find what the real issues were, fix them, and then dig a tomb for accounts.jenkins.io, even if it's running with JDK 17 now. Still. Now, the open issues. First, we have a lot of people with issues where they lost their password. That's weird. Yep. We have had the signup issue on the accounts app for at least one or two weeks. It doesn't explain these new issues, but it's part of them. Yeah, it's one. But it feels like... I don't know. Was there a change inside Jenkins itself, in the help on the login page, that could have pointed users when they are running their own Jenkins instance?
They forgot their username, and instead of going to their own helpdesk inside their company, something could have led them to our helpdesk. I don't know. I'm not aware of any such change, but I'm going to do a quick search of the source code to see. Yep. There is only one comment with "helpdesk" in it. Yeah, and it's in the Jenkinsfile. So no, it's not coming from Jenkins core. I don't know, it's a good question: what motivated those users of a Jenkins controller to think that the Jenkins infra helpdesk could help them? But most of these users don't even answer our request of: OK, you have lost your password, what is your account? Worse, I saw at least one where the guy said: I don't know what my username is, nor my password, you need to recover it. Right. But the correct answer back is: we've recovered it but we can't tell you, because that would be a security breach, and we don't want to breach security. It's been recovered, thank you for asking. So that's why we still have that issue, which should be taken on next iteration: add a helpdesk template for an account recovery issue. You've opened the pull request, it's just a draft for now, but I have asked for some reviews. I have included you, Mark, if you can take a look. So, I like that. Cool. Thanks, Hervé. So that might not solve everything, but we have this one that should help. And as Hervé said, there are errors when people are creating their accounts. So maybe we could update the helpdesk template saying: OK, if you see an HTTP 500, then try to recover your password with the email you used. At least that should give a bit more autonomy in that area. But we have to fix the 500 issue. So, if it's OK, it's assigned to me: I will take care of the 500 issue, unless someone wants to work on the accounts.jenkins.io code; you are welcome to do that if you want to try. OK, so I'll take care of this one. Stéphane, one quick one.
We have "migrate the Keycloak database to Azure": what's left is the deletion of the old database, that's why it's still open. I was waiting for a full week of usage, which we will reach tomorrow, so that should be OK for you tomorrow, is that correct? Exactly. Cool. Next one: the private AKS cluster. Can you give us a status about what you are looking at? I have some issues with the network part and the identity part of this cluster. So you already created the cluster, it's almost code-managed, but you have issues reaching it, right? Yeah. By disabling the authorized IP ranges on the cluster, I can access it and deploy charts on it, but it shouldn't be that way: it should be IP-restricted, so I have to fix that. OK. So I assume you might want to pair, or do you want to continue on your own? No, I'd like to pair, if you don't mind. OK. While we're in that area, we saw an issue, let me find it. So, there's been an issue opened last week by a gentleman: add AAAA records, which are DNS records for IPv6 addresses.
So, since the 1st of September, internet providers in India are IPv6 by default. That has been a government plan for years, and I assume they plan to remove IPv4 completely from their gateways in 2025. The gentleman who opened that issue points out that, in some cases, it can be troublesome to reach our mirrors from India. And as a more general fact, it doesn't seem that our public Kubernetes cluster, which hosts at least the entry point for the mirrors, but also elements such as plugins.jenkins.io, jenkins.io backends, accounts.jenkins.io and other public services, is IPv6 compliant. So it's not only adding a DNS record: we also need to implement at least an external load balancer that exposes a public IPv6 address and then acts as a gateway. It looks like that change is not that easy. Based on the first documentation I could find on Azure, we need to create a full dual-stack virtual network, and then enable dual stack at the Kubernetes level on these networks. You then don't have to use IPv6 internally, but you need that in order to have dual-stack load balancers. Incidentally, I've stumbled upon this dual-stack stuff when studying this: there is a preview feature allowing us to use dual-stack containers on Kubernetes. What, sorry, can you repeat? A feature preview of what? Of dual stack for the containers on Kubernetes; I'll give you a link. What problem would it solve for us? It's a feature preview that would let us run dual stack. Yes, I think there is a misunderstanding: we don't care at all about using IPv6 on private networks, as far as I can tell, but we have a lot of public services on the Azure cluster. Yes, but I still don't understand; knowing how to activate dual stack on more clusters can be useful, I don't know. No, OK, so let me repeat: dual-stack virtual networks have been feature complete for two years in Azure, but we missed that with Olivier. AKS dual stack by default is working and is not a
feature preview. What is a feature preview is using IPv6 internally in the cluster, when you want pods to communicate over the CNI network. That's why I've put a limit: if it's a feature preview, then we try not to use it. But it requires at least creating a new network from scratch, as you cannot convert an existing network, and then creating a new AKS cluster from scratch, dual stack. And that one allows you, when it creates load balancers, to associate a public IPv4 and a public IPv6 address with the load balancer, while keeping everything internal in IPv4. Does that clarify? Because I might not have been precise before. And so the feature preview is having the ability to do IPv6 end to end, down to the container itself, which is only an internal detail for us. Eventually, I saw a note about internal load balancers we might check, because it seems like, if we don't use IPv6 everywhere, we might have issues with this one, which is what we use for private networking things. I posted a link in the private chat, and they are also speaking about load balancers; we will check that later. OK, but yeah, IPv6 is a subject that might need some attention right now: not because we need to change everything in an emergency, but because we are on the verge of creating a new private cluster, and then we will create a new public cluster. We already have IPv4 issues on the current networks, so that means we might have to plan creating elements from scratch. That's the best time to do it, because we are starting fresh: no emergency, but if we have to do it, better to do it now and avoid breaking everything in six months. Which means we might have to rethink some parts of the topology of the cluster you are creating: we might need to recreate the private networks. We will discuss this later, sorry, we will discuss this after the meeting. Yep. Is there any question about that? I am having issues with my copy and paste. Ah, finally. Keycloak is OK. Windows agents on ci.jenkins.io disconnect prematurely. Stéphane, can you give us a heads-up
on that one? If I remember correctly, that's the one using the label "windows" only: as it's not specific enough, we cannot make sure which of the disconnected nodes are concerned, because the label is used in multiple places. So the aim right now is to try to avoid using that label, which is not specific enough, find where it's used, change it to a more specific one, and then clean that label up. Am I right? Looks good; does it make sense for everyone? So about the issue you saw, Mark: it looks like the "windows" label was randomly allocating different ACI containers, but the logs show that when the acceptance test was passing, there were other builds that had already filled the maximum amount of Windows containers we were trying to have. The thing is that this looks to be the same issue as what we have with the Kubernetes plugin: Jenkins was trying to create additional ACI containers while it should not, because the amount of builds was just tearing apart the multi-threaded accounting. So that thing is quite annoying, especially since it will for sure happen with other plugins. You checked that it wasn't happening on a plugin a few minutes later, but the build wave had passed, so we cannot be sure that some users of ci.jenkins.io aren't suffering from this one. So there are two roads here to ensure, in the long term, that it doesn't happen again. There is that plan to move the Windows-only ACI agents to a Kubernetes node pool, so we could apply the same solution as what we did with the resource quotas. Or we can also check right now if there is a way to add quotas on Azure already. We have cloud-level quotas applied to the whole account, but maybe we could say: OK, for ci.jenkins.io, with the use of technical accounts, that technical account should be able to spawn at most 20 pods. We could check this one. So we have to work on that one: we delegate the quota limit numbers to the API of the cloud we use, instead of Jenkins, because of the multithreading that is
leading to errors when it has to deal with that. In complement, not instead: better to have both. It's still important that Jenkins knows the limits, because it can back off and clean up, but we don't know if ACI provides such a feature as resource quotas. Thank you. OK, next one: ci.jenkins.io stories is not handling pull requests, opened by Gavin. On that one, we are working on unifying the pipelines between infra.ci and ci.jenkins.io. So ci.jenkins.io builds are public: you can see the build logs, and they run on ci.jenkins.io; and infra.ci does exactly the same steps, but only on the master branch. On pull requests it generates the preview sites, because the mechanism responsible for deploying preview sites must not be on ci.jenkins.io: it requires a token allowing deployment to whatever Netlify environment, and if that token is on ci.jenkins.io, it's virtually already hacked. So that's why we split that part. So infra.ci should build and deploy on the master branch, and build and deploy a preview environment for every pull request; ci.jenkins.io should build and test every commit, so that end users have visibility on what they do. The challenge behind this is that we should use the same Jenkinsfile as much as possible, and we need to ensure the same agents, same templates, same tooling on both controllers. So Stéphane and I are pairing along the road on that one, but it's not an easy one, so it takes some time. Any question on this one? What are the ones left? Release repository permissions on jenkins-ci.org: Mark and I started experimentations; we haven't reached a good enough level of confidence yet on that topic, on the way we should change the structure of the repositories on JFrog, and we were slowed down between the elections and today's advisory, so we'll continue in the upcoming week. Publish pipeline-steps-doc-generator extension indexer: so that one was closed, why is it reopened? I don't remember; we want to unify the pipeline like we did for stories. Is there any new topic that we should be aware of? One, two, three, no?
No. Is there any question? OK, I think I've taken some notes, I will upload everything. Thanks a lot, Hervé, for the notes again, that's really helpful. So let me stop the recording: first, no more screen sharing, then stop recording. And see you next week.