Hello everyone, welcome to the Jenkins infrastructure weekly team meeting. Today is the 9th of May, 2023 — sorry, I don't see time passing. Around the virtual table we have myself, Damien Duportal, Hervé Le Meur, Mark Waite, Ison Deller, Stéphane Merle, Bruno Verachten, Kevin Martens, and our visitor, Sartac — is that correctly said? Yes. Okay, let's get started with announcements. The weekly core release 2.404 is out, at least the release packages and the Docker image. As usual, the release-checklist items — the changelog and so on — will be finished a bit later, but that's a go for us to fully deploy the new version, which should fix an issue we are seeing on infra.ci where Jenkins has trouble with Unicode characters in GitHub pull requests, only with 2.403, last week's weekly. The new one should fix it. Anything else about the new weekly? Okay. Any other announcements? Okay. Let's have a look at the upcoming calendar. First of all, the next weekly, 2.405, should happen next Tuesday, like every Tuesday — so the 16th of May 2023, if I'm not mistaken. I don't remember when the next LTS is; we had one last week, so I assume it will be in a few weeks. I will put it as N/A unless someone can find it; that should leave us at least three weeks before any release. Let's check whether a Jenkins security advisory has been announced: the last one was the 12th of April, so no new security advisory. As for major events, Mark Waite and Alexander Brandes are currently attending cdCon in Vancouver; I think the last day is today, and they will be heading back. I don't know the upcoming events where we will have a Jenkins team member, so we'll put N/A unless you know of one. Nope. Anything else to add to the calendar? The next LTS will be the 22nd of June — 2.346.1. Thanks. We have plenty of time in front of us for that one. Perfect. So let's start with the work we were able to finish last week.
We had a contributor, the new maintainer of a new plugin — the Jack plugin — who looked like they were having some issues, and we were able to help them release the first version of their plugin. Now it's more of a discussion about versioning, so we can consider this issue closed. "Unable to create account": we had a user who tried to create an account. I'm not completely sure, but I think for this one the application logs mentioned a cookie. When you see the cookie error, as per the source code, it means the user already had a session on accounts.jenkins.io in their web browser. That one doesn't happen often. Usually it's when you have multiple accounts and you have not logged out of your previous account before trying to create a new one. It's treated as spam, because it means you are trying to create multiple accounts in an automated way within the same browser session. It might be another issue, of course, but usually that's what the cookie error means: the web server detects a cookie with an active session. If you hit that issue and you are a human, try cleaning up your cookies as usual, or log out from the other session properly. Then we had "Debian packages no longer published"; that one is closed as not planned. I will continue with the list on the left if that's okay for you. Renew update center certificates: we were able to renew the certificate used to sign the metadata of the update center and the tools, with success. So now we are four persons with access to the CA, the certificate authority in charge of signing this certificate. That means anyone from the team is able to generate a new certificate — usually once a year — but only four people in the world have the ability to sign the requests for generating a new certificate: Kohsuke, the creator of Jenkins; Oleg Nenashev; Olivier Vernin; and now myself, as infrastructure officer. That CA is valid for five years.
That means the four of us, for the five upcoming years, have to take care of that credential. So that one is okay. We were able to generate the new certificate and unblock the jobs, and everything is back — no issue, and nothing was broken for the end users, of course. We had a delay of six or seven hours last Tuesday, because there is an internal safety system in the update-center generation that stops building and updating the update center 30 days before the expiration of the certificate — a threshold we hit on the 2nd of May. That window can be changed using an environment variable, which has been noted on the issue for next year. I still need to update the calendar notification for next year, though. Update controllers to the latest LTS: last Wednesday we had an LTS release, so we updated all our controllers to that LTS version, with the plugins. Side note: that one included an Azure VM Agents plugin update which had a breaking change. So we were bitten by it, but we were able to fix it on all the controllers in less than one hour. We had a tiny queue of 15 builds on ci.jenkins.io due to that; as soon as it was fixed, the queue was processed immediately. "Update center is missing weekly release": that one was closed — it should have been closed as not planned — and no action was expected from us. It looks like when we have a new weekly or LTS release, the update center has to be regenerated to pick up the new version, and for that to happen a plugin release is needed. So there is a time window between a new core release, either LTS or weekly, and the release of the next plugin: the update center is not regenerated just for a core release. That's what Mark explained, and the reason why there was no action for us. Just to be sure everyone follows me, because that might have been too quick: since we weren't able to successfully build the update center, no new plugins were updated, so the new core release wasn't taken into account.
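As an aside, the 30-day safety window described above can be checked by hand with openssl. This is a minimal sketch, assuming the signing certificate is available locally as a PEM file — the path is a placeholder, not the real location:

```shell
#!/bin/sh
# Warn when a certificate enters the same 30-day window that halts
# update-center generation. The certificate path below is hypothetical.
CERT="${1:-update-center.crt}"
THIRTY_DAYS=$((30 * 24 * 3600))

if openssl x509 -checkend "$THIRTY_DAYS" -noout -in "$CERT"; then
  echo "OK: more than 30 days of validity left"
else
  echo "WARNING: expires within 30 days; update-center generation would halt"
fi

# Print the exact expiry date, e.g. to set the calendar reminder:
openssl x509 -enddate -noout -in "$CERT"
```

Running something like this against the real certificate once a month would back up the calendar notification mentioned above.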
As soon as the update center was updated with the new certificates, everything went back to normal. Any questions so far? Okay. Then we had disabling PagerDuty notifications on Datadog warning notices — thanks for taking care of this one. That's far fewer alerts for us, especially when a machine is using, let's say, 80, 81 or 85% of its hard drive: it's only a warning, it's not blocking, it's not an emergency, it only degrades IO performance. And thanks for adding that nice bot that reminds us every day to check the Datadog monitors for warnings, since we don't receive alerts for them anymore. That's all for the tasks we acted on. Among the not-planned ones, we had the Debian packages one: a user was relying on an accidentally-working URL for the Debian packages. They should not, and they have been told to use the official URL instead, so they should not have any issue in the future. We had a user who, it looks like — thanks for taking care of that, folks — forgot their password; someone mixed up our LDAP with their company's LDAP. They wanted to get access to their own Jenkins, but yeah, nothing we can do for these people. And again, the missing update center: that's what I explained earlier — one issue was focused only on the update-center generation, and the other was a consequence of the certificate. Any questions whatsoever? Okay, let's move on to the work in progress. First of all, taking them in order on the left: increase disk space for the system pool. We have a private Kubernetes cluster hosting private Jenkins controllers, and the system pool is a node pool — a collection of virtual machines — where the technical services run; you can see them as the plugins of the cluster, in charge of taking care of it, for instance DNS. These machines initially have 30 gigabytes of system disk.
We received warnings and alerts that these disks are heavily used and almost full. The goal is to increase the disk space so these machines get the expected performance. We tried just before this meeting with Stéphane, and we hit an issue: changing the default system node pool of an AKS cluster requires destroying and creating a brand-new cluster. That looks like an Azure constraint, which means we cannot simply grow that disk. We want to keep trying, because the documentation is not really, let's say, rich — we are missing information. We might be able to use special Terraform attributes to let the Azure API create and recycle between two system node pools, or we can try creating a system node pool on our own to see if it works. Still not sure how we could do that, but it might require a recreation: changing this resource means a destroy and recreate of the cluster. Another angle is that we don't actually need that much disk: most of this space is consumed by the Docker images of the services running there. And we discovered another issue: some of our bots are running on this node pool when they should not — they should run on the Linux pool. If they no longer run there, we expect the disk usage to decrease. So the proposal is that we first move the workloads that should not run on the system pool; eventually we might need to add a specific node pool for that. Once we no longer have those images — the nginx ingress and the three bots — and we only leave the Azure default DNS and CSI pods, we shouldn't get any warnings anymore. Does it make sense? Do you agree, or do you have other ideas? Okay, so let's start moving workloads to decrease this usage. By the way, we took the opportunity to upgrade the Linux pool disk size.
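One detail worth making concrete before the numbers that follow: Azure managed disks are billed by size tier (…, 32, 64, 128 GiB, …), not by the exact size you provision. A small sketch of that rounding, with the tier progression assumed to double from 4 GiB:

```shell
#!/bin/sh
# Azure managed disks are billed per tier (4, 8, 16, 32, 64, ... GiB):
# a 50 GiB disk is billed at the 64 GiB tier. Round a size up to its tier.
next_tier() {
  size=$1
  tier=4
  while [ "$tier" -lt "$size" ]; do
    tier=$((tier * 2))
  done
  echo "$tier"
}

next_tier 50   # the old Linux pool size is billed at the 64 GiB tier
next_tier 30   # a 30 GiB system disk stays on the 32 GiB tier
```

Which is why growing a pool's disk from 50 to 64 GiB costs nothing extra.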
That was set to 50 gigabytes and we were using 80, 81%, so we were just at the warning limit. A nice reminder from Stéphane: in Azure, for hard drives and SSDs you pay by size tier, not by exact size — I forget the English word, but the tiers are 32, 64 and so on. We were using 50 gigabytes, which means we can grow up to the 64 limit and pay the same: whether you have a 33-gigabyte disk or a 64-gigabyte one, the price is identical. So that's why we increased it to 64. That should also remove the last warning for these machines. So, the upgraded Linux pool — looks good to you, folks? Okay. Next issue: we had a user building a plugin on ci.jenkins.io who hit a problem. What they are doing is that their plugin needs a jar dependency from Maven, and that jar is part of the GitHub source code. Their pom.xml declares a Maven repository with a specific ID that points to local files. If you build locally on a machine, that works; what I discovered is that our mirroring setup tries to retrieve the artifact from the ACP, our artifact caching proxy. I would have expected Maven to detect the file:// scheme, imply that it's local, and not mirror it, but it doesn't seem to work that way. So I've opened two pull requests, to be reviewed by you folks, that add two IDs to the list of exceptions: the one they use, and "local", so anyone else will be able to do the same trick by defining a repository with an excluded ID. Once merged, we should be able to try this and validate the fix. So yeah, I need a review, that's all, and then I will take care of testing it and getting back to the end users. Any questions? Migrate trusted.ci.jenkins.io from AWS to Azure — Stéphane, can you give us a status report on that task? It's your biggest one.
Since the end of last week, we now have the three VMs set up in Azure in the same network, each in its own subnet: the bounce VM, the controller VM and the permanent agent VM. Now I still need to work on the network security configuration, and then I will have to start on the actual migration of the data. Next: work on the security groups, then start migrating data. Okay, cool — start preparing the migration first, but yes, migrating data. May I ask you to also start the Puppet part? I don't remember if I've written it down, but I validated locally, with a Vagrant virtual machine, that Ubuntu 22.04 with the Jenkins controller profile on Puppet works perfectly — I mean, it's Docker containers, so they should. I need your help, because I've never added a new VM in Puppet. I know that we need to register it on the Puppet master, but I may need some help. — No worries, I know how to do that. — Let's see; either of us should be able to help, no problem. — Okay, I've written down that you need help to bootstrap this part. Makes sense. Puppet works perfectly on Ubuntu 22.04 for the Jenkins controller; goal: add this VM once the security groups are okay, and before the migration, of course. Yeah. Any question on that one? Nope. Okay. "Can't access tools account" — another user, for another plugin release. Most of the time we have this issue... While we have you here, Kevin and Bruno, I might have a request — I'm making it without thinking it through, but we might need help. We might need to catch these two contributors, of two different plugins, to check whether there is something in the documentation to be fixed and improved, because we discovered that Artifactory has changed its UI. Part of our documentation explains how maintainers get their encrypted Jenkins password to put in their Maven settings when doing a manual Maven release, and that part does not seem to match the UI anymore.
We might need to check with JFrog, or at least remove the part of the documentation that guides users step by step with screenshots, because it's a brand-new version of Artifactory. The curl command line still works perfectly. Yeah. So that part is fixed; and one of these users seems to have hit a lot of hiccups and started from scratch, so that one might be the best target. That said, the user did not read the steps carefully, so a lot of the mistakes were human. Still, it might be a way to improve the current documentation for brand-new contributors. I'm not sure. Do you know if the Artifactory UI at JFrog is stable enough for us to update the documentation, or should we wait? — I have no idea. I guess... right now it seems exceptional, in the sense that they have been advertising since December the brand-new web UI you can switch to, and it looks like the part where you go to generate the Maven settings file, despite sticking to the classic UI, is now using the new UI. I think that's the recent — and unexpected — breaking change, but I might be wrong. So there is the fix about the UI, because that one we need to correct in the documentation even if we end up saying, hey, it's command line only; I honestly don't mind. And the other one is maybe room for improvement, but we need time to check it. I just wanted to share that with both of you. Thank you. If you have any questions, don't hesitate, now or later, on these topics. Both users were able to release their plugins. I haven't closed this one because I haven't checked the last feedback from the user, though it should be fine, so it might be closed today or tomorrow — no expectation. Okay, I'll check it one last time. Any questions? Okay. The next one: make the environment and description fields mandatory for bug-type issues.
As requested by Alex, he wants to add two mandatory fields when someone opens a bug on the Jenkins issue tracker, on Jira. It looked like there had been no objection on the mailing list, so I tried to... Yes, sorry? Daniel made an objection about the environment field. I thought there was an objection to one of the two fields; I might have missed it. Okay, I will check with Daniel, but in any case I need help from Daniel or Mark, because I tried to understand Jira and I failed, and I'm not really willing to learn Jira — honestly, my brain is not wired for that. So I've asked Mark for help on that topic, for when he is back from Vancouver. Thanks. — It looks like the environment field is the problematic one, as it would need an exhaustive list of environments, and that list is not static. — Okay. — It's technical; it has been explained on the mailing list. — Okay. I think I will move that away and leave it to the board or the core developers with admin rights, because it's not really related to infrastructure, and honestly I don't know how to do it, so I need someone to teach me. "Fail to deploy artifacts": that one should be closed as well; I need to check one last time. If it's okay for you, I'll close it. And you — is this person continuing to ask you questions about versioning, and you keep answering them? — I don't think it's... I'm not sure. No, it's another one. — Okay. No, that's still the same user. So I should be able to close it because it's a duplicate of... oh no, it's a manual one. Okay. That person still has a misconfigured Maven settings file — good point. So that one stays open and needs work, or at least pointers, from us. We have a password reset that should be okay; no answer from the user, so I will wait 24 hours before closing it without an answer. I'm not sure what that person's problem is, honestly, because they were even given the link where to reset. — I haven't seen any mail sent after a password reset. — Okay, thanks for checking.
Can you add a comment on the issue to say that no email was sent? — Was there an account existing with this email? — I haven't checked that; I checked the account's existence. So both of our diagnostics point to the same thing: either the person never submitted the password-reset form, or the person is not who they pretend to be. In either case, without any answer, we'll close it tomorrow — so no expectation on this one. "Past releases sites are taking a long time to load": that one I will move back to the backlog, because it's working and we were able to give the end user a solution for their specific problem. Initially it was answering 503 errors and it was slow; the errors have been gone for three weeks now, and one of the two links is really fast while the other is still slow. We need to profile where the performance hiccups are to see how we can act, and for that we need to use Datadog. That's something assigned to Hervé, if I'm not mistaken. Just a minute... So, Hervé, is it you who is in charge of connecting the Apache serving the mirror to Datadog, to collect specific Apache metrics? — No, I haven't done anything about that. — Okay. Are you willing to do it, or shall we take the issue? — Yes. — Will you have time for the upcoming milestone? — Maybe. We'll see. — If it's a maybe, then it's a no, and we move it to the backlog and see later. — I don't know. — Okay, if you don't know, it goes back to the backlog. There is no emergency or blocker now, and we cannot take 20 or 25 issues without being sure we can handle them — that's the idea. I prefer that if you don't know, it's a no, we don't take it, and we do what we can. The next one — unless there is a question on the performance issues. Oh, I haven't shared the solution: that person was trying to programmatically list the weekly releases, and they used the link from the download page on the Jenkins website.
I pointed them to the maven-metadata.xml file from Artifactory, which is the source of truth — it's the first thing updated when a new WAR is released there. So they should only have to curl that XML file, parse it, and extract the list. We already have shell code doing exactly that for the official Docker images; that's what I pointed that person to. It looks like they are using Golang, but I mean, it's a file to fetch, parse and process. So they have a way to query the correct source of truth with acceptable performance; that's why it's not a blocker anymore. Good for you? Then, analytics. I haven't heard back from Olivier about the Google Analytics properties we need to migrate. Olivier gave me administrator permission on the property, but we need the full administrator role, and it does not seem that Olivier has full admin either. So, yeah, we'll see. At the same time, Gavin proposed again that we move to Matomo, a self-hosted analytics platform, instead of Google Analytics. That could be interesting, because he has been trying it successfully in parallel for the past months. So I asked him to open an issue for Matomo listing the technical requirements, to see which cluster we should host it on: do we need a database, how much data do we store, et cetera — it's self-hosted, so that would mean a new service on the public Kubernetes cluster. — Do we know if we can import the existing data? — No idea; you can ask him. I mean, it would still be nice to keep access to the analytics data, even if we don't use Google Analytics afterwards. — Yeah. Worst case, we install Matomo and we start from scratch with a new set of data. — We can always request the data until we start using Matomo, and then... So, yeah. That one goes back to the backlog until we have an answer from either Gavin or Olivier, because no action is expected from us; it will go back on a milestone if we have to act.
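Before moving on — to make that release-listing answer concrete, here is a minimal shell sketch of fetching and parsing the Maven metadata. The Artifactory URL follows the standard Maven repository layout for the jenkins-war artifact, but treat both the URL and the grep/sed parsing as assumptions rather than the exact script used for the Docker images:

```shell
#!/bin/sh
# List Jenkins WAR releases from the Maven metadata (the source of truth),
# instead of scraping the download page. The URL assumes the standard
# Maven repository layout; adjust if the repository path differs.
METADATA_URL="https://repo.jenkins-ci.org/releases/org/jenkins-ci/main/jenkins-war/maven-metadata.xml"

curl --silent --show-error --fail "$METADATA_URL" \
  | grep '<version>' \
  | sed -e 's/.*<version>//' -e 's#</version>.*##'
```

A real XML parser (xmllint, or Go's encoding/xml since that user writes Golang) would be more robust than grep/sed, but the idea is the same: one file to fetch, parse and process.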
Worst case, Google Analytics will continue working: it will automatically migrate in June, I think, or July. Even though there is a scary message, the administration details tell me it will be migrated automatically, eventually with new settings — that's why they recommend doing it manually and checking. I have no idea how that piece of crap works, honestly. If we could get away from a Google piece-of-crap service, I would be really happy, to be quite honest. Yeah, we might lose a few features, but collecting data for analytics is not something we should do anyway. The advantage of Matomo is that it's clearly way better at respecting the privacy of our users; but we have to host it and store the data — that's the double-edged sword here. So that went back to the backlog until we hear news from Olivier or Gavin. Any questions? To be quite honest, I have no idea: is anyone here using the data from Google Analytics for the Jenkins platform? — Never did. — It might be useful, but I have no idea what data is tracked. I assume the website tracks users and their paths, but no idea. I'm not sure what the goal would be, honestly. — Worth asking the question. Maybe it's a naive question — I have no knowledge of these tools — but how would you use them, folks? That's not for us; that's for marketing purposes, not for infrastructure. — The idea is to point out where in the world we are trending, where there is, I don't know, interest. There is a lot and a lot of data in Google Analytics, so someone who knows how it works can extract some sense from it. But as you said, we are not the ones who will use it. — Yeah. It's like ci.jenkins.io: we don't use ci.jenkins.io directly.
We are not plugin maintainers, but we need to know the use cases of the people using the infrastructure we maintain — that's why I'm asking the question. Hervé, do you have previous experience with Google Analytics? — Not in this context, so... — Okay. I will ask Gavin on the IRC channel; he might have answers for that, though I assume not. Okay. Next issue, currently open: use a new virtual machine instance type for ci.jenkins.io. Work in progress — sorry, I've left a lock on the Azure Terraform state; I need to start working on it again. I was able to find a downsized instance with a bit less memory. The price difference is low — it's like 40 bucks per month for the VM itself — but it will be a virtual machine of the latest generation. Size found, yeah; I have almost everything ready for creating an empty shell, the same idea as what Stéphane did with trusted.ci. So we should be able to do that soon: VM ready to be created, needs peer review. Almost there; I don't have any specific question on this one. No? Okay. Next topic: migrate applications from the system pool to the Linux pool on the private Kubernetes cluster. The scope is the three bots we are hosting. I assume we weren't able to work on this — it was a long weekend in France — so we still have to work on it. Is it still okay for you, Hervé, to take it this week? — Yes. — Okay. This one was delayed by the core release and the long weekend. It should solve the issue of the system pool using too much disk, as a reminder, so it's important and we should push it through. "Can't create account", as usual — I will pass on this one; it needs to be checked. Add Launchable to the agents — Hervé, could you give us a quick status? It's your big current task. — I've got the module installed on a Nano Server image, but I've got two errors — not blocking, but two errors. When I run the module from Python directly it works, but it doesn't when I call launchable alone. I'll try to add the DLLs corresponding to the two functions.
I see them in the System32 of Windows, but then I have another error related to CoreCLR, which is PowerShell Core. — Okay. Do you think we need to accept that we will have to use Windows Server Core instead of Nano Server? — I don't know. Maybe it's acceptable to have two errors listed in the log: if Launchable is working as intended, maybe we don't care about them. I don't have any information about it. — Okay. Can you check with Basil whether it's working for him in the current state? I propose that we release your latest changes with the DLLs, to at least fix some of the errors. — When I add the DLL, that error is blocking. — Sorry, I misunderstood. The question is: with the currently deployed image version, we would have Launchable with the initial error, but that should not be blocking, because it still looks like it's working, right? — It seems so. — Okay. Can you check with Basil the next steps for testing? If it doesn't work, we'll use Windows Server Core and that's all. The impact is a longer startup time for the agents on ACI, but that's something we are going to face anyway, because we need to move these images to Packer, which requires building on Windows Server Core — we cannot use Nano Server for the all-in-one agent. At some point in time, we will have to migrate to Windows Server Core. In a perfect world, we would do that once the ACI workloads have been moved to Kubernetes, because once we have the image on a single node it will be cached, which is not the case with ACI, which downloads the image on each call. That should make it pretty much invisible; but if we have to, we'll start with Windows Server Core first. Is that okay for you? Or do you want to hand over the task, because it has been a lot for you on that part? — No, it's okay. — Okay. Let's... Zoom is capturing my keyboard. Let's check with Basil whether the current state works.
Otherwise, we'll have to use Windows Server Core as the base image. Just one quick question: part of this means fixing the docker-inbound-agent repository. Do you think you will have time to work on that today or tomorrow? It's not a blocker, just a question about timing. — Yeah, it's okay. — Thanks. "Artifact caching proxy is unreliable": so, we had the DNS issues before; I haven't seen more issues on DigitalOcean, which is good news, and the Azure part should be solved by the ci.jenkins.io migration. However, based on the latest changes Tim Jacomb made to the Azure VM Agents plugin, we should already be able to migrate the ci.jenkins.io agents — at least the virtual machines — to the same network as the ACP. That's why I'm keeping the issue open. It should be a separate task, because the ci.jenkins.io virtual machine migration might be done quickly, or it might take weeks — I'm still not sure. So, if you don't see any objection, I will work on migrating the VM agents — or anyone interested can take it — as part of the artifact caching proxy reliability work, with the new inbound Azure VMs. That's why I'm keeping it on the incoming list; I will add a comment: it should be possible to migrate the ci.jenkins.io VM agents to the ACP network. No questions? The next one: add the publick8s cluster. Hervé, is it okay for you to start working on this one? — I don't remember; I might have said I wanted to start on it, and I did not have any time during the past week. — Yeah, we have to make the migration plan and start executing it. We can do one service per week, or that kind of rhythm; as a rule of thumb, that should not be a problem for most of the services. Plan the migrations, pick a service, move it, and iterate. Okay, is it okay if I assign you? And if you need help or a review, don't hesitate to ask. Looks good to you? Clean up, import and manage Datadog monitors in Terraform — can you remind me of the status of this one? — I haven't worked on it.
There are several unmanaged or duplicated Datadog monitors, and I've created this issue to manage them in Terraform. — Okay, do you think you can work on it in the upcoming milestone, or should we put it back on the backlog? — We can work on it. — Yes, okay, so we keep it for the next milestone. "Temporary name resolution failures in plugin BOM builds": we need more investigation on the CoreDNS embedded in the EKS clusters. We already removed the Datadog agent custom probes that were making requests outside on cik8s; we need to do the same for the public EKS cluster, otherwise it generates a bunch of false-positive errors. Still, we need to understand why the CoreDNS component is not able to resolve all the outbound DNS requests. Is it not powerful enough? Was it a transient error on the AWS network? We don't know; we need to dig deeper. There isn't anything else of use for now. We talked about security groups blocking the outbound DNS requests, but it does not look like that's the case. So we'll have to try a bit more with a debug container on the node pools and see what happens. I think that's all for the current milestone. Stéphane, we talked about ARM64. — Yeah, I probably forgot to put it on the milestone. — Okay. I won't put it on this milestone since it's not finished, but can you report where you are? We'll add it to the new items. — I managed to try the new Azure ARM64 VMs that are built by Packer, and they are now used on infra.ci.jenkins.io. It's working great. So we will be able to remove the last AWS element on infra.ci with those ARM64 VMs. — So that's for infra.ci and Packer; next step for infra.ci: remove the AWS ARM64 agents. Okay. Is it okay for you to then continue on ci.jenkins.io — add the new template and remove the former AWS one? — Yes, with pleasure. — Do the same on Azure for ARM64 and, once it's done, clean up the Packer images to remove any AWS code: we don't want to build AWS virtual machine images anymore. — Yeah, we have the CLI update too.
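Back on the CoreDNS investigation for a second: the "debug container on the node pools" idea would look roughly like the sketch below. The pod name, the dnsutils image and the hostname being resolved are assumptions for illustration, not commands we have agreed on:

```shell
#!/bin/sh
# Sketch: run a throwaway pod on the EKS cluster and exercise DNS
# resolution repeatedly, to see whether failures are transient.
kubectl run dns-debug \
  --image=registry.k8s.io/e2e-test-images/jessie-dnsutils:1.7 \
  --restart=Never -- sleep 3600

# In-cluster name: exercises CoreDNS only.
kubectl exec dns-debug -- nslookup kubernetes.default.svc.cluster.local

# External name: exercises CoreDNS forwarding plus the VPC resolver.
for i in $(seq 1 20); do
  kubectl exec dns-debug -- nslookup repo.jenkins-ci.org >/dev/null \
    || echo "failure on attempt $i"
done

kubectl delete pod dns-debug
```

If the in-cluster lookups are solid but the external ones fail intermittently, that would point at the forwarding path rather than at CoreDNS capacity.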
Looks good? I'll add this to the upcoming milestone. Among the upcoming tasks, I'm going to work on this one, because I already started the Ubuntu 22.04 campaign — I had removed it because I knew I would be off, so no time to work on it. The next step here will be the node pools on AKS: since we are already playing with the system pools and everything, the goal is to start creating new system pools and new node pools and migrate everything, blue-green style, to Ubuntu 22.04 instead of 18.04. I won't go further on that topic, because we already have a lot of things. Do you have other topics you want to work on in the upcoming milestone, or anything on the backlog that looks important to you? Let's have a look at the incoming issues. We have a new one: registered with the wrong email account, password reset. I don't see any other new issue to triage. Anything else you want to work on? Do you have other topics to bring up? I'm going to stop sharing my screen. If it's okay for all of you, we'll see each other next week — or later today or tomorrow for the people who work with me. Have a nice day, have a nice week. Thank you.