Let me start the recording. Recording has started. How awesome. Hi everybody, welcome to this new Jenkins infrastructure meeting.

Today we have quite a few announcements. The first one is that we are changing the meeting time. Several contributors asked to move it because it did not work well for them, and we want to have as many people as possible. So now we are going to do it on Friday afternoon. If we have to change it later on, feel free to request that and we can re-evaluate, but for now, Friday 2 p.m. UTC is the new time for the weekly meeting.

The second point is that next week we're going to have a Jenkins LTS release on Wednesday. It will be version 2.303.1. What is important is that we have one remaining issue to fix. And more importantly, this is not the right time to modify anything related to the Jenkins LTS or the release environment. So we just ask you not to change things that may affect the release environment. That's all I ask.

Otherwise, in terms of announcements from last week's news: we officially started using Java 11 as the default Java version for our Docker images. That's part of the latest weekly release; I put a link to the blog post that announced that change. Also, thanks to Damien and Tim for your work on the Docker images: we can now officially support different architectures such as arm64, ppc64le, and s390x.

And the final announcement I have to share: we are investigating some issues regarding repo.jenkins-ci.org with JFrog, and updating the service to the latest version available may fix them, as the release notes mention some improvements regarding the time it takes to copy artifacts between Maven repositories. So we are planning to do that upgrade next Thursday, which means that repo.jenkins-ci.org will be down for around 20 minutes. We still have to decide on a time with JFrog, but we are planning to do it after the LTS release. Any questions?

Just on the s390x topic, I'm delighted to report that I've had personal contact from a person at IBM who's interested in contributing more on the s390x side. So I've been guiding her to submit an acceptance test into the infra acceptance-test repository to verify s390x installations and do that kind of thing. She's very interested in being involved and is working with her management on what it would mean to be involved. That's great. That's great news. Yeah, that's awesome. Any last comments before we start?

So, to the agenda, I put six topics to pick from. The first one is about ci.jenkins.io. Tim and Damien have been working to deprecate ACI, the Azure Container Instances. A bit of history here: we were using container instances on ci.jenkins.io to run various workloads, like Maven agents and so on. We faced many issues with them in the past, and they ended up being quite expensive for our use case; we are still in the process of reducing the cost of our Azure account. So we decided to move towards using Kubernetes agents. A lot of work was needed to switch to Kubernetes agents, obviously: changing the Jenkinsfiles, changing the agents, and running various tests. And it appears that we achieved a major milestone in that domain: we are now officially using Kubernetes agents on ci.jenkins.io. At the moment we are using a Kubernetes cluster running on our Amazon account, and we should have more coming so we can spread the load across different Kubernetes clusters. That's what's coming.
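As a hedged illustration of the consumer-side change this migration implies for a plugin repository's Jenkinsfile (the parameter names come up in the discussion that follows; treat the exact spelling as an assumption), the move from an ACI agent to a container agent might look like:

```groovy
// Jenkinsfile of a typical plugin repository built on ci.jenkins.io.
// Before: requested an Azure Container Instances agent (deprecated).
// buildPlugin(useAci: true)

// After: request a container agent, now scheduled on the Kubernetes
// clusters behind ci.jenkins.io. "useContainerAgent" is the assumed
// replacement name discussed in this meeting.
buildPlugin(
    useContainerAgent: true,          // replaces the deprecated useAci flag
    configurations: [
        [platform: 'linux', jdk: 11], // runs in a Kubernetes pod
        [platform: 'linux', jdk: 8],
    ]
)
```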
Damien, do you want to add anything on this topic? Thanks to Tim, who guided me. We have updated the runbooks and documentation. There is still one last step in the pipeline library: the idea is that we are going to deprecate the useAci attribute for developers using the pipeline and use useContainerAgent instead, because ACI makes no sense right now. And here I'm not really at ease about how we could help the hundreds of plugin pipelines to update the parameter. Right now the deprecation is a soft one, which means it will keep the old value, set the new value from the old one if it still exists, and print a warning on the console. Tim, it seems like you had some ideas, or maybe experience, on how to do that, because we have hundreds of pipelines to update. I'm not sure how we can batch pull requests to all those repositories. That's a challenge we'll have to solve.

Damien, isn't that just a matter of a "gh pr create" kind of thing? So admittedly it's a scripting challenge, but it's just a bulk scripting exercise. The problem then is that we have to persuade people to actually merge the pull requests, and I would assume that's the bigger challenge. I don't think we have to worry about the merging; we just create the PRs. Ideally, so: I did this for the checkstyle and findbugs changes. Some of them got merged straight away, some took a few weeks, some still come in every few months. But after six to twelve months or so, I just closed them, assuming they were abandoned. Most of them get merged, and for the ones that don't, if they've got useAci, they're probably the more likely to be maintained anyway. It's more about helping people than relying on it, because there is a compatibility layer.

I had thought about using gh pr, and yeah, in fact, that's not that hard. I'm interested in playing around with this; I've never done it with such a number of repositories. So if everyone is okay, I'll try to do this. That makes the most sense. Just get the list, which shouldn't be too hard.

And then there was your idea, Tim, to also have a check in the pipeline library. I like that idea. So the idea is not only to print a warning, "you should switch to such-and-such", in the pipeline output, but also to add some kind of GitHub check on the pull requests that warns the end user that there is a deprecation; it should be visible inside the pull request checks. That should be quite easy too. And the idea is that, for now, such a check should not fail the pull request. We send a batch of pull requests, and in one month, after communicating properly, we start to fail pull requests because of it. Then, in two months, the deprecation campaign should be finished. We do that automatically at work: we just set a date in the code, and it automatically changes from a warning to a failure, so you don't rely on someone updating it to fail. Exactly. That sounds like a good idea. (A sketch of what this could look like in the library follows after this discussion.)

And in terms of costs, Olivier and I checked a bit earlier today what we have on Datadog. We saw that the auto-scaling limit for that cluster is 50 machines maximum; that was completely arbitrary. The goal was to make sure, and to convince ourselves, that we were able to replace ACI with a single Kubernetes cluster. The numbers we saw on Datadog show that, in the worst case, we never reached more than 36 machines at the same time, based on the autoscaler.
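Returning to the deprecation mechanics discussed above: a minimal sketch of the soft deprecation, including the date-based switch from warning to failure, as it could look inside the shared pipeline library (variable names and the cut-off date are hypothetical; the real implementation may differ):

```groovy
// Sketch for the shared pipeline library (e.g. vars/buildPlugin.groovy).
// Soft deprecation of useAci: honor the old flag, derive the new one
// from it, warn, and turn the warning into a failure after a set date.
def call(Map params = [:]) {
    if (params.containsKey('useAci')) {
        if (!params.containsKey('useContainerAgent')) {
            // Keep backward compatibility during the transition period.
            params.useContainerAgent = params.useAci
        }
        String msg = "'useAci' is deprecated, please use 'useContainerAgent' instead."
        // Hypothetical cut-off date ending the deprecation campaign.
        Date cutOff = Date.parse('yyyy-MM-dd', '2021-11-30')
        if (new Date().after(cutOff)) {
            error(msg)              // hard failure once the campaign ends
        } else {
            echo("WARNING: ${msg}") // soft warning until then
            // The GitHub-check idea could additionally surface this via
            // the checks-api plugin's publishChecks step, marked neutral
            // so it stays visible without failing the pull request.
        }
    }
    // ... rest of the logic reads only params.useContainerAgent ...
}
```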
So that gives us a first, let's say, baseline. We know that by default we have only two machines, and the autoscaler stops a machine 10 minutes (that's the default value) after no pods have been scheduled on it. Which means the BOM builds and the Jenkins core PR builds that happened during the past 10 days never required more than 36 nodes, at roughly two pods per machine. So we now have an idea of how many workers we need on one cluster.

But I've seen cases where there were upwards of 200 jobs in the queue, in the days of ACI. And what you're saying is we never exceeded 36, and yet we did process a BOM build? No, no, those are not the nodes. What we are looking at here are nodes, and on one node you can have multiple jobs, multiple pods. What we wanted to test was: right now, by default, we have two nodes, and those two nodes can run multiple jobs. If we enable auto-scaling, that means that if we don't have enough capacity on one of the nodes, we just create a new machine in the Kubernetes cluster. And what we checked was how far that could go; Damien put a limit of 50 nodes maximum to run all the jobs. There are also 100 agents maximum configured on the Kubernetes cloud on the Jenkins side. The idea is that, in the worst case, you have 100 requests for starting pods, and then, based on the allocation on Kubernetes, the autoscaler decides how many machines it needs. Which means maybe we reach the 100 requests, but by the time the autoscaler starts adding machines, some of those jobs or steps have already terminated. So, taking into account the time for a worker to start, join the cluster, and be able to handle pod allocation, the workload we have does not need that much; it does not need 50 nodes. That's a good indication if we want to distribute this load across different Kubernetes clusters in the future, because the idea is to have, let's say, one of these clusters on DigitalOcean, one on Scaleway, one on Amazon, so we can spread the cost across different cloud providers.

So now the question is: what should be the size of the different Kubernetes clusters in order to handle the load? And there is also something related to the orchestration mechanism in Jenkins, because clouds have a sticky nature: if a job runs on a given cloud, there is a kind of stickiness for the same job and branch, because the hash includes the branch. So for the master branch of a given job, for instance, Jenkins will try to always reuse the same cloud. However, for pull requests, the first build of a given pull request selects randomly. So the question is: are we able to balance between different Kubernetes clusters? That might be one of the challenges. One of the ideas is to artificially decrease the maximum number of requested pods for a given cluster. We have 50 right now. If we add 10 machines on Scaleway, then we should go down to 40 for AWS, which means that if Jenkins asks for 50, it will fill the queue for Amazon and then, in the worst case, start filling the queue on the second cluster. But my knowledge of the scheduling algorithm, with clouds in particular, stops there; I'm not sure how Jenkins handles it. So that will be an interesting experiment. (A toy sketch of the intended overflow behavior follows below.)

Thank you. I'm just sharing some information; I'm not sure how fast it will load. I would like to show you what we used to analyze that information. We use Datadog to collect metrics on the Kubernetes agents and a lot of different things, and that's what we were using to identify the state of the cluster.
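As a toy illustration of the capping idea above (all numbers, names, and the allocation function are hypothetical; Jenkins' actual cloud scheduling is more involved), the hoped-for behavior is fill-one-queue-then-overflow:

```groovy
// Toy model: each cloud advertises a cap on concurrently requested pods.
// Requests fill the first cloud up to its cap, then overflow to the next.
def clouds = [
    [name: 'aws',      maxPods: 40],  // lowered from 50 once Scaleway joins
    [name: 'scaleway', maxPods: 10],
]

// Hypothetical helper, not real Jenkins scheduler code.
Map<String, Integer> allocate(int requested, List clouds) {
    def plan = [:]
    clouds.each { cloud ->
        int share = Math.min(requested, (int) cloud.maxPods)
        plan[cloud.name] = share
        requested -= share
    }
    return plan
}

assert allocate(50, clouds) == [aws: 40, scaleway: 10] // worst case: both full
assert allocate(30, clouds) == [aws: 30, scaleway: 0]  // AWS absorbs it all
```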
It's loading; hoping it doesn't take too long. So this one is the dashboard that monitors ci-kubernetes, which is the Kubernetes cluster used by ci.jenkins.io. In this case, we were looking at nodes in a Ready condition. You see that most of the time we only have two nodes, and from time to time we have peaks, like 27 to 36. Our objective is to monitor this behavior better, to understand the right number of nodes that we want to have on the cluster. And yeah, each node can run multiple pods. So let's go back to the notes. Any other questions regarding this one? None from me. Thank you.

Overall, as you said, Tim and Damien, it sounds like it's working. So if it causes any trouble on your builds, don't hesitate to raise the issue as soon as possible. The build times are a bit slower than with ACI because we have limited the capacity, but it's the same order of magnitude: around one hour 20 minutes now, versus one hour and a few minutes before. Well, ACI was unusable the last few weeks, so the time doesn't matter. Yeah, exactly; it's an improvement anyway. I showed Damien my nine failed builds in a row due to ACI issues, so it was unusable. So it's working better than expected and it solved some issues. I'm quite happy with the outcome personally.

Next topic, which is about archives.jenkins.io. Several weeks ago, I deployed a new machine on Oracle Cloud. The idea was to move archives.jenkins.io from the Rackspace account to Oracle Cloud. It's working very well. For get.jenkins.io, I had to do a minor improvement because the archive was not listed on the mirror list anymore, which I fixed at the beginning of the week. But the most important thing here is that it's working very well and it's a lot cheaper than what we had on Rackspace. That's probably partly because we never updated the machine type on Rackspace, so it was a very old one; but what we were paying on Rackspace was around $800 per month, and at the moment we are close to $60 per month. I think one of the reasons it's a lot cheaper is that Oracle Cloud doesn't charge for network bandwidth. I'm not sure how long they will keep doing that, but at least for now it makes it really cheap to run on Oracle Cloud.

It's very slow to download from, though. I tried to just download a Debian package and it said 20 minutes to go. That's interesting. Compared to before, yeah. I'll add a link if anyone wants to try to download it; it's insanely slow. So maybe we have some improvement to do there. "Download speed is slow", yeah, you already put that information in the notes. The good thing is we still have the old machine, so we can easily test and compare the speed. That would be really interesting. The thing is, on Rackspace we cannot stop the machine; we can only delete it. So before doing that, I was keeping it as it is for tests. So if you notice such issues, feel free to share; I don't want to delete the machine too quickly. I can confirm the same as Tim observed: it's dramatically slower now than it was before, and it's not clear why. Previously it got much better bandwidth than it's getting today. We have to double-check that. But keep in mind that we have some restrictions on the service: depending on how many people try to download packages, that may happen, because we have a specific configuration in Apache to limit the network bandwidth on the machine.
And because it's used as a mirror, maybe that's the reason why it's slow. But that limit could be removed, because it was there mainly because network bandwidth was quite expensive on Rackspace, which is not the case anymore, and we have a bigger network interface than we used to have. So yeah, we probably have some fine-tuning to do there.

Next topic, which is about costs. I updated the Google Sheets, and I saw some interesting behavior. The first one is that the Azure account cost is decreasing. One year ago, we managed to stay around 10K; then over the following months we increased up to 30 or 40K, and now we are going back to 10K again, which is nice. But at the same time, what I noticed is that the Amazon cost is increasing, which makes sense because we are replacing ACI with EKS. So we have some improvement to do there. We were looking with Damien at whether we can save some money on the Amazon account, and it appears that some EC2 instances are oversized, so we could save some money there as well, but we have to give it another look next week. We cleaned up some old instances on the Azure account this morning with Damien, and we also have to do that exercise on the Amazon account, probably next week.

I have to say we have to wait for the end of the month for the Cost Explorer and AWS reports, because right now most of the cost comes from data transfer. So we have to fine-tune, because the cost increase on EC2 instances alone, if I look at the past six months, is not that much; it's like less than 1K since January, while the Kubernetes cluster has been used quite a lot in July, though not as much as in August. So right now we are not completely sure about the cost increase, while the amount for bandwidth is still high. And since, as you said, we left Rackspace, we still have the outbound bandwidth from packages. I'm not sure how much outbound bandwidth can be saved, but that's still the first cost center.

However, for the cost of the instances, the first item is related to ci.jenkins.io; I don't remember how we ended up there, but it's the highmem machines. Setting aside the question of whether we are actually using all the CPU and memory of these machines, which is something to be checked, we can decrease the cost by, let's say, 20% by switching to EBS-backed instances, because right now the instances we are using for highmem have two NVMe SSDs, and the IO on these machines is absolutely not saturated. Given that these instances are reused a lot by a bunch of builds, we could switch to the same size of instance but with EBS storage: it's about 20% cheaper for the same kind of CPU, the same memory, the same memory bandwidth; only the IO differs.

I would also like to check the way these machines are used on ci.jenkins.io, because it sounds like they are advertising labels like 'linux' in addition to 'highmem', and we don't have an exclusive policy, which means any job asking for the 'linux' label can end up on the highmem machines and keep them up. I understand that reusing the machines is interesting because of the caching once they are started, but are we billed per second or per minute? No, it's per minute for these machines. So it's not that much. But yeah, there are some jobs that should use smaller machines, let's say tinier ones. So the goal is first to change the instance type, and then to make the highmem label exclusive.
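As a hedged sketch of what that exclusive policy would mean for job authors (label names taken from this discussion; treat them as assumptions), the intent is that only builds explicitly asking for the big machines get them:

```groovy
// Jenkinsfile sketch: a memory-hungry build explicitly opts in to the
// big machines. Once the agents behind 'highmem' are marked exclusive,
// only jobs whose label expression matches them are scheduled there.
pipeline {
    agent { label 'highmem' }   // lands on the large-memory nodes
    stages {
        stage('build') {
            steps {
                sh 'mvn -V -ntp clean verify'  // the kind of fat build that needs the RAM
            }
        }
    }
}
```

A job that merely asks for `agent { label 'linux' }` would then no longer land on, and keep alive, one of the expensive machines.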
So you should get these machines only if you need them, and they will be recycled, while we try to keep the tiny instances for everything else. That's the initial hypothesis, but maybe I'm missing something, so if you have a question, advice, or something that bothers you in this hypothesis, don't hesitate. And overall, if you're interested in looking at that account with us to investigate ways to reduce the costs, feel free to speak up as well. So that's all for the costs at the moment.

Next point: I wanted to talk about packaging images. I'm not sure if I was the one putting that topic here. Damien, do you remember? I think it was related to the Docker images with JDK 11. Yeah. Or is it related to Eclipse Temurin and things? I thought we wanted to talk about building Docker agents using Packer. Yeah. I don't remember what the context of that point was. Okay.

So, if it's about Packer: the goal is to stop building the Docker inbound agents from Dockerfiles, and instead use the same scripts as we are using to build the EC2 templates or the Azure templates. The idea is that whether your job runs on a container, say a Maven 11 one, or on an Ubuntu virtual machine, you will have exactly the same way of installing the JDK, Maven, and all the shell tools that are inside the image. The rationale behind that is that each time we want to change something, a version of Maven for instance (we have the new Maven 3.8.2 that was released last week), we have two locations to change it in, and the installation of Maven differs between the two, depending on whether it's a virtual machine or a container. For the user experience of a plugin maintainer, we don't want them to have a bad surprise if they switch between container and virtual machine, especially if we want to add temporary capacity, change cloud, or swap the agent implementation. Since Packer is able to build Docker images, the idea is to put all the definitions there. That should also provide an improved Docker pull experience, because there is not much cache reuse on our images today: each time a new image is built and downloaded, there is no cache reuse, especially with all the variants and the differentiation between them. So the goal will be to have more all-in-one templates. Thanks, Damien. Any questions?

The next topic is, yeah, I just took notes: IBM is interested in participating. I don't think we have much to say here; we already mentioned that in the announcements.

The final topic that I want to briefly cover: we had a meeting with the Linux Foundation several days ago to discuss the LFX Security v2 platform. We don't have access to the dashboard yet; there's something coming. I asked on Discourse who was interested in participating. It appears that people with the right permissions on the jenkins-infra GitHub organization will have access to the dashboard automatically, so we don't have to provision user accounts or anything; we just have to wait for the Linux Foundation to tell us that the platform is available. So as of today, we don't have anything to do. I already configured the jenkins-infra organization to send data to the platform. But yeah, nothing to report on that one.

So, I think I have the right permissions; I think what you said is that I will then be able to see the LFX Security v2 dashboard when they've enabled it for us? Yes, but you will only have access to the Git repositories that we configured. As of today, I don't know the number by heart, but it's around five.
I just selected a subset of Git repositories so we can experiment with the tool. So, we are running out of time; I'm sorry about that. Any quick questions before we close this meeting? No? Then thanks, everybody, for your time, and I'll see you on our channel next week. Goodbye.