You're up, Damien. Yes. So hello everyone, welcome to the Jenkins infrastructure team meeting for the first week of August. Let's start with a few announcements. The weekly 2.305 has been released, so congratulations to the release officer and everyone involved. I did not take part in that, so Mark, can you confirm there was no issue on the infrastructure side for that release? I haven't double-checked by running the weekly release checklist yet, so I still have to check that the Docker image arrived. I assume that you and Tim were discussing the Docker image build, but I haven't checked to see if it's there yet. So, whoops, note for Mark: run the weekly release checklist. The update was 20 minutes ago, you saw it on the website, so that's good. While we're here, I'm just going to go ahead and check Docker Hub and see if we have the tag yet. It's there. I cannot confirm that we have all the expected tags, but at least we have the weekly tag 2.305: slim, alpine, 11, 7, Selma Linux, and Russia. Excellent. Okay, so that sounds good. That means there was no publication issue, so the last elements that we tackled with Tim have been fixed. For the record, depending on the kind of hypervisor, VM, and cloud provider, when you enable QEMU, the setting may or may not be persisted across reboots. So, to be sure, Tim now force-enables it again before any pipeline, and it's okay. If the release went through successfully today, it means that Tim's fix has been successful. So yes, thanks, Tim. So Damien, does that mean we've done a multi-arch release as well, or not yet? This is still single architecture, not yet. However, the multi-arch images have been built, tested, and rebuilt on the publication environment, but not pushed, if I am correct. About the publication I'm not completely sure; I need to remove the dust from my Raspberry Pi testing. Okay, all right. So multi-arch images have been built, but not yet published. Exactly. I need to check the code or ask Tim for confirmation. Let's say that will be a goal for the team meeting next week; that's information I'll share then. And maybe the next weekly could enable pushing this new experimental architecture. That would be exceptional. That's been on our roadmap for a long time; it would be a great accomplishment. Now the second announcement: we have to prepare for the next LTS release, which will be at the end of August. That one might be a bit tricky. Mark, I'll let you chime in if there are elements related to the infrastructure, but JDK11 is the main one as far as I know. Right, yeah. So we're changing to JDK11. The plan is JDK11 as the default JDK for Docker images that do not specify a specific JDK: for instance latest, lts, lts-alpine, et cetera, 2.282.302.1-slim, all examples of that. And so, a question from me: does it mean that we keep the old JDK8 images still being built, but with the -jdk8 suffix? Is that correct? That is the proposal. That was what was discussed at the contributor summit, and that's the phrasing in the draft Jenkins Enhancement Proposal that I hope to submit today. I'm way behind on getting that submitted, and I apologize sincerely to everyone for that. But Tim Jacomb has thankfully looked at it, and others have discussed it with me, so we believe we're okay with that plan. I'll publish it and we'll see where we go from there. Okay, so JDK8 images should be kept, but with an explicit suffix. Yeah.
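On the QEMU point above: the transcript does not show Tim's exact fix, but a minimal sketch of force-enabling the emulators before each multi-arch pipeline run, assuming the commonly used tonistiigi/binfmt helper image, looks like this. The command is idempotent, so running it as the first pipeline step is safe even on hosts where the registration did survive the last reboot.

```bash
# Re-register the QEMU binfmt handlers for all supported foreign
# architectures; harmless if they are already registered.
docker run --privileged --rm tonistiigi/binfmt --install all

# Sanity check: the emulated platforms should now be listed on the builder.
docker buildx inspect --bootstrap | grep -i platforms
```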
And your performance improvements are crucial for that, because it means we're increasing the number of images that we're actually building. We previously built only alpine for JDK8, for example, and now we'll be building alpine for both 8 and 11. Thankfully, Docker Buildx bake should help us keep the performance reasonable. I think the folks at Docker are the ones we need to congratulate for that tool. We do; we need to thank them profusely for that. And okay. So if you see this recording and you are already using the Docker images, you can start as of today by adding the -jdk8 suffix if you want to keep using that image; pinning explicitly like that is always a good practice. Don't wait to be caught out by the change: you can already switch the tag, because the images are the same, they're just aliases. So you can fix your dependencies right now, and then take the time to switch to JDK11. If you need some additional time, you won't be forced by that change from the community. Yeah. Good insight, thank you.
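Since the suffixed and unsuffixed tags are, for now, plain aliases, that claim is easy to verify; a quick sketch, using the lts tag pair as an example (any of the published variants works the same way):

```bash
# If the tags are aliases, both references resolve to the same image ID.
docker pull jenkins/jenkins:lts
docker pull jenkins/jenkins:lts-jdk8
docker image inspect --format '{{.Id}}' jenkins/jenkins:lts jenkins/jenkins:lts-jdk8

# Or compare the manifests on the registry without pulling anything:
docker buildx imagetools inspect jenkins/jenkins:lts-jdk8
```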
And that's all that I had. There are certainly other changes; the changelog has not been generated yet, but we'll likely do that in next week's documentation office hours with Dheeraj Joda, Meg McRoberts, and Kristin Whetstone. Okay. I have one more add-on for the infra side: we need to audit all the changes we made to the whole release process, including both the normal process and the Docker image publication. Because, as far as I remember, for the latest LTS we forgot some changes that had been done on the weekly, and so we had to cherry-pick. I think it was you, Mark, who opened the issue during that LTS release. It's not really time consuming, but still, it would be a good target for the infra team to audit any change that could help. I'm thinking specifically of the pod templates and the publication of multi-architecture images. Right. So the Docker images we build from a single branch, so that should be okay, because the Docker build processes do both LTS and weekly from a single branch. However, your point is correct that for the Jenkins core release, we need to be sure that the release and package repositories have been kept up to date, and that's a checklist item that I have not added. So let me give myself an action item: add it to the Jenkins release checklist, because it should be there. I had said I would do it, and then I failed to do it. No problem. Don't hesitate to ask if you need help on that one, and don't hesitate to delegate; I'll let you be the judge of that. I am happy to take care of it, particularly because it's one that I said I would do; I should do what I said I would do. No problem; there is no shame in asking for help or asking to delegate this. Great. Okay, so these were the announcements. Now, on to the weekly activity of the infra team: this is the nuts-and-bolts part. First of all, progress on Docker, following the recent changes, Docker Buildx et cetera, that we discussed last meeting; if you're interested, look at the previous recording. This week, Tim started to work on the Docker agents, and I'm helping him on that part. The idea is to use Docker Buildx as the default builder for the jenkinsci/docker-agent repository, which is behind the base image jenkins/agent on Docker Hub. That image is the foundation for the inbound and outbound agents, the former JNLP agent, and a bunch more images. It's the base image that has Java installed on a bunch of different Linux distributions. So Tim did a cleanup on that one, enabled Buildx, and enabled everything we did on the controller image, in particular the parallel tests. The goal is to shrink the build and test time for Linux. We are one test away from merging that part, so it will be finished by the end of the week; the remaining failure is only a side effect of the parallelization of the tests, and we have identified the issue. We see exactly the same kind of result as on the controller: the time for the build part alone has been shrunk fourfold, from four minutes to one. The impact on the tests is not as big, only about 10%. But overall, the build and test of all the platforms on Linux is now at four or five minutes instead of 20 before, so the same kind of improvement overall. Windows will be the next part. It also helps to clarify, as we did for the controller, the list of supported images and tags, so the switch to JDK11 as the default will be easier and we are less likely to miss one. And we used that cleanup to remove some unused or unmaintained Windows images of the agent, such as nanoserver on JDK8 or the whole Windows Server Core set on JDK8. JDK11 still has all these variants, but on JDK8 they have been removed. So here we are. I think the next step, next week, will be the inbound and outbound agents. A discussion has been started on IRC about maybe merging all the repositories. Right now I'm collecting the knowledge from Oleg and from other former contributors, which will include you, Mark, to confirm the reasons why we split the repositories and see whether there is a compatibility constraint or not. The goal would be to have all the Jenkins agent images in the jenkinsci/docker-agent repository, and to archive all the other repositories in the future, kept as knowledge with a message saying: okay, look at this one instead. That way we could benefit from a single build process that we don't have to duplicate or update everywhere. So less maintenance pain, an easier contributor setup, a lot of benefits, and a centralized configuration of all the images we have, which helps especially when there is a security issue. Don't worry, we won't take any decision before starting an email thread and a Discourse topic to collect advice and see what the community thinks. Right now we are just trying to understand the reasons, so we can then see whether it makes sense to push that subject. That's all for the Docker progress; I don't know if you have questions. Okay. So for me: is there a potential that the single agent repository will also gain us even further performance improvements, because of common components between those agent images? Yes, because Buildx is able to understand the dependencies between images. There is a keyword that allows you to say that this image depends on that image, and it also supports multi-stage Dockerfiles. Since we are using a bit of both across our use cases, both are supported, so Docker Buildx can see the whole tree and we can release everything at the same time when we change the JDK, for instance. So the time between a change being requested, let's say a new OpenJDK release, and the moment when all the images are updated should shrink by merging everything. That's the idea; we still need to confirm it in practice.
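A minimal sketch of how such a dependency tree can be declared for `docker buildx bake`; the target names, tags, and file layout here are illustrative, not the actual jenkinsci/docker-agent configuration, and the contexts keyword assumed here requires a reasonably recent Buildx:

```bash
cat > docker-bake.hcl <<'EOF'
group "default" {
  # `docker buildx bake` with no arguments builds this whole group,
  # resolving the dependency on "base" first.
  targets = ["jdk11", "jdk8"]
}

target "base" {
  dockerfile = "Dockerfile.base"
}

target "jdk11" {
  dockerfile = "Dockerfile"
  # Explicit dependency: "base" in this Dockerfile's FROM line resolves
  # to the image produced by the "base" target above.
  contexts = {
    base = "target:base"
  }
  args      = { JAVA_VERSION = "11" }
  tags      = ["example/agent:jdk11"]
  platforms = ["linux/amd64", "linux/arm64"]
}

target "jdk8" {
  dockerfile = "Dockerfile"
  contexts = {
    base = "target:base"
  }
  args = { JAVA_VERSION = "8" }
  tags = ["example/agent:jdk8"]
}
EOF

docker buildx bake   # builds base once, then both variants
```

With this shape, bumping the JDK in one place rebuilds every dependent image in a single invocation, which is the "release everything at the same time" effect described above.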
Yeah, that sounds very promising, then. Thank you. Thank you, Damien, and congratulations to you and to Tim. What a great outcome, and I'm looking forward to more. This is really wonderful, and I agree: the declarative syntax specifying which things get built where made handling the tags much easier. I understood it, I could make the addition, and it just worked. Brilliant. Then the next topic is the replacement of the archives.jenkins.io machine. Good job, Olivier. Olivier is not available this week, but he did a bunch of work on that part. It's now running on Oracle Cloud on an ARM-based machine, which is good enough for the role here: serving downloads and rsync. The synchronization of the artifacts has also been improved by Olivier, so we should see a shorter delay between a plugin being released and the moment it's available on archives and then on the subsequent mirrors. We don't control the synchronization frequency of the external mirrors, but at least archives will be updated more often and faster. The goal of that machine is to decrease the cost of the infrastructure, given that the bandwidth cost changes completely by changing cloud. The old machine on Rackspace is still up; we should be able to stop it in one or two weeks. For the record, that machine was created with Ubuntu 12.04. It was a nice machine, but it's time to retire it. Yes. One last thing on that part: all that work from Olivier allowed us to validate the new process for the Puppet code in the jenkins-infra/jenkins-infra repository. We don't use the staging branch anymore. We explained why during the previous meeting: we don't have a staging environment to validate the changes, so it was only slowing us down without testing anything. So we decided to deploy to production in a faster way; if we break something, we are able to deploy the hotfix much faster and with confidence. There is still some work to improve the test harness, and we are working on that. But Olivier sent, let's say, four pull requests and was able to deploy all four to production the same day, which is far more frequent and faster than what we used to do, and it allows us to be way more responsive. Secondly, we started to update, as we go, a bunch of the dependencies we use in the Puppet stack, after upgrading the Puppet master to the latest LTS minor version. We have a bunch of gem dependencies on that repo, so we are updating them as we go, especially the Puppet modules. Olivier did a brilliant job starting that part too. We are still struggling with the Serverspec parts, which is why staging and acceptance tests were not there, but we are working on that and preparing the upcoming Puppet and Hiera upgrades. So not only did we validate something real and make archives cost less money, we also used the opportunity to improve our process. Congratulations, Olivier: that was quite a huge amount of work, and it has been well documented and the knowledge shared. That was really a good team effort led by Olivier. Yes. I did have one question. Oh, I thought someone raised a hand too... ah, that was a clap. I had one question. I was accustomed, for a brief period at least, to seeing archives.jenkins.io in the list of mirrors provided by the mirror stats.
And when I looked recently, it's not in the list of mirrors any longer. I wasn't sure if it's intentionally not in the list, or accidentally not in the list. It used to be, at least for a brief period, at the very bottom of the list, so if no other location had a file, we would seek it on archives.jenkins.io. Any guidance there, Damien? So, about the fact that it's not visible on the mirror list: Olivier told me that this is the current intended behavior. We'll confirm that once we have stopped the Rackspace machine. I understand that the change to be made here might be hard to roll back, so that's why we want to be sure that we won't have to roll back. However, archives is still used as a fallback: if no mirror has a file, it will be used as the reference. That's why it sometimes appeared on the mirror list before Olivier excluded it from the mirror system. But yes, it can still be used as a fallback. Thank you. Okay, so Olivier has discussed it with you; there is thought behind it, and more when he returns. Exactly. I think we still need to catch up with him to have a precise answer when he's back, but he gave me that information so we don't have to worry while he's out. Excellent, thank you. Okay, that's all for me on archives.jenkins.io. Thanks very much to Olivier for making that transition, and I'm excited to see the comparative cost savings, measure those numbers, and see results. Now the next one is ci.jenkins.io. Since the past week, we were able to deliver the configuration as code as planned; we just had to do it a bit earlier than expected. I don't know who the culprit is and I don't want to know, but I think we, the team or the people with admin access to ci.jenkins.io, made a mistake. I'm not sure if it was a plugin upgrade or someone messing with the agent and cloud configuration, but the result was that before the previous weekly release, there was no EC2 and Kubernetes configuration left on ci.jenkins.io. I think it was a plugin update that did not finish, but I'm not sure, because the configuration had been deleted by Jenkins from the XML files. So we had to deploy the CasC configuration support in Puppet, which allowed us to reuse the configuration export that had been taken the day before. We did not lose anything; it acted as a kind of backup. And so, in less than one hour, we were able to bring ci.jenkins.io back with the correct agent configuration, validated by Tim. We have done one subsequent pull request to update the references of the machines related to the Docker builds: we had to update the VM templates, so Tim updated the file, we merged it on jenkins-infra, and it automatically deployed the new version on CI after a reload. And thanks to Tim's careful review, we finally implemented a CasC reload, because initially we were doing a Jenkins safe restart. The inconvenience of a safe restart is that the UI and the webhook endpoints are not available for 20 to 30 seconds on CI. That's quite fast, but it's still unavailable. Now, if we change the agent configuration, the configuration-as-code reload does not stop the service, at least for the agent scope; I don't know about the other CasC scopes, and this has to be verified. But that is really useful, because we can update without being scared of putting ci.jenkins.io down, which is a good improvement.
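A sketch of triggering such a reload without a restart, using the token-based endpoint documented by the Configuration as Code plugin; the transcript does not say which exact mechanism the team wired into Puppet, and the hostname here is illustrative:

```bash
# Requires the controller to be started with a reload token, e.g.:
#   java -Dcasc.reload.token=SOME_SECRET -jar jenkins.war
# A POST to this endpoint then re-applies the CasC YAML in place,
# without taking the UI or the webhook endpoints down.
curl -X POST "https://ci.example.org/reload-configuration-as-code/?casc-reload-token=SOME_SECRET"
```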
So the next step, now that CasC is in place: first, I have to check with the security team that the process they follow when they update ci.jenkins.io is in sync with the changes we did and the changes we plan to do. We need to be aligned on that part, to avoid someone thinking they are able to change the configuration and then their changes not being persisted by our system. So we need to double-check that everything is okay and see if there are some points we forgot. The second step is that Tim and I are going to start working on moving part of the workload from ACI to Kubernetes. Why is Tim involved in that part, when initially it was not in scope? Because he saw a bunch of errors on builds during the previous days on ci.jenkins.io. When launching or rebuilding the BOM, it tends to start a lot of agents, and most of those agents were ACI and were struggling for CPU, mostly because these are shared machines. So, since the ACI agents are containers, our idea is to start with some pull requests of the BOM, which is a great candidate: it's one of the heaviest builds, so if it works, all the other builds should work. We want to experiment with the BOM on a specific pull request that will use Kubernetes agents. Our goal is to take the ACI configuration and translate it into pod templates with the same Docker images, and run those on the current AKS cluster. We might have surprises in terms of performance, however; we don't know, because the machines are static. We only have three machines and no autoscaling yet, because we decided to have a static capacity to avoid bad surprises in terms of budget. So it's a step-by-step process. I understand that it can be frustrating for developers, so sorry if it slows you down, but our goal is also not to break the existing developer workflow. We want to start with specific and surgical pull requests, and then we're going to grow from there. The reasoning is that if it works with Kubernetes, and the only issue is the capacity of the cluster, we now have the DigitalOcean and Scaleway sponsorships, so we can add two new Kubernetes clusters, also static, but on different providers, and start mixing the workloads. So right now the goal is a specific pull request with specific labels; that means the other builds won't benefit from it yet. Is there any question on these topics? So, Damien, for me there was one. The decision, then, is to switch from using a label-based configuration to using pod templates. And I'm used to thinking, in my simple mind, of how easy labels are and how hard it is to express a pod template. Can you tell us more about what we gain by switching to pod templates instead of just using simple labels? So, the pod template won't be the job of the consumer of the pipelines. The pod template can be defined at the administration level in Jenkins, or in a shared library, depending on the use case. But we, the infra team, are still responsible for providing these predefined pod templates as a service, like we provide the ACI agents. And they can be associated with labels; that means a pod template is used when its label is requested.
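A hypothetical JCasC fragment, for instance shipped by Puppet, showing what such an administration-level pod template can look like with the Kubernetes plugin; the names, label, image, and resource sizes are made up for illustration and are not the real ci.jenkins.io configuration:

```bash
cat > casc-kubernetes.yaml <<'EOF'
jenkins:
  clouds:
    - kubernetes:
        name: "kubernetes"
        templates:
          - name: "maven-jdk11"
            # Pipelines only ever reference this label; the pod definition
            # itself stays under the infra team's control.
            label: "k8s-maven-jdk11"
            containers:
              - name: "maven"
                image: "maven:3.8-openjdk-11"
                command: "sleep"
                args: "infinity"
                resourceRequestCpu: "2"
                resourceRequestMemory: "4Gi"
EOF
```

A consumer's Jenkinsfile would then just declare `agent { label 'k8s-maven-jdk11' }` and never touch the pod definition itself.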
One thing I don't know, and I'm not even sure it's possible, but I'm interested in finding out: if we have ACI and Kubernetes providing the same label, what is the rule? Does Jenkins try to pack builds first on ACI and then on Kubernetes? How does it work? I assume the algorithm is close to "try to reuse, as much as possible, whatever has succeeded before", so ACI would still have some weight. That's also why we want to start with specific labels first: we will define pod templates at the administration level with labels distinct from the existing builds, and then the pull request will use these specific labels, to be sure that we only use these pods for this PR. Then we will decide based on the results. Excellent. Okay, thank you. That clarifies it for me: the responsibility to define the pod templates remains with the infra team, and everyone else just consumes those templates, without having to define them and learn all the complexities of what it means to define a correct pod template. Exactly. The goal is to use an abstraction layer here, either a pipeline library or Jenkins configuration as code, so developers don't have to care about that part. However, developers are already able to use pods if they can edit the Jenkinsfile; if you are a maintainer of a plugin, you can already try your own. Well, yeah, but in that case I have to be a maintainer, right? Because Jenkinsfile changes from a mere pull request submitter who is not a maintainer are ignored until after the merge. Okay, great. And we have Falco running, and we are going to add more and more restrictive rules on the cluster. For instance, and I think this is going to be the next step, we will have an allow-list of the images that can run; if you want to try some arbitrary new image, you won't be able to. We already disabled some things: you cannot run a pod as privileged, for instance. And we will add many more security measures on the cluster. You might still be able to use custom images or customize things, but there will be a limit to the customization, because it's a shared instance and we don't want everyone doing whatever they want. Right, yeah. So is that a technique that helps us somewhat guard against somebody, say a malicious maintainer, deciding to use an image that is a crypto miner or something like that? Exactly. Okay. Also, these clusters are stateless, using AKS or whatever, so the idea is that weekly the cluster will be completely thrown away: the machines will be destroyed, and new ones will be initialized from CasC. But we still have to discuss that. We should also be able to efficiently implement credential rotation. So this is a multi-layer situation, and we have to sync with the security team as well, to see what the requirements are: what they feel is mandatory, what is important, and what might be less so. Because sometimes something can seem important from our point of view as maintainers of the infra, while other topics are more concerning for them. That's why we have to sync with them. All right. And finally, I started today to work on something we discussed: compatibility with Terraform 1.0. That includes the Datadog and AWS Terraform projects. As of today these are tiny projects, so it's mainly module updates, a bunch of syntax fixes, and preparing a new version of the Docker image we use for running Terraform on our CI. The goal is to be ready to start resources on DigitalOcean and Scaleway with Terraform 1.0, whether mirror machines, Kubernetes clusters, or any other resources in the future. Why Terraform 1.0? Because we are only two versions away.
There should not be that many changes, and 1.0 is a kind of LTS: if you have a Terraform project compliant with 1.0, the Terraform syntax should not change for the upcoming years. That's the reason we need to make that effort, to be sure the maintenance can then be spread across multiple years. Okay. And that's all for me, unless you have questions; I'm six minutes late already. No questions from me. Oh, I mean, I had a question, but it is absolutely fine if we take it asynchronously on IRC as well. On the point about the replacement of archives.jenkins.io, you wrote that the synchronization of the archive has been improved. I wanted to know how the infra team makes sure that the synchronization is in place, the architecture behind it, but it is absolutely fine to take this asynchronously or answer later. So, that's a set of shell scripts that are available in jenkins-infra. There are two synchronizations: one that runs every 15 minutes and triggers an alert if it has not run, and a full synchronization of everything every hour or every three hours, I don't remember exactly. Everything is based upon rsync. Each time there is a new distribution on pkg, so a plugin is updated, or a core or whatever is updated on the reference, there is a file with a Unix timestamp. So if you want to check, you compare the timestamp from pkg to archives to any other mirror. And this is how our monitoring works: if after, let's say, a few hours the timestamps are still different, it means that something is wrong, and we receive an alert so we can react. So it's, let's say, rsync, shell, and timestamps in text files. Thank you so much. No problem. You can also go to the jenkins-infra repository; don't hesitate to look at the code, specifically the part for archives. It's a Puppet manifest, and you will see all the shell scripts listed there. Okay, I will do that. Thank you. You're welcome. If there is any missing documentation, don't hesitate to raise your hand on IRC, and Olivier and I will try to complete, write down, and improve it. Okay. And that's all for me. And that's all for me as well. Damien, thanks very much. We're ready to end the meeting and call it done for today. Yes. All right. Thanks, everybody. Thank you. Have a good day.
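For reference, a minimal sketch of the timestamp comparison described above; the TIME file name and the URLs are assumptions for illustration, since the real scripts live in the jenkins-infra Puppet manifests:

```bash
#!/bin/bash
set -euo pipefail

# Each distribution point publishes a text file containing a Unix
# timestamp; the mirrors are considered healthy when those stay close.
reference=$(curl -fsSL https://pkg.jenkins.io/TIME)
archive=$(curl -fsSL https://archives.jenkins.io/TIME)

# The incremental sync runs every 15 minutes and a full pass roughly
# hourly, so only alert after a few hours of divergence.
max_lag=$((3 * 3600))

lag=$((reference - archive))
if [ "${lag#-}" -gt "$max_lag" ]; then
  echo "ALERT: archives is ${lag}s behind pkg" >&2
  exit 1
fi
echo "OK: archives lag is ${lag}s"
```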