Hello everyone, welcome to the Jenkins infrastructure weekly team meeting. Today is the 25th of July 2023. Last week we didn't have the weekly meeting because it was a holiday for a lot of us, so today will be a two-week milestone conclusion. The next weekly meeting should be next week, the 1st of August; as far as I can tell, everyone should be there. So who do we have today? We have myself, Damien Duportal, Hervé Le Meur, Mark Waite, Stéphane Merle and Bruno Verachten. Looks like Kevin isn't here; I hope his back is feeling better. Let's get started with announcements. Today's weekly release did not happen, but that was expected: tomorrow, as we'll explain later, we'll have a security advisory on Jenkins core, so the main branch is locked. That was announced publicly yesterday, I think on the mailing list. We try not to publish a core release right before a security release, to avoid anything going wrong. So no weekly release today, as planned; see below. Okay. Do you have other announcements, folks? I don't have any, except tomorrow's security release, but that's part of the calendar. Okay. The next weekly should happen next week, the 1st of August; I don't remember the number, that should be 2.418 or 419. So tomorrow's is 2.416; the next weekly will be 2.417. 2.416 tomorrow, then 2.417 next, on the 1st of August. Right. Cool, thanks Mark. Next, yes: so 2.401.3 tomorrow. 2.401.3 tomorrow. And 2.414.1 on August 20. Sorry, I misheard "fourteen". That one is on August 23rd. So yeah. 2.414.1, correct, the 23rd. Okay, I need to train my ears to English again after the holidays, sorry. Vingt-trois (twenty-three), isn't that right? Vingt-trois. It's difficult. But I heard people were also thinking of 2.415 becoming the LTS; finally your decision is made and it will be 2.414? Okay. Well, the discussions were really good on that topic, because 2.416 will include a security fix that must be backported, and there was an issue that James Nord detected that will result in a backport as well. So there will be some backports to 2.414.1. That's good; we're really proud that people are finding things that need to be backported. That's healthy. Oh, I missed this discussion; thanks, folks. Okay, so we will have to check a bit more to see if we're not impacted. Yeah. What happens is that the fixes will be applied to weeklies, and then, as we get to the August 9th release candidate date, those fixes will be backported from the weekly to the 2.414 stable branch. Okay, so August will be a busy month then. So, the announcement: tomorrow we will have a high-severity security release, as announced. Yeah, the 26th of July, confirmed for tomorrow. Since it's tomorrow, we should try as much as possible to not release anything else on that day, and we will. So I need someone to help prepare; that can be me, but it can be someone else. We need someone to take care of opening the usual: a status.jenkins.io message, because ci.jenkins.io will be down during the restart. And we also need, unless someone volunteers, I will take care of the checks tomorrow morning in Europe, and report back to the security team saying: okay, our systems are green, we can proceed. Okay, so we need the status page. I'm happy to open the status request; I'm happy to open that tonight, but I'm not willing to wake up in the early hours of the morning to do the things that need doing tomorrow morning. Okay.
Let's check the platform. Stéphane, is it okay for you if we pair on doing these checks tomorrow morning? With pleasure, to make sure that I learn the process. Yes, and we report to the Jenkins security team before. Okay, that's all we have. Then we will have to follow the updates: update all the images, there will be an issue for it, and we'll deploy. So the goal I propose, the challenge, is that by tomorrow end of day in Europe we will have upgraded all of our controllers. And when I say "all", it's because both release lines are covered by the security advisory. Looks good to you? No questions. Next: major events. I haven't tracked this, so if you know of any upcoming major event where some team member will be present, I don't mind you sharing it. I don't know what the plans are, Mark; do you have events in mind? Let's see, September... No, nothing we should put there. The DevOps World tour begins in September, October, and I think even into November, but that can be listed later separately. So let's get started on the tasks that have been done during the past two weeks. We had a user blocked by the anti-spam system, so we created the account manually. I've closed the issue, but as Hervé mentioned earlier today, it's worth checking the username in the Datadog logs of the account app to see what error message could have triggered the anti-spam. The goal is to be sure of the reason why the anti-spam system blocked that person. Let's check Datadog. Hervé, are you okay to check the logs for that user, or one of the others, because there have also been other issues this week? Is that okay for you, to check this for the account app, even though the account has been created? The goal is just to be sure that we don't have a regression in the account app, or a new kind of issue that we haven't faced before. No questions on that topic.
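For reference, the Datadog check Hervé will run could look roughly like this minimal sketch against the Logs Search API v2; the service name and the username query here are assumptions, not the real account-app log schema:

```python
# Search the account app's Datadog logs for events mentioning a username,
# to find which error message triggered the anti-spam system.
import os
import requests

DD_SITE = "https://api.datadoghq.com"
headers = {
    "DD-API-KEY": os.environ["DD_API_KEY"],
    "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
    "Content-Type": "application/json",
}
body = {
    "filter": {
        "query": 'service:accountapp "blocked-username"',  # assumed facets
        "from": "now-15d",
        "to": "now",
    },
    "page": {"limit": 50},
}
resp = requests.post(f"{DD_SITE}/api/v2/logs/events/search", headers=headers, json=body)
resp.raise_for_status()
for event in resp.json().get("data", []):
    # Print the raw message to spot what triggered the anti-spam rule.
    print(event["attributes"].get("message", ""))
```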
Okay, forum management. That was a request from Alex about the Discourse community.jenkins.io setup: he wanted a new category, and asked for a bit more permissions on the token, and that has been done. Any administrator of the community Discourse can do it; at least Alex and Olivier Vernin are administrators, I don't know for the others. But everything was done, so unless you have questions, I am going to move forward. Okay. Rename of the pipeline log repository, from the CloudWatch plugin developers. Wow, that's quite a deep pipeline. Also, thanks to the people who took care of that: that was renaming a repository in the jenkinsci organization, and I assume the corresponding change on ci.jenkins.io. Next topic: cannot download the Jenkins WAR file and plugins. I don't remember the exact error for that user. Okay, so that person had issues due to their internal firewall, probably due to the public IP changes that happened during the Kubernetes upgrade two weeks ago. Or maybe not, but that would be one of the cases where, since we changed the public IPs, their firewall keeps trusting the old IP, which has been trashed, while the new IP isn't in their system yet. That's why we need to take care of communicating about, and avoiding, changes to these IPs. Maybe blame the guy who broke production. You were testing your ability to rebuild everything, yeah, from zero to fully working production in less than a day. Finally, the last closed item. Oh, thanks, Mark and team, for taking care of that. That was the Jira setup, is that correct? Yeah. Okay, so I understand, Mark, you held off on that setting to keep the sanity of our users; is my understanding correct? It was self-preservation, because I hated the default as much as everybody else did. So that was the craziest default. Yeah, it was pushing me over the edge having to put up with it. Thanks for taking care of this one. We also had five issues closed as not planned: a user trying to create a Jenkins account on accounts.jenkins.io, which makes sense on paper, but either it was for testing Jenkins or it was blocked by the anti-spam system; it's not always clear. Thanks for taking care of this, Mark. I think it's related: we have at least two cases with the anti-spam system as far as I can tell, and we might have more, so let's check if there is something wrong or what the reason is. My guess is the public IP, but I don't remember exactly. So, let's see. Questions on these topics? Okay, so: work in progress. Most of the work done during the past two weeks was on these tasks; some are almost closeable. Let's take them one by one. A report about the upgrade to Kubernetes 1.25, because we didn't meet last week. During the beginning of that milestone, we planned to upgrade to Kubernetes 1.25 with one cloud provider per day. Everything went properly until the last one, which is the public cluster. The upgrade in itself went very well, but as part of it, the change of node pool, combined with the particular network setup we had, put the cluster in a state where it wasn't able to reach some IPs, causing some mayhem. When we tried to fix it, a combination of bad luck and not enough reading and taking time led to the full destruction of the cluster. So we had to recreate the cluster from scratch, in the middle of the day. The first positive thing is that all the work Hervé and Olivier did during the past years on the public cluster, with automation and everything as code, showed that we're able to recreate it from scratch, including restoring the LDAP backup. I have to admit I was quite freaked out by the deletion of the LDAP persistent volume. It was recreated, and as part of that recreation we realized that the public IPs were changed automatically. That is a subsequent topic: thanks to Tim, we saw that we could have a long-term solution; in the short term, we have locked the current public IPs in the Azure API. They cannot be deleted, even if we do the same operation of deleting the cluster and recreating it. Under the hood, it's because Azure has a kind of namespace in its API, named resource groups. When you create an AKS cluster, you get two resource groups: one that you define, where the control plane and some top-level resources are created, and where the public IP should have been, the one that you manage; and a second one, created and managed by the cluster itself, where the node pools, the virtual machines and the associated resources are created. When you delete the cluster, the cluster cascade-deletes that child resource group. And we were under the impression that the public IP had to be in the same resource group as the node pools. As Tim showed, we have ways of keeping the public IP in another resource group, with the correct annotation in the Kubernetes setup, which we didn't know back then. So when we deleted the cluster, the public IPs were deleted. They have been recreated in the same resource group, because we had to recreate the cluster as close as possible to the former one.
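For the curious, the annotation Tim pointed out looks roughly like this on a Service; a minimal sketch with placeholder names and a TEST-NET address, not the real production manifest:

```python
# Tell the Azure cloud controller that the Service's public IP lives in a
# resource group we manage, instead of the AKS-managed node resource group,
# so deleting the cluster no longer cascade-deletes the IP with it.
import yaml

service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {
        "name": "public-ingress",
        "annotations": {
            # Resource group holding the pre-created public IP (assumed name).
            "service.beta.kubernetes.io/azure-load-balancer-resource-group": "prod-public-ips",
        },
    },
    "spec": {
        "type": "LoadBalancer",
        "loadBalancerIP": "203.0.113.10",  # placeholder address
        "ports": [{"port": 443, "targetPort": 8443}],
        "selector": {"app": "ingress"},
    },
}
print(yaml.safe_dump(service, sort_keys=False))
```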
But now we have an upcoming issue, tracked as a subsequent topic, that could allow us not only to keep the lock on the public IPs, but also to move them to a proper resource group, to avoid blocking the deletion of the cluster in the future. Thanks Hervé and thanks Tim for pointing out these elements, and thanks for the help, Mark and Hervé, on that topic. As a consequence, we had to change the public IPs; a blog post has been published about this to communicate to end users. And this issue is closeable, except that I need to open the issue to plan the 1.26 upgrade, with the date and time, linking it to this one; that's the last mile. All the other improvements that we've seen have their own issues to be tracked, because they are outside the scope of the upgrades, but these are things we should work on to improve the next upgrades or the next production outage. Is there any question, anything I could have forgotten, anything not clear on that topic? In the postmortem, you had put Javadoc back on the ARM64 node pool. Yes. Was that related to the upgrade to 1.25? No, it's because during the cluster recreation, the Javadoc service wasn't triggering a scale-up of the cluster. Maybe it's because we didn't wait enough, or I didn't diagnose it fully, and I started by changing the node pool label to force it back to Intel. Okay, we recreated everything; it was Friday, and the next Monday or Tuesday I fixed the problem by moving it back to ARM64, after scaling manually like you did the first time. My guess is it's because I didn't wait the required time; waiting a bit more would have helped, but in any case we had to rush that day. Okay, good, good. No problem, thank you. So yeah, good points, good question. But it should be in the same state as you left it when you went on vacation. I don't have anything else on Kubernetes 1.25. Anything else, folks? Okay, next one: migrate updates.jenkins.io to another cloud. For this one, during the past week we removed all the Oracle Cloud resources that Stéphane and I created months ago for that topic, because we decided all together, during the beginning of July, that we won't use Oracle Cloud for that service. Hervé, can you share the status, since you are going to take that issue as we discussed; can you give us a summary? Yes. We intend to create an image with an Apache server in it, and create a service on the publick8s cluster, with a custom volume, to serve the .htaccess files generated by the update center script. The script is triggered by the build on trusted.ci.jenkins.io to update the versions in the .htaccess redirections. So the plan is to retrieve these .htaccess files, put them in the Apache2 server configuration running on the publick8s cluster, and redirect over HTTP to the content stored in Cloudflare R2 buckets. A good benefit of Cloudflare R2 is that we can deploy a bucket in China, so this will help users there, and we can also use it with mirrorbits to have a better location for each user. And yes, the main reason we are using Cloudflare R2 is that there are no egress fees; since this is one of our main bandwidth-consuming services, it's a big, big win for us. Yeah, I forgot that; now that you mention it, I had forgotten that bit. No problem. The read-only static setup of those files will also be another benefit of having the service built like that.
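To picture the redirector pattern Hervé is describing: the real setup uses Apache with the generated .htaccess rules, but a toy sketch of the idea, with a made-up bucket URL, would be:

```python
# Keep serving the updates.jenkins.io domain (and its certificate and
# rules) ourselves, but answer with an HTTP redirect so the heavy JSON
# payload is downloaded from object storage, not proxied through us.
from http.server import BaseHTTPRequestHandler, HTTPServer

BUCKET_BASE = "https://example-update-center.r2.example.com"  # placeholder

class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Redirect instead of reverse-proxying: the egress bandwidth is
        # then paid by the bucket side, where there are no egress fees.
        self.send_response(302)
        self.send_header("Location", BUCKET_BASE + self.path)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), RedirectHandler).serve_forever()
```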
Do we have an agreement already from Cloudflare to allow us in as an open source project, or is that in progress? For now, their free tier is more than enough for conducting more tests. Then we will have to assess whether we fit in their free tier for normal usage; if not, we can fill in their open source program form. Anyway, we can do that; I think it can be beneficial for both Jenkins and Cloudflare. Yep, the goal is to be sure that we don't have any technical blocker, because if we start the discussion with them and then realize we cannot use them, that would make no sense. Good reminder. Longer term, and that's a discussion we had with the China project: right now we will start with only one bucket in the US, and we will start moving the service, because the main goal, as a reminder, is to get that service away from AWS to avoid paying five to seven thousand dollars per month in egress fees. Once we have done that, we can decide whether we use Cloudflare R2 further, or at least keep the pattern of an HTTP redirector, meaning Jenkins controllers start by connecting to the updates.jenkins.io web service, which redirects them somewhere else. And that somewhere serves the four-or-five-megabyte JSON file, which is parsed and used to then download plugins, core and tools. The question we had two weeks ago was: is Jenkins able to follow HTTP redirections for that? And the answer, found by Hervé, is yes. Because we have all of these .htaccess files that you mentioned, generated by trusted.ci, we keep control of the domain names updates.jenkins.io and updates.jenkins-ci.org, the certificate and the redirection rules. We keep that control, and then we trigger an HTTP redirection so that the real egress bandwidth is served by a service where we don't pay, while we keep control of the initial service, which is not a reverse proxy. Otherwise we would still be serving the content and paying egress fees; that's the subtle difference. That's not an easy one. With that pattern in place, we can think about improving the life of our users all across the world. We mention China because that's where it's the most visible today, but it could be an improvement for everyone. We could use a mirror redirector like we have for get.jenkins.io, except in that case we must manage and fill the content of the mirrors ourselves. Meaning we could think about having a bucket in US East and buckets in China on the Cloudflare R2 China network, and then we could instantiate a new mirrorbits service for updates.jenkins.io that says: oh, I see you are located in China, so the redirect will be to the update center JSONs served inside the China network. The problem we would have with the existing service is that we don't manage those mirrors, for instance the RWTH Aachen university one, and there are others. So we cannot control when the update-center.json file is updated for real, which is a problem for security. In that case, this would be another mirrorbits instance that allows us to propagate the proper update-center.json to the locations we seek, while we control the filling of these elements. So trusted.ci today generates the contents and then runs an rsync to the virtual machine. In that scenario, in the long-term future, trusted.ci would generate the contents and run an aws s3 cp, or a local copy on the file system, to copy to all the locations where it's needed. And then we can update the mirrors.
We can even add a CDN on top of that, to handle the locations that cannot have any mirror. That one introduces some lag, though. It depends on the CDN. Yeah, it depends on the CDN, the egress bandwidth cost it would generate, and the ability of the content to be cached. But the CDN could keep everything and be used as a last resort, so we don't have to propagate everything while still being close to the user's geolocation. Yes, but then you depend on how the CDN resolves locations, and currently Fastly doesn't have anything in China. That's why, if we control the mirrors, we can have a mix of all of these, while the CDN alone doesn't. Okay, I like the way we're handling that: as a last resort, if we cannot have a bucket close to the client, we still have the ability, depending on the price and the system, to use a CDN as the final fallback. Because we've got the solution: in fact, we are handling that with our own mirrors, propagated onto different networks. Okay, do you have other questions on that topic, or does it look good to you; anything not clear? Okay, so that's the main priority for Hervé, because that's quite a cost and we want to get rid of that virtual machine. That's our new top priority now. Next issue: artifact caching proxy is unreliable. If I understand correctly, that issue is closeable; we are waiting for Basil to do a final validation, since the fix was tested and merged one or two weeks ago. So we use the new ACP, the artifact caching proxy, on all the builds now, so that should decrease a bit more what we download from JFrog Artifactory. Is that correct, Hervé, or did I miss something? Thanks for taking care of that. That also underlined something: we don't have any more network overlap issues. We have almost removed everything, we just have a few elements left, and I will come back to this a bit later, but it proved that the network and the virtual machine we use for ci.jenkins.io as of today are now stable enough to sustain ACP. Closeable, waiting for final confirmation from the developers. Thanks for this one. Next, we have an issue: a Jenkins server is unable to download plugins from updates.jenkins.io. The user seems to have issues connecting to the RWTH Aachen university mirror that mirrorbits is redirecting them to. Both Mark and I have checked that it's not a problem on our mirror or on the Aachen university one: the files are there, and we can access them from multiple ISPs. So it looks like the issue is on their side, and I've invited them to contact the administrators of the Aachen university network directly. They will check whether their requests are denied on that end, or whether their own internal firewall is blocking; I don't know, but there is nothing we can do to help them beyond that. If it's okay for everyone, I will keep that issue open, but I won't put it on the next milestone; is that okay for you? And I'm adding a personal reminder to check on this in one or two weeks, and close it if we don't have any answer. So the user has to contact the Aachen university; nothing else from us. Okay. We cannot patch the mirror redirector, because it's based on their geolocation, so there is nothing we can do about that, as far as I can tell. Next: issue while creating a Jenkins infrastructure account. I guess, Hervé, that's the one you mentioned this morning? Oh no, it's a new one. Okay, so we have a potential new contributor, and we have to take care of this one, because the user wants to create their account.
I've assigned myself this issue. Hervé, are you okay to assign it to yourself instead, since I asked you to check the logs? And do you mind creating their account to unblock them in the short term? In the long term, we keep the issue open to keep the log research as a work item; is that okay for you? Yes. Cool, thanks. I don't think we have a problem: I see 80 warnings for the last 15 days, which seems normal to me, since they are not only account issues but other stuff as well. Okay. Long term, let's check the logs to see if anything is weird with the account app. Okay. Next, the issue about ATH agents becoming unresponsive: there have been multiple factors on this one. There was a regression in the ATH itself, but that has been fixed, and still some issues have been caught by James. So, following the pointer from Hervé, we concluded as a team that we could temporarily disable the spot mode for those agents. We'll do it for the upcoming week, until the next milestone, and we'll see if it changes the behavior of the ATH builds. If it doesn't change the behavior, that means we have to keep searching for the root cause, because some elements feel like the same problem we have with the BOM builds, as if some threads were stuck while trying to garbage collect or manage the agent connections. We changed from SSH to inbound agents for these agents; that could be a cause. Maybe the inbound serialization protocol is weaker than the SSH protocol for reconnection or tolerating network issues. And finally, if removing spot instances improves things, then we will have to study carefully the size of the instances, because, as pointed out by Hervé, spot is about nine times cheaper with a low eviction rate; without spot it will clearly be way more expensive. So we'll have to check whether non-spot is the best fit for this workload. We need to check the metrics in Datadog and reconsider the choice we made, which is not a problem, but that's why I propose a one-week period during which we test the ATH as much as possible. Is that clear, does it make sense for you, or do you have questions or objections on that proposal? Okay. So, Hervé, you've merged the change; can I let you comment on the issue? Then we will have to wait. Is that okay for you? Disabled for one week; let's check the results. Next, there's a request for you, Mark, just below; it's a Jira request. Oh, okay. No, I can take care of this one. I might need help, Mark, just in case; I might ask you for help tomorrow, if that's okay for you. Yeah, I'm not a highly skilled Jira administrator, but I'm happy to attempt it together. It's only a category, so that one should be easy, but if I'm blocked, I will ask for help. So I'm taking this issue. Okay, the next issue is ci.jenkins.io failing for a Jenkins plugin after a change in the Jenkinsfile. I'm a bit annoyed by this one, to be quite honest. We discussed it; it's the third issue, as far as I can tell, from that plugin maintainer. They say they have a problem, they're given different elements to fix their problem, and then we don't have any feedback until two or three weeks later, when they open a new issue. There might be, as pointed out by Tim, a bug in the GitHub scanning system of ci.jenkins.io. Is it a misconfiguration, an edge case, a bug in the plugin itself? I'm not really sure, since this user hasn't carefully followed up on the repository permissions and related elements; they manage their whole setup themselves.
And because that plugin was a fork migrated to the organization, there might be a combination of parameters at play; it would help us a lot if the user were responsive, which they are not. So I propose that we keep that issue open, or we can close it; if we keep it open, we have to tell the user: please, we need your help, because it's a weird case, and you need to be responsive or we cannot help you. Because it's a bit infra, a bit jenkinsci organization administrator, a bit ci.jenkins.io administrator; I mean, it's cross-team. So yeah, we need them to help. Question: I had an experience today that might align with this. Okay. Could you scroll up to the top to see the description: was it that they had permissions on ci.jenkins.io that they didn't have on GitHub, or the other direction? It was that ci.jenkins.io did not realize they were a trusted person. Okay, so that's different from what I had: I had a case where GitHub, or Dependabot, thought I was not trusted, but then I was allowed to merge the pull request myself. So my situation is different. Thanks; nothing to do with this, then. Okay. Yeah, so for that one, the suggestion from Tim makes sense, but we need the user to answer. So, is it okay if we keep it open, and we don't expect the Jenkins infra team to work on it unless something points to the ci.jenkins.io setup? I mean, it's the only user affected, so there is something really edge-case here. I guess Tim has the right feeling, but I don't see anything we can do to help them more than what we did. Do you agree, or do you see something else we could do to help here? Okay, I will keep that issue open, put it on the next milestone, and we'll see. I propose a TTL of next week: if we don't have any feedback, we close it, saying okay, we need feedback and we need you to be responsive for us to help. You ping them before; you don't just wait a week, you ping them now and then wait a week. Yeah, we add a message now telling them: please, we need your feedback. And then next week we close it if no answer comes back. Next issue: assess Artifactory bandwidth reduction options. Mark, can you give us a quick summary of what changed during the past week on that topic? Yeah, I had a discussion with James Nord, where he identified a very rapidly implementable technique to reduce bandwidth use, without requiring a release of all POM files and without requiring everyone to adopt new POM file releases. So I've scheduled a session for tomorrow to discuss it in more depth; it's described in the issue ticket there. The idea is that we just password-protect our cache of Maven Central. That's the only thing we password-protect in this effort, but that cache is a 10x larger data volume than any other repository that we cache. So it's one change, and because that repository is automatically included as a fallback by Maven, our no longer serving it will just cause the builds to revert to asking Maven Central for the artifacts, which is where everyone else asks anyway. Oh, we don't have to say "go to the original one"; it's automatic? Right. Now, that only works for exactly one repository.
And it happens to be a major, high-visibility repository. It doesn't work for JGit, it doesn't work for many others, but for this one it just so happens that the data volume from Maven Central, in our measurements, is 10x greater than the next largest mirrored repository. So it's a very interesting option; if this works, we should implement it. The question then for the security team, or the question we'll raise with the security team tomorrow, is: is it okay that we're relying on Maven Central for artifacts, and that we would potentially pass through other repositories in the search process? Immediately, it will just be repo.jenkins-ci.org falling back to Central, but if JFrog requires more than that, then we have to put intermediate steps in, and there's some danger there on the supply chain side. So it feels reasonable; we'll discuss it further tomorrow. Sorry, maybe a dumb question, but you were questioning the security of relying on Maven Central because something could happen between us and Maven Central, am I right? No, not so much that, because we're talking to Maven Central over a TLS-secured HTTPS connection, so it's not that someone could get between us and them in the data transfer. It's rather that there is a search order, which is used to search first this repository, then this one, then this one, and Maven Central is always the last one searched. The benefit for us today is that our cache satisfies all requests; the problem is that this means we have a full cache of Maven Central sitting inside our repositories that other people are using, without justification. Sorry, I have another dumb question, which may be linked: you said it would be password-protected from now on. Okay, so who will use the password? I mean, will our Jenkins plugin builds use that password, so that nothing changes for them, or will we also use Maven Central directly for the Jenkins plugin builds, which would mean our builds could take some more time? What I understood is that our builds will still use our own cache, but this cache won't be available to anyone else. Well, careful: the password will be hidden inside the ACP, the caching proxy that serves the builds, which means ci.jenkins.io will only use the password through the ACP. So ci.jenkins.io users will never be able to see the password or get at it in any way; it will be hidden somewhere that is not reachable. Today they can only get the password used to connect to the ACP itself, which means they could reuse it at home and use the ACP from outside; but we should quickly shift to an ACP restricted to the agents inside ci.jenkins.io only. Okay, much clearer. Thank you. And Maven Central will be used as the fallback. One thing, Mark: let's be careful about the test suggested by James. We need to do that test as a first layer, but we still need to plan a brownout when we enable authentication on the upstream Maven Central repo inside the tree, because the mirror will return a different HTTP answer once it asks for a password, and the fallback behavior could change if Maven receives a different answer from the remote in the search-order fallback system. I'm sure James is absolutely right and that it should be enough; we need to run that test, but we still have to anticipate weird fallback behaviors.
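The search-order behavior under discussion can be pictured with a toy resolver; the repository URLs are illustrative only, and the 401 branch is exactly the case the brownout has to validate:

```python
# Toy illustration of Maven's repository search order: each repository is
# tried in turn, and Maven Central is always the implicit last entry. If
# the now password-protected cache answers 401, resolution simply falls
# through to Central.
import requests

REPOSITORIES = [
    "https://repo.jenkins-ci.org/public",    # our Artifactory cache
    "https://repo.maven.apache.org/maven2",  # Maven Central, the final fallback
]

def resolve(artifact_path: str) -> str:
    for repo in REPOSITORIES:
        url = f"{repo}/{artifact_path}"
        status = requests.head(url, allow_redirects=True, timeout=10).status_code
        if status == 200:
            return url
        # On 401/403 (password-protected) or 404, try the next repository.
    raise LookupError(artifact_path)

print(resolve("junit/junit/4.13.2/junit-4.13.2.pom"))
```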
Wholehearted agreement there: we must not implement this without a brownout first, and we may need multiple brownouts to be confident that it's behaving the way we expect. Absolutely. No question in my mind that once we get agreement on the concept, a brownout is the next step. And then we define carefully, I think, the things we want to test during the brownout: do we need to make artifact caching proxy changes in order to run the brownout, or can we run it unmodified? Those kinds of questions. Okay, thanks. Noted in our infra notes, for reliability: the password is hidden inside the ACP, not inside ci.jenkins.io. Okay, do you have other questions, things to note here? Cool. So that's good news. I don't say we shouldn't have a highly available setup, but at least we won't be forced into doing it at the last minute, if this works. Next: this one is a consequence of the ci.jenkins.io virtual machine being migrated to a new virtual machine two or three weeks ago; we also moved the agents, which changed region, subnet, and connection method from SSH to inbound. ACP and everything works perfectly, but certain tests of the ATH use a network protocol named HKP, used for retrieving public GPG keys from public key servers; it appears in some of the Dockerfiles of the ATH. And those builds were failing because, I admit, I've been a bit paranoid: the firewall rules on the new network for these agents now forbid everything by default, except a few rules. I expect the agents on ci.jenkins.io to only use outbound HTTP or HTTPS. Eventually, we could think about SSH to the GitHub.com public IPs, or to some GitLab instances. But the goal is to be sure that we don't have agents trying to do weird things. It's not absolute; of course, anyone can forge a weird thing through what's open. I mean, I've used SSH through port 443 to bypass firewall rules in my previous organization. So I don't say it's impossible, but defense in depth says that, by default, if you don't need it, forbid it. Also, the big deprecation message, "please avoid using public GPG key servers", is a message that should say: hey, let's avoid using the HKP protocol, and instead let's copy the public GPG key next to the Dockerfile and use it. All of this information has been passed along. I need to check, but that one is closeable. Any questions, anything unclear, suggestions or objections on the whole firewall and HKP thing? Okay.
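The recommended alternative to HKP, checking the key in next to the Dockerfile, amounts to something like this minimal sketch; the file names are placeholders:

```python
# Instead of fetching keys over HKP from public key servers (now blocked
# by the default-deny firewall rules), ship the public key alongside the
# Dockerfile, import it locally, and verify against it.
import subprocess

def verify(signature: str, artifact: str, pubkey: str = "maintainer-key.asc") -> None:
    # Import the key checked into the repository; no network access needed.
    subprocess.run(["gpg", "--import", pubkey], check=True)
    # Raises CalledProcessError if the signature does not match.
    subprocess.run(["gpg", "--verify", signature, artifact], check=True)

verify("jenkins.war.asc", "jenkins.war")
```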
Next issue: AWS summer 2023. I've closed the previous issue, which was spring 2023, because we were already able to decrease the bill, and here is the usage for June 2023; it's a follow-up issue. We have four actionables for that summer; for me, summer lasts until mid-September. So for the upcoming month and a half, more or less, we'll have to work on these four issues; it's a kind of epic issue. What do they describe? Moving the AWS machine away to decrease the outbound bandwidth costs: that's the updates.jenkins.io update center index move to Cloudflare and elsewhere, as you can see on the diagrams; that's the data-transfer-out bytes. I haven't given more details here, they're in the other issues, but it's only about the updates.jenkins.io virtual machine. Please note that we also have pkg.origin.jenkins.io, but since that one is backed by Fastly, a CDN in front, it shouldn't be an egress consumer. It shouldn't generate a lot of outbound bandwidth, because Fastly protects us. That one should also migrate to Azure; it's a separate topic, but part of that bullet. The goal is to decrease the AWS bill by almost half once we are finished with that machine; that one is the biggest. We also have two quick wins: two virtual machines for the services usage.jenkins.io and census.jenkins.io, which could be moved either inside Kubernetes or as virtual machines; we need to evaluate that for both services. They are running on AWS and there is no need for that: since the Azure billing is now way below the limit, we can afford to move these machines, and that will make us less coupled to AWS. Mark, a question: do you remember what the purpose of the census.jenkins.io service is? Because we are not sure. I don't remember, but I can do some research here really quickly and do some checks. No, no worries then; it was just in case you had it in mind. So that means, for the whole team: we want to check census.jenkins.io. Usage is quite easy; there is already an issue open for that, from years ago, by Olivier himself. So I think we can start working on these two, and then we will evaluate census. Then, the issue about the S3 artifact storage. As we saw two weeks ago, with the S3 artifact manager on ci.jenkins.io, we should not let the plugin delete artifacts when builds are rotated, for a lot of reasons. So we disabled that behavior, which is otherwise the default and recommended behavior. That means we need to create a garbage collection system on the bucket where the ci.jenkins.io artifacts are stored, with rules such as: delete any artifact older than one month, for instance. The goal is to avoid a bad surprise in a year and a half on the AWS bill. Right now we are still around half a dollar per day, so 15 to 20 bucks per month, but given the rate of increase, we need to garbage collect as soon as possible. So yeah, these are the issues. I propose that we keep the top-level issue on the milestone; we have the updates.jenkins.io work first, and then we will consider the others. Is that okay for you? If anyone has time, you can start on the other tiny issues, but right now the first one is the top priority, because it's the most visible. Any questions, objections, anything unclear, suggestions on that topic?
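The bucket-side garbage collector described above maps naturally to an S3 lifecycle rule; a minimal sketch, with a placeholder bucket name and the one-month retention used only as the example discussed:

```python
# Expire every object after roughly a month, so rotated builds on
# ci.jenkins.io never leave artifacts behind forever.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="ci-jenkins-io-artifacts",  # placeholder name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-artifacts",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "Expiration": {"Days": 31},
                # Also reap leftovers from interrupted uploads.
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    },
)
```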
Next one: Matomo, the GitHub Docker repo. On that one we didn't do anything yet; there are two parallel tasks we are going to work on this milestone. One task is creating a MySQL managed instance in Azure, like the PostgreSQL instance we already have. Stéphane, are you okay to work on this with me; we pair? Yes. Why with me? Because you may eventually need to reconfigure your machine; you might have a laptop issue. So I propose tomorrow, or whenever you want, once I have time to set it up. Exactly. That's why, to avoid putting pressure on you, we pair: I drive and you dictate. Is that okay for you? You just don't want to say that you love working with me, but okay. Absolutely. Unless someone else wants to pair with Stéphane or with me, or take that topic, of course. So that's the first task; and the second parallel task is to take all the elements that Gavin created, see if the Docker image is built or if it's missing something, like the proper pipeline function on infra.ci or the proper permissions on Docker Hub, and check all the elements until we have a valid image published on each tag created on the repository. Next: the Ubuntu 22.04 upgrade campaign. Everything is now on Ubuntu 22.04 except the Puppet machine, which is still on 20.04. So no emergency; we have two years in front of us. But we cannot move it to 22.04 unless we switch from Puppet Enterprise to Puppet open source in version 7: today we have Puppet 6 Enterprise, and we will have to upgrade it in any case to the 7 or even the 8 line. Pull requests are piling up on the Puppet modules that we cannot merge, because they require Puppet 7. So Puppet 7 will be a topic to treat in August; I don't think we should focus on it in the upcoming milestone, and we can defer it to the month of August, if that's okay for all of you. The other machine noted in the Ubuntu 20.04 list is the pkg/updates virtual machine that we mentioned earlier. So the work for that issue is the pkg VM: Hervé is already working on the update center part, and on my side I've started studying the rest, and I don't mind anyone helping, pairing, or taking over; I'm interested, but I don't mind sharing or delegating. The goal is to move the whole package generation, including the 500 gigabytes of packages, from AWS to Azure. The process we use on release.ci.jenkins.io to build and package Jenkins core should then be able to run all of its steps locally inside the agent, instead of one of the steps being a remote SSH command. Once it's moved onto that agent, we can take care of changing the Docker image from Ubuntu 18 to whatever system with the proper tooling. But right now we need to migrate that data; that's the first step, and it will require moving the pkg.origin.jenkins.io service to a new Apache service on publick8s. It's the same idea as what Hervé described for updates.jenkins.io: it will split the Apache we currently have on the virtual machine, which has two virtual hosts, into two different services with high availability inside our AKS cluster. That one is low-traffic, because it's backed by Fastly as I mentioned earlier, so the risk on cost is almost nothing; however, in terms of maintainability, that will clearly be a huge improvement. Any questions? No? Okay; the updates.jenkins.io service is handled by Hervé. One last work in progress: the Linux Foundation status page. I have no idea what this topic is about, and I need help, Mark. It's that Daniel Beck asked a question that needs to go to the Linux Foundation, and I've got to raise the question with them. As far as I can tell, the issue itself is absolutely resolved. Daniel's question was: for future cases, does the Linux Foundation need to make sure that their caching system does not cache too aggressively? The cache lifespan, the cache time-to-live, may need to be reduced. I had seen something similar minutes after, or in the hour after, the Linux Foundation brought issues.jenkins.io back online, but this report from the user was a day later. Neither Daniel nor I could duplicate it, but it's worth raising the question with the LF. Okay, I have the action to do that. Okay, cool; so that means we keep you assigned to the issue, move it to the next milestone, and wait for the feedback. Cool, thanks for the clarification; take care of it. Thanks, Mark. Next, there's a list of new items, either to be triaged or to be considered for the upcoming milestone. An open topic: moving the public IPs of the cluster that we mentioned earlier, the IPs that were deleted.
So we have that issue to track it. First, testing that we can move a public IP from one resource group to another, based on the documentation and elements we have. The goal will be to create some dummy IPs, create a dummy load balancer in the cluster, and see if we can move them without breaking the load balancer, so we know the effects of moving the public IPs and of updating the load balancer annotation so that it knows where the new IP is located. Once we have done that, and we know the behavior on the dummy load balancer, we can plan the operation on the current public IPs. That might be a brownout; it will be worth planning and announcing, because it could cut connections for one or two minutes. The goal is to add this on top of the current lock, so we are really, really sure that we don't have any problem. Do you think you will be able to work on it in the upcoming milestone? Yes. Okay. Is it okay for you to take the issue, or do you want someone else to work on it? I have already assigned myself to it, I think. Okay, cool, so I will add it to the new milestone. Is that okay? Okay, next one: mirrors.jenkins-ci.org is missing some necessary metadata files, which prevents it from being added as an apt/yum repo. I forgot to paste the link, but I will update that. I wanted to mention this one because we told the user there will be a delay. Right now this issue is blocked, because we first need to migrate the origin somewhere else. And the part about the yum repo is complicated, because we need to be able to provide a CDN for the packages inside China and other networks. This is an issue related to the amount of data stored on the mirrors. We have archives.jenkins-ci.org right now that acts as a default fallback, but it seems the apt/yum repo pattern we use only has indexes on the top-level domain, not on the mirror domains. So there might be some elements to address, such as allowing the mirrors to carry the package indexes. But that problem is the same as for the update center: these indexes must be updated almost immediately when we change them, due to a security release like tomorrow's. That's why Olivier never went down that road. So that issue is blocked by the work everybody is doing on updates.jenkins.io, because if we can have our own instance of updates and packages, with a mirror redirector system that we control, then we will solve that issue easily. We can also put the package indexes in a bucket somewhere else, with a redirection, and keep the big .deb and .rpm packages on the get.jenkins.io mirror system. So on that one I will add a comment, and I will put it back not on the current milestone but in the infra-team backlog, because it's blocked by migrating pkg and updates somewhere else and starting the redirection to serve China. There are two requirements here, but at least I want the user to know that we are working on the topic. Okay for everyone? A comment that it's blocked by the updates.jenkins.io work. I don't think I will be able to work on it; I don't know for you folks. I think it's a low-priority issue. Okay, it's less than 10 dollars per month. Good for me; no objection. That just made me think, about the redirection: when we enable the spot instances back, we might want to add a garbage collector for kicking out the Azure VM agents that have lived for more than 24 hours. Good point.
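Such a garbage collector could be a small scheduled script along these lines; the resource group name, and the availability of the time_created property on the VM model, are assumptions rather than the real setup:

```python
# Delete leftover ephemeral agent VMs older than 24 hours.
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"  # placeholder
RESOURCE_GROUP = "ci-jenkins-io-ephemeral-agents"         # assumed name
MAX_AGE = timedelta(hours=24)

client = ComputeManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)
now = datetime.now(timezone.utc)

for vm in client.virtual_machines.list(RESOURCE_GROUP):
    # time_created is exposed on recent compute API versions; if it is not
    # available, a creation-timestamp tag would be needed instead.
    created = vm.time_created
    if created and now - created > MAX_AGE:
        print(f"Deleting leftover agent {vm.name} (created {created})")
        client.virtual_machines.begin_delete(RESOURCE_GROUP, vm.name).result()
```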
Do you mind adding a comment on the ATH issue that James opened, so that next week, when we check the feedback from the week without spot instances, if we want to keep them disabled, we have a message reminding us to implement the garbage collector along with that change? Is that okay for you? And from now on, we need to check daily whether there are some leftover VMs, because we changed the spot setting; we need to remember that. Yeah, but worst case, we can wait for the weekend and clean them up. The impact over one week is low; I can't say it's nothing, but it's low. But good pointers, good pointers on both. And you are volunteers, both of you; you mentioned the topic. That's a gift. You're so nice. I'm adding a note: let's work on a GC. I've also added a note in the weekly meeting notes, and I commented on the issue. Cool. Thanks. The two next ones are topics for Stéphane, since you're back from vacation. The topics are related to ARM64. The primary one is to check which services can be migrated, like Javadoc, to the ARM64 node pool, to decrease the operational costs of the cluster. So you'll have to put your hands back on the subject, write things down, share with the team, and work on it once you have located services, including announcement, validation, etc. And the second one is working on the node taints and tolerations. The idea is to improve the way we give hints to the Kubernetes scheduler when we define the services. The main challenge is to be sure that we can have two node pools of the same kind with different names, so we can drain the first one with a manual operation and then remove it. What happened during the cluster deletion two weeks ago shows that this would be really useful in that kind of situation, allowing us to move node pool by node pool, and not all node pools at once like I did. So that one will be useful too, even if it's a bit lower priority than the ARM64 migration. So, Stéphane, I'm adding these two to the milestone, and once you have a fully working laptop, and not before, you can start putting your hands on it; remember to take notes and act on them. It might have to wait for the next milestone; no problem. Thank you. The next-to-last one: on Kubernetes clusters, defining the infra.ci admin service account as code. That's an issue I opened as a consequence of the Kubernetes 1.25 work: I realized that I had never shared the shell script. When we create a brand new Kubernetes cluster, we need a technical user with administration permissions, so that infra.ci.jenkins.io can install and manage charts as an administrator of the cluster. That one was, most of the time, created manually by someone named Damien Duportal, with a dark and shadowy shell script on his machine. So the goal of that issue is that this should be done by Terraform when it creates a cluster: once the cluster is created, Terraform should create the accounts and prepare a sensitive output that we can immediately put into infra.ci. It's quite an easy one, and I've almost finished my proof of concept, so that should be good for this milestone.
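The issue itself targets Terraform, but the Kubernetes objects involved are easy to sketch: a technical ServiceAccount bound to cluster-admin so infra.ci can install and manage charts. Names are illustrative, and the real setup would come from Terraform, not a script like this:

```python
# Create an admin service account as code on a freshly created cluster.
from kubernetes import client, config

config.load_kube_config()  # run once against the new cluster

core = client.CoreV1Api()
rbac = client.RbacAuthorizationV1Api()

core.create_namespaced_service_account(
    namespace="kube-system",
    body={"metadata": {"name": "infra-ci-admin"}},  # assumed account name
)
rbac.create_cluster_role_binding(
    body={
        "metadata": {"name": "infra-ci-admin"},
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "ClusterRole",
            "name": "cluster-admin",
        },
        "subjects": [
            {"kind": "ServiceAccount", "name": "infra-ci-admin", "namespace": "kube-system"}
        ],
    }
)
```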
The last one: remove the IP restrictions, or migrate to the VPN. That one is my way of assessing a topic we don't have an issue about yet, which is pretty important: we need to get rid of the old overlapping networks. We did the heavy work, mostly Hervé; the subnet that was overlapping and creating issues was deleted during the past two weeks, as part of the Kubernetes 1.25 work and the cleanups. However, we still have two services, which we need to define, that are running inside the old public network. One is vpn.jenkins.io. That one is easy: we just want to get rid of that service and forget about it, so remove the virtual machine, the code, etc. We still need to clean up resources, so an issue is required for removing that service. And cert.ci.jenkins.io, which is the Jenkins controller used by the Jenkins security team. I chose to defer the deletion of this one because of the security release due tomorrow, because they require that controller to be able to work; that's why we didn't work on it two weeks ago. Once they have done the security release, if we don't have another security release, even privately shared with them, in the upcoming two weeks, then we can migrate cert.ci to the new network and clean up all the resources of the old network. Do we agree that what you're talking about has nothing to do with that issue? It has; okay, let me explain. So let me just write down that I need to create three issues: delete the vpn.jenkins.io resources; move cert.ci to the new private network; and then delete all remnants of the old overlapping networks. The reason is that there is an issue about trusted.ci, which is in its own brand new virtual network somewhere, which is neither the new public network nor the private one we have today; it's another one. And the point is that we want to restrict SSH access to that trusted.ci, and we have the same pattern for cert.ci. The restriction is not the same; we don't have the same people on both. There is some overlap: Daniel Beck, for instance, needs to reach both. But not every member of the security team needs to reach trusted.ci, and not everyone from the infra team, or at least our usual contributors, should reach cert.ci. So they need either separate subnets inside the private network, or their own virtual networks. In any case, today, cert.ci users reach cert.ci using the VPN. That's the subsequent topic: we need the same kind of method, in any case, to reach trusted.ci and cert.ci, and that method should have an access control list that says: oh, you are that user, so you can access both, or one, or the other. The VPN looks like that method, but whatever it is, it has to be the same method for both; what is hidden here is that we need to select a method that will work for both. So that means cert.ci should have the same paradigm at the network level as trusted.ci, which means its own virtual network, and eventually a subnet, and we peer that network with the private network. Does it make sense now, with that explanation in mind? Thank you. I've mentioned it somewhere on the issue, so we need to work on that one as well. My proposal is that I start working with you folks on this one before working on the three upcoming issues, because first we need to validate the peering part: if it doesn't work for trusted.ci, we won't repeat the same mistake for cert.ci. Any questions, objections, things to clarify on that topic? Nope. Okay. Do we have other new issues, or other topics? I'm looking at the new issues; I need to remove the triage label on these two, but otherwise there are no new issues. Do you have other topics? None for me. Any questions, anything you want to say? Cool. So I'm stopping the screen share, and I'm stopping the recording for the people watching us. See you next week. Bye bye.