Hi everybody, and welcome to this new Jenkins infrastructure meeting. We had quite a busy week, so we have a few things to announce.

The first one is that we are planning to upgrade the community Kubernetes cluster that we run, from version 1.18 to 1.19. We'll work on it on Friday to first identify potential issues and, depending on the situation, we may also jump directly to 1.20. That's the current state: we just want to be conservative, as we don't want to break too many things. The Jenkins contributor summit is in three weeks, so we want to be sure we don't introduce any issues before then, which is why we are starting now. So the plan is to prepare everything on Friday, and then, depending on the outcome, we'll do the upgrade next week. More to come.

In terms of events, what happened? We had several issues. The first one is that yesterday some people could not authenticate on repo.jenkins-ci.org. It appeared that JFrog, who host that service, did an upgrade and the IPs changed, so we needed to identify which new IPs had to be allowed in our LDAP configuration in order for authentication to work. That's done, and it's working correctly now. I'm currently writing a post-mortem if you want more details; I'll publish it on the Jenkins infrastructure documentation repository.

The second thing I want to mention is a recent change to the mirror infrastructure behind get.jenkins.io. I had to configure archives.jenkins.io to be available over rsync in read-only mode before adding it to mirrorbits, which is now the case. So now, whatever request you make to get.jenkins.io, you are always redirected to one of our mirrors, which is pretty important. The next step is to add more mirrors, so I would like to identify people interested in contributing mirrors. We also identified a few ways to improve the performance of the service. But at least it's working correctly now, which is great.

The next topic is the various issues we had with ci.jenkins.io. We had two main outages there. We use Azure Container Instances (ACI) to run some jobs: Jenkins provisions those containers, which run builds and test plugins, for instance. And for some reason we were not able to use those ACI agents. Maybe Damien can provide more details here.

Yes, so, two kinds of issues. The most frequent one is related to ACI: we had two interventions in eight days, and that's not the first time. There is a bunch of issues related to the Azure plugins that are being fixed by Tim Jacomb as we go (thanks for that), which helps, but we are still seeing a lot of Azure API issues. The idea is that we should not depend only on ACI. When it works it's really efficient, because the agents start very fast and provide good performance. So one short-term action point is that I'm going to write a proposal to mix the ACI workload with a Kubernetes workload on ci.jenkins.io, to make sure the blast radius when ACI goes down does not block all the jobs, and to try to balance the costs of ACI.
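That proposal is still to be written, so purely as an illustration of the balancing idea, here is a minimal sketch in Python. The function name, the health flag, and the 70/30 split are all assumptions made up for the example; they are not the real Azure or Kubernetes plugin APIs.

```python
import random

def pick_agent_backend(aci_healthy: bool, aci_share: float = 0.7) -> str:
    """Choose a backend for the next build agent.

    Most builds go to ACI while it is healthy (agents start very fast
    there), but a share always lands on Kubernetes so that pool stays
    warm and can absorb the full load when the Azure API misbehaves.
    """
    if aci_healthy and random.random() < aci_share:
        return "aci"
    return "kubernetes"

# During an ACI outage, every job still gets an agent:
print([pick_agent_backend(aci_healthy=False) for _ in range(3)])
```

The point is simply that no single provider failure should take down every executor; the actual mix and ratios would come out of Damien's proposal.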
The second kind of issue is one that has existed for more than a year: the classic JNLP agent disconnections on ci.jenkins.io. I've added to the report a link to a one-year-old issue, INFRA-2548, which mentions EC2 issues with the exact same error message. Sometimes it's a long-running job, sometimes it's random; it's really hard to troubleshoot. In the past it seemed to be related to pure infrastructure issues: desynchronization between the controller and the agents, a disk that is full, not enough memory for the JVM. None of these symptoms were present during the past days; we checked, and still the disconnections happen. One direction for a short-term action would be to switch to WebSocket agents, but this needs some work on the ci.jenkins.io reverse proxy, so that might not be a short-term improvement. Thank you.

Damien made the effort to write post-mortems, so there are two places where you can find them. When we consider a post-mortem archived, meaning we're not expecting any further modification, it is published on the jenkins-infra documentation repository, under the post-mortem directory. There you can see the first two, because that's only the beginning of this procedure. We try to follow a template: we have a timeline, the different things that were affected, and so on (obviously this one does not contain the full template yet). Otherwise, you can go to hackmd.io (it's loading): for instance, these are the post-mortems that we are currently working on, and you can see specific ones for repo.jenkins-ci.org, the ACI agents, and so on. If you want to provide feedback or review them, you need the link, but everybody should be able to read them; if that's not the case, it means we have things to improve. As I was saying, we usually follow the template: we specify the chronology, what the impact of the issue was, what went wrong, and so on. Once a document is ready, we push it to the git repository. That's the way we are working at the moment; it just makes it easier to explain technical issues. Feel free to give any feedback on this new workflow, as we are really trying to improve the visibility of what's happening.

Again, when we have an issue that affects the Jenkins infrastructure project, you can also go to the status page, which is loading pretty slowly for some reason. Yeah, that was just my machine. On the status page you have different kinds of information. The first thing you may see is whether there is an actual ongoing issue; this is something that we maintain. If you are patient enough, we also load various dashboards: those are just iframes fetched from our Datadog account, and the idea is to display basic HTTP response times. Because of the number of dashboards on this page, it's pretty slow, so we definitely have some improvements to make; you just have to be patient. At the end of the page you also have past events. If you're interested in helping us improve this status page, you can: there is a git repository, again, jenkins-infra/status (sounds like my machine is just very slow), and you can build the status page from there. We work with pull requests and so on. So that's how it works: if we have an incident, we first publish the incident on the status page; once the incident is closed, we write a post-mortem; and we usually communicate on the mailing list as well. Once we're more confident, we may start using Discourse too. But yeah, those are just improvements.
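To make that workflow concrete, scaffolding a new draft from the template is easy to automate. Here is a minimal sketch; the section names follow what was described above, not necessarily the exact template in the repository, and the helper is hypothetical.

```python
from datetime import date

# Section names follow the meeting description (chronology, impact,
# what went wrong); the real template in the jenkins-infra
# documentation repository may differ.
SECTIONS = ("Timeline", "Impact", "What went wrong", "Action items")

def postmortem_skeleton(title: str) -> str:
    """Return a Markdown skeleton for a new post-mortem draft."""
    lines = [f"# {title} ({date.today().isoformat()})", ""]
    for section in SECTIONS:
        lines += [f"## {section}", "", "TODO", ""]
    return "\n".join(lines)

print(postmortem_skeleton("repo.jenkins-ci.org authentication outage"))
```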
If you have any improvements you want to bring, feel free to share them with us. So let's go back to the meeting notes. I think we covered the whole agenda, so it sounds like we are good; we just have some cleanup to do in the notes.

We have identified three main action points. The first one is to communicate about the Kubernetes upgrade that will happen next week; again, we'll start working on that upgrade by the end of this week. The second is that I would like to write a blog post on the mirror infrastructure we have now. It's relatively stable, and we did a bunch of improvements there. Let's take an example: I made a few changes to the mirror infrastructure last week. The first one is that I configured archives.jenkins.io to be sure that it's only used when no other mirror is available. In my case that's pretty obvious because I'm quite far from that mirror, but I also ran some tests from a proxy located in the US, and I got confirmation that if another mirror is available, then you're always redirected to that one (as in the sketch at the end). What I used to confirm this was the stats: we have access to stats for every mirror, and the data are reset every day. For every mirror you have two lines. Take XMission: the first line, the dark blue one, shows the number of downloads that happened on that specific mirror, and the second line, the light blue one, shows how much data was transferred from it. In this case it's two terabytes, and Belnet was almost one terabyte. What I want to make sure of is that archives.jenkins.io remains mostly unused; this time it's 700 gigabytes, which is acceptable, but that's something we'll have to monitor in the future. Anyway, I would like to write a blog post on this topic. And the third action point is that Damien will keep working on stabilizing ci.jenkins.io, using Kubernetes agents on top of the ACI agents.

Any last topic? Damien, do you have anything else you want to bring up here? No, that's already a lot. Then, if you are a ci.jenkins.io user and you see issues, don't hesitate to report them as soon as possible on IRC, and don't hesitate to propose help or ideas. If what we provide does not fit your needs, or if you see too many issues, we are there to improve the service, so don't hesitate to give feedback. Thank you. I already put up the link for next week's meeting, so feel free to go there and bring any topics you want to discuss. Thanks for watching, and see you. Goodbye.
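For reference, the redirect check mentioned above can be reproduced in a few lines of Python. This is a minimal sketch using the third-party requests library; the artifact path is only an example, and any file served by get.jenkins.io should behave the same way.

```python
import requests  # third-party: pip install requests

# get.jenkins.io should answer with a redirect to a nearby mirror;
# the path below is just an example artifact.
url = "https://get.jenkins.io/war-stable/latest/jenkins.war"

resp = requests.head(url, allow_redirects=False, timeout=10)
print(resp.status_code)              # expect a 3xx status when a mirror is available
print(resp.headers.get("Location"))  # which mirror you were sent to
```

Running the same check through proxies in different regions is what confirms that mirrorbits picks a mirror close to the client, with archives.jenkins.io only acting as the fallback.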