Hi, everybody. Welcome to this Jenkins infrastructure meeting. We have quite a lot of different topics to cover today, so let's start with the first one.

The first one is a small reminder related to the jenkins4eval Docker Hub organization. As a quick reminder, that Docker Hub organization was put in place so people could quickly iterate on experiments and build Docker images for specific usages. Since the beginning it was clear to everybody that it is an untrusted Docker Hub organization, because anybody can push images there as long as they open the right pull request. Still, I saw that several people have started using those images. Again, if we need a long-lived Docker image, then we need to start a conversation to see whether we can host it under the jenkins or jenkinsci Docker Hub organizations, where the process to publish images is better. So, just a quick reminder.

The second topic I want to briefly cover is about hosting a new set of Docker images. (There was a side question about image signing: the images are currently signed with the CDF accounts, and if that continues, I think it's fine.) Someone requested that the Jenkins infra project start building a specific Windows Server Core image containing AdoptOpenJDK, targeting Windows version 1909. Initially I was in favor, as long as that person maintained the Dockerfile, the process, and so on. The reality is that it puts strong constraints on the infrastructure we are using, which we don't want: we would basically need to build a Packer image for that specific Windows version, maintain those Packer images, update the process, and so on. Considering that that version is quite close to the LTS one we already maintain, I proposed to reject that PR and close it, so we don't keep it open for a long time.

I think we need to write up a policy or something saying that we will only support the LTS versions, because otherwise we'll be doing this every six months with Windows versions.

That's a good suggestion, thanks. Yeah, definitely a small sentence in the README to say exactly that.

And I think it should be part of that Docker image Jenkins Enhancement Proposal that we discussed at the contributor summit: Docker images must have a code owner before we adopt or accept them, and we will start flagging them as up for adoption when the code owner goes inactive, so when they no longer have a code owner. I still have to write that up.

In this case it's a little bit more than that, because we also need specific infrastructure. It's not only about having someone who wants to maintain that specific image; it's also about maintaining the infrastructure that builds those images. So yeah, we'll reject that one.

The next topic is about ci.jenkins.io. Last Friday I made quite a lot of modifications there. First of all, I did some credentials cleanup: I removed credentials that were not reported as used. That doesn't mean those credentials were never used, so I may have removed credentials that were effectively in use; we'll see what happens. I also reduced the permissions of some credentials, like those that create resources on Azure, so they can now only create resources in a specific resource group.
I also took that opportunity to update the EC2 plugin, so we are now running the latest version. What I noticed after the version update was that the EC2 configuration was a huge mess: wrong credentials, EC2 templates configured to deploy resources in Japan, and things like that, so it took me a while to reconfigure all the images. The conclusion of the discussion with Damien is that we definitely need the JCasC (Configuration as Code) setup for ci.jenkins.io. While it wouldn't necessarily make sense to deploy CI on Kubernetes right now, it would already be nice to configure that instance from Puppet. It's not a huge amount of work, because we already have everything in place: we have the Puppet agent, we have the process there. We just have to put the right templating in place, and it's just ERB templates. So I'll probably start working on that with Damien; I don't think it needs a lot of work. I had already prepared a lot of things last year, although I'm not sure we will be able to reuse everything, because the JCasC plugin has evolved since then. But it's already a good start. (A rough sketch of what such a configuration file could look like follows at the end of this exchange.)

While we're talking about the EC2 plugin, I also saw people complaining a little more about timeout issues again. I had the feeling that people had stopped complaining in the past, but I don't know whether the problem was still there and people simply stopped complaining, or whether the problem was gone and has reappeared.

It's definitely been a problem, a never-ending source of bitterness and frustration; it's really painful. For me it's especially happening on Windows agents: whenever I have a plugin build that uses Windows agents, it's just a nightmare.

I'm just wondering whether we could simply redeploy ci.jenkins.io on Amazon and have the agents in the same region. Maybe that would solve it, because we already have the process to deploy there; ci.jenkins.io was on Amazon previously, so we already have everything in place to switch back to that location.

I thought we had critical dependencies on Azure Container Instances. Can we still use Azure Container Instances if the controller is running on AWS?

We can, we can definitely still use them. But this brings me to another topic: Damien has been working on deploying EKS on Amazon. This means we could replace the Azure Container Instances with Kubernetes pods.

I see. Okay, got it, thanks.

So this would reduce the dependency on Azure. Still, that's a good question, given that we want to achieve some multi-cloud capability, for instance one controller for ci.jenkins.io using different Kubernetes clusters, or a Kubernetes cluster plus a bunch of virtual machines, because Kubernetes is not a solution to every problem. So I would say we need a mix of different solutions. We also need to start measuring how much bandwidth and data are exchanged daily or monthly between the controller and the agents, in particular because some, let's say, nasty cloud providers (AWS, not to give any names) tend to give you free bandwidth until a certain threshold is met, and afterwards you have to pay a lot for the transfer. And I'm sure that for most of the clouds it's a deliberate strategy to keep you inside their cloud and products.
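For illustration, here is a minimal sketch of the kind of JCasC file mentioned above, assuming the EC2 plugin's Configuration-as-Code support. All names, regions, AMIs, and credential IDs are hypothetical placeholders, not the real ci.jenkins.io values, and the exact attribute names depend on the plugin versions in use:

```yaml
# jenkins.yaml: hypothetical JCasC sketch, not the actual ci.jenkins.io configuration
jenkins:
  systemMessage: "ci.jenkins.io is managed by Configuration as Code; manual changes will be overwritten"
  numExecutors: 0                    # build only on agents, never on the controller
  clouds:
    - amazonEC2:                     # symbol provided by the EC2 plugin; may differ by version
        name: "aws-agents"
        region: "us-east-1"
        sshKeysCredentialsId: "ec2-agent-ssh-key"   # hypothetical credential ID
        templates:
          - description: "Linux VM agents"
            ami: "ami-0123456789abcdef0"            # placeholder AMI
            labelString: "linux vm"
            remoteFS: "/home/jenkins"
```

The point discussed above is that a file like this would be rendered from Puppet through ERB templates, so instance-specific values and credentials get injected at deployment time instead of being edited by hand in the UI.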
Coming back to the bandwidth question: I think this is something that would be worth starting to measure now if we want to consider different migration scenarios like this one in the future, even if it's a completely parallel topic and blocks neither the EC2 timeouts nor the EKS deployment for CI.

That's definitely a good point, because something I wanted to test was also whether different regions would give us better performance. I know the controller is running on Azure in US East, and Amazon provides several US East regions, so maybe we could use a second one; I'm not sure whether that would improve network performance.

One more thing to consider about these timeouts is that most of the time the timeout seems related to the agent-to-controller connection. I know for a fact that WebSocket has solved part of the former JNLP-over-TCP issues, but WebSocket is still a tricky protocol to dial between the controller and the agent, so we should check that part as well. Do we have a load balancer between the EC2 agents and ci.jenkins.io? If so, we should check how that load balancer behaves at the TCP level.

I don't think we have a load balancer at that level, at least not one that we deploy ourselves.

Okay, then we should also look at the TCP connection settings in the kernel of the VM hosting ci.jenkins.io, and/or check the logs we already have about timeouts, to understand what kind of timeout it is. Can we measure that? Is it a timeout while starting the VM? Is it a timeout when the VM tries to connect back to the controller, or the other way around?

The challenge we have here is that those agents are dynamic: we have no monitoring in place for them, so we are not collecting data right now. And sometimes it works, sometimes it does not. So maybe one solution would be to install the monitoring agent on them, so we could start collecting information (see the sketch after this exchange). I just have to double-check the credentials side, because if we do those experiments on ci.jenkins.io, remember that ci.jenkins.io is not a very trusted instance.

Just to confirm what you're saying: you're thinking of moving ci.jenkins.io to AWS, but the controller would not be running on Kubernetes?

Yes, and the agents would be running on Kubernetes, apart from potentially some Packer-built VM images or whatever else we need. The reason I feel uncomfortable switching the controller directly to Kubernetes is that we already have processes in place to manage that instance. For instance, Daniel, when you do a security release, you SSH to that machine, restart the service, and so on. Considering all the other things we have to work on, I'm not sure that switching to Kubernetes would bring enough value to that instance right now. I think it would bring value, but it's a trade-off between the value and the effort, possibly. That's what I mean by switching: we would still need to fix the EC2 timeouts, because we will still need VMs provisioned dynamically; that's something we cannot avoid. That's also why I feel uncomfortable just redeploying ci.jenkins.io on the Kubernetes cluster: I don't think it would solve the short-term issues. But switching the controller to AWS means that the latency between the virtual machines we provision and the controller should be shorter, because they would run in the same region and in the same cloud account, versus across multiple clouds.
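As a concrete illustration of installing a monitoring agent on the dynamic agents: assuming Datadog (which comes up later in this meeting for alerting) were the tool used, a minimal agent configuration might look like the sketch below. The keys are standard datadog.yaml options, but the tag values are hypothetical and nothing here is taken from the actual infrastructure:

```yaml
# datadog.yaml: minimal hypothetical sketch for an ephemeral CI agent VM
api_key: "<injected-at-provision-time>"   # never baked into the VM or container image
site: "datadoghq.com"
hostname_fqdn: true
tags:
  - "role:ci-agent"          # hypothetical tags to filter short-lived agents in dashboards
  - "cloud:aws"
  - "controller:ci.jenkins.io"
```

With something like this baked into the agent provisioning, the timeouts discussed above could be correlated with CPU, memory, and network metrics instead of being guessed at after the fact.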
Regarding the ACI issues we had today: isn't the workload always split between the ACI agents and the EC2 agents?

It depends on the labels, depending on what you need. If you just need to run something with Maven, then we provision a Maven container and you run the workload there. But if for some reason you need to run a lot of tests and you need an EC2 machine, I mean a full virtual machine, for specific use cases, then you get a VM. So right now we need both; it depends on what we test. I'm pretty sure the ATH tests need a full machine, for instance. On that topic I don't have a strong opinion; I think we can provide both.

Okay, so do we understand correctly that the ACI agents could potentially be replaced by Kubernetes agents on EKS?

That's what I said, like, five minutes ago: if the work on EKS is ready and we can start using that cluster for the container agents, that would mean every resource is in Amazon, in the same Amazon account. If we have the EC2 agents in one region and the EKS cluster in the same account, we could easily move the container agents to Amazon (a sketch of such a container agent follows this exchange). So we would have everything in the same place, and obviously the timeout issue, I mean the response time, should be smaller. The reason I was saying that is because, in the Puppet configuration, everything was running on Amazon in the past: we were using EC2, and we were using the Amazon equivalent of ACI, which was ECS I think. Everything was running in Amazon, and then we did the migration to Azure. That's when we started using Azure virtual machines and ACI for the containers, and the controller was put on Azure; and now we are doing the migration back. And the good point that you raise here, Damien, is that we don't know what the future will be, so the more we can run on Kubernetes, the more freedom we have to move between cloud vendors. But for now, I think it's better to focus on the agent issues rather than on where the controller is running; and if moving the controller helps with that, then let's move the controller.

Are there any regional problems around infra.ci and release.ci not being in the same area?

Do you mean for the release environment?

Yeah, for release and for infra.

They are simply independent. Right now we have multiple Jenkins instances. What they all have in common is that they fetch code from the same GitHub organizations; but, for instance, we don't generate artifacts on ci.jenkins.io to push to a different location. Release.ci is fully running on Kubernetes, so we just provision pods when we need them. Cert.ci and trusted.ci have a much lower usage; they just provision nodes from time to time for some specific tasks. Ci.jenkins.io is definitely the biggest one, but we try to keep it in balance with everything else. Any last question on this topic?

So the next point for me is working with Damien to configure JCasC there. This will definitely help us in the future when we have issues like the one on Friday, where we had to reconfigure everything, audit it, and try to understand what changed and when. This will definitely simplify the management of the configuration there.
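To make the idea of replacing the ACI containers with Kubernetes pods concrete: with the Jenkins Kubernetes plugin, a container agent is just a pod definition, so a Maven workload could be declared roughly as below. Image tags, labels, and resource figures are illustrative assumptions, not the actual ci.jenkins.io setup:

```yaml
# Hypothetical pod definition for a Maven container agent on the EKS cluster
apiVersion: v1
kind: Pod
metadata:
  labels:
    jenkins: agent
spec:
  containers:
    - name: jnlp                             # connector container the Kubernetes plugin talks to
      image: jenkins/inbound-agent:latest    # placeholder tag
      resources:
        requests: { cpu: "500m", memory: "512Mi" }
    - name: maven
      image: maven:3-jdk-8                   # placeholder tag
      command: ["sleep"]
      args: ["infinity"]                     # kept alive so build steps can exec into it
      resources:
        requests: { cpu: "1", memory: "2Gi" }
        limits: { cpu: "2", memory: "4Gi" }
```

Builds that need a full VM, like the ATH case mentioned above, would keep using the EC2 agents; the job labels decide which pool serves which workload.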
Then we'll probably also try to identify how we can add the monitoring agent to the dynamic agents, so we can monitor them and maybe better understand our performance. If you don't have any more questions, I'll go directly to the next point.

So, I mentioned EKS. Damien has been working on Terraform code to provision an EKS cluster, and we are almost there. As far as I know, what remains is the credentials we have to put in place. We still have some small configuration to do for that cluster: we want the monitoring agent and, basically, whatever else we need there. Ultimately, this cluster will only be used by ci.jenkins.io, so we are just thinking about the best way to configure it. We also have to create an account that ci.jenkins.io can use to connect to that cluster. So the cluster is running, but we are fine-tuning the configuration.

The next topic is about the ingress controllers. This one is related to the main cluster we have right now. We've been using the nginx ingress controller: it's the service that receives the HTTP requests and forwards them to the different websites, like the Javadoc site, the plugin site, the main website, and so on. (Sorry, my VPN seems broken; let's look at that afterwards.) Those nginx ingress controllers are still deployed with Helm v2, and we now have to move to Helm v3, so we took the opportunity to deploy Traefik and experiment with it as a replacement. The plan is to switch the private services first, like infra.ci and release.ci, wait until everything works as expected, and then start switching the public services. There is only one that is tricky, because it's a stateful application; but in the end it's only a DNS configuration change per service (a sketch of what that switch could look like follows this exchange). So we have to plan the work. Any questions? Then I propose to switch to the next topic, which was brought by Mark: you wanted to talk about the on-call experience improvements.

Yeah, I just wanted to alert people that I intend to schedule a session, Olivier, with you and me and probably Damien, to work through what it takes to make the on-call PagerDuty experience better. Yesterday I again got five or six or seven alerts about weird response times, and I didn't quite know what to do with them; I'd like to learn how to handle that better. So I'm just going to schedule a session. I assume we can do it in public and allow anyone who wants to join, but I would like to understand what tuning we need to do to make that experience better for me.

Okay, sure, let's plan that.

Could you add me in on that? I also got the same sort of alerts over the weekend.

Okay, let's plan this session; I'll do that. Just for context: we have monitoring in place to detect many different issues, and one of them is a website getting slow. What we see more and more often at the moment is a situation where a website is slow, we get an alert, and then the issue resolves by itself after 15 minutes. So we get a lot of notifications saying that some services are slow even though everything is working normally. We have to fine-tune the alert thresholds.

Right. And for me it's a great excuse to learn more about Datadog, to learn how we can use it effectively.
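Going back to the ingress migration for a moment, switching one service at a time could look roughly like the sketch below: a standard Kubernetes Ingress whose class is flipped from nginx to Traefik, followed by the DNS change pointing the host at the Traefik load balancer. The host is real, but the resource and service names are assumptions:

```yaml
# Hypothetical Ingress: moving one site from the nginx controller to Traefik
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: plugin-site                          # assumed resource name
  annotations:
    kubernetes.io/ingress.class: traefik     # was "nginx" before the switch
spec:
  rules:
    - host: plugins.jenkins.io
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: plugin-site            # assumed service name
                port:
                  number: 80
```

Once an Ingress like this is served by Traefik, the user-visible step is the DNS record change mentioned above, which is what makes the stateful application the tricky case.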
And back to the alerts: how do we adjust them when we want to? So I'll just be scheduling that session.

The good thing is that it's mainly about modifying our Terraform code, so it should be easy.

Next, last topic: the Jenkins release. Jenkins 2.277.1 will be released tomorrow, and I think everything will go well; I'm always happy to see that the release environments work. Today's weekly release went great: it passed the checklist and everything worked. Tomorrow is a much bigger deal, and I haven't gone through the release checklist in detail yet; I'll do that after the meetings today, so about four or five hours from now. Okay.

I don't have any other topic on the agenda, so I propose to add one last one, which arrived just five minutes before the meeting; that's why I hadn't put it in the notes. I have feedback about two cloud companies. The company named Outscale, which provides an EC2-compliant cloud, is in the end not willing to sponsor us; their reply literally said, in French, "however, keep using free software and not giving back". Yeah, that's what I've been told. And I have feedback from Scaleway: they are okay, but they need us to help them define the amount of resources we plan to use on a Kubernetes cluster for the agents. Since I have exactly the same requirement in order to size the EKS cluster nodes correctly, that should be an interesting topic. If we can schedule a discussion, a meeting, or a ticket with this information, I'm sure there are already some documentation and metrics available. I need help on that topic, to be sure how many machines we are going to pay for on AWS and how much we can ask from Scaleway as well.

Okay, that would be a nice exercise. I was just going to mention that switching infra.ci and release.ci over to the new set of built Docker images is still kind of blocked on the versioning work, which is itself blocked on JenkinsPipelineUnit.

Okay, so a release of the JenkinsPipelineUnit project would be great. Could you put some links here in the document?

Yeah. The changes have been merged, but I can't use them until there is a release.

Okay, that was the thing I wanted to check, Gareth: I saw that it had been merged, but a merge is not enough, we need a new release of it.

Unless anyone knows a way of pulling a snapshot of that project, or we could build it manually and upload it somewhere ourselves.

Had you asked for a release?

I have asked for a release, but I asked for a different one in another issue, and I did get feedback on that one: it got assigned to somebody else to do it.

Okay, so you've got a specific issue. There are two people who normally handle releases; one replied yesterday saying he would do it. I guess I should just keep chasing.

Could you put the links to those PRs and issues in the Google notes, so other people can follow?

Is this stuff quite critical to what we need to do? Do we have the ability to cut our own release on some of these projects? I think we can depend on a pipeline library by a SHA-1.

At least not this one: this isn't a pipeline library, this is a Maven project. It's just like any other library, and we're not the maintainers of that project.

Got it, so it's delivering a jar file, Tim, not just some Groovy code.
Yeah, someone has written this library and they are maintaining it, and it's not something that we have control over. We just need to nudge them; the more we nudge, the quicker it gets done, normally. They've had a couple of nudges; give them another couple of days.

Is there any workaround you can do in the code in the meantime, or is that not possible, or too difficult?

I mean, I could fork it and publish it under a different group or artifact ID, or something like that.

I wouldn't bother with that. But is there really nothing? Can you inline something into the library temporarily, or even just comment out the part of the tests that fails? I'm not sure specifically how bad it is. And just note that once the library is bumped you uncomment it.

I could, but I would be removing all the existing tests we have for that part of the functionality.

Right, which feels a bit too dirty.

Just coming back to it: I think Damien added a lot of those as well, so I don't want to remove them. Give them a couple of days, and then maybe email them if you need to; that normally gets people moving. Okay.

Any last things you want to bring up? We are over time, so I propose to stop here and continue the discussion on IRC. Thanks, everybody, for your time, and see you. Bye bye.