OK everyone, welcome to the weekly infrastructure meeting. We are the 31st of May 2022. Today we have myself, Damien Duportal, with Bruno, Stéphane Merle, Tim Jacomb and Hervé Le Meur. Let's get started with announcements.

It seems the weekly release has been delayed due to test issues in the release process. I haven't checked in more detail, but Alexander mentioned it on the IRC release channel, so it might be delayed by a few hours. I haven't seen anything related to the infrastructure in that area or in the error messages: at first sight, no agent issue, no controller issue. The delay seems to come purely from test issues.

Second announcement: the outages on the virtual machine hosting the update center and the packages for Jenkins core seem to be finally over. It's been 24 hours without incident on that machine. So, as a summary: when you try to automate a virtual machine that hasn't been automated for two years, read carefully the documentation of all the flags of the Puppet agent. The flag --test does NOT do a test run of the Puppet agent: it applies the configuration. If you want a dry run, you need to also add the --noop flag. If I had known that, I would have avoided a lot of pain during the past days. Anyway, it's fixed now on updates.jenkins.io and pkg.jenkins.io, so sorry everyone. There are details, and a post-mortem will be done, on the issue TS2888. Many thanks to the external contributor and to the team for the support, and to everyone involved in solving that issue.

Third announcement — we already have an issue to track it, but it's important: the certificate of the update center expires in two weeks. Five hours ago, the jobs in charge of generating the JSON started failing due to that error, while the crawler has been failing for two or three weeks. So Stéphane and I started on the renewal of that certificate.
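The flag lesson above can be written down as a small guarded sketch (the guard is only there so the snippet is safe to run on a machine that has no Puppet installed; the host itself is hypothetical):

```shell
# Lesson from the outage, as a guarded sketch.
# 'puppet agent --test' APPLIES the catalog despite its name:
# it only enables one-shot verbose foreground mode.
# Adding --noop turns it into a real dry run that reports the
# pending changes without applying any of them.
if command -v puppet >/dev/null 2>&1; then
  puppet agent --test --noop   # dry run: report, do not apply
  RESULT="dry run executed"
else
  RESULT="puppet not installed; flags shown for reference only"
fi
echo "$RESULT"
```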
There are things we don't understand in the private repository hosting that; the README might not be up to date. So we need help from Daniel or Olivier to understand what's going on. The core of the issue is that, since last year, there seem to be two certificates: one used by the crawler job and one by the update-center job on trusted.ci. There are two zips. The update-center keys and the associated certs files are the same, but the full certificate chains associated with these keys seem to differ between the two jobs. We are not sure which one is the correct one, nor whether we could generate only one. We would need help today — or rather yesterday — because it's impacting plugin releases: we haven't had any plugin release since it started failing. I don't know if you are at ease with that part; I assume not, but maybe you tried at some point. The README is good, as Stéphane can confirm: we have the OpenSSL commands. The only issue is being sure which certificate is which and which key is which, which is complicated. I think only Olivier, Kohsuke and Tyler touched it, and Kohsuke hasn't touched that repository since 2017, so Daniel and Olivier are our last chances. Without an answer, I will try to generate a certificate and run an update center locally on my machine. That will take a few hours, but at least I will be sure: we can try the three of them and see which one is working.

— You shouldn't need a full update-center run: you can filter it to just a single plugin or something. — Oh yes, true. Or only the changes from the last hours. — Yes, correct. It should take a minute or less. Then you can point Jenkins at your file — serve it from a local web server or even a file path — and check that it accepts the signature. You can try a dodgy one and then a working one. — OK.

That's all for the announcements on my side. Do you have other announcements, folks? Nope. OK.
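One standard OpenSSL way to tell which key matches which certificate is to compare the public-key modulus of each file. A sketch — the self-signed pair generated here is only a stand-in for the real update-center key and certificate, and it assumes RSA keys:

```shell
# Demo key/certificate pair standing in for the real files
# (assumption: RSA keys, as the README's OpenSSL commands suggest).
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=demo" \
  -keyout demo.key -out demo.crt -days 1 2>/dev/null

# A key and a certificate belong together iff their moduli match.
KEY_MOD=$(openssl rsa  -noout -modulus -in demo.key | openssl md5)
CRT_MOD=$(openssl x509 -noout -modulus -in demo.crt | openssl md5)

if [ "$KEY_MOD" = "$CRT_MOD" ]; then
  echo "demo.crt matches demo.key"
else
  echo "mismatch: this certificate was not issued for this key"
fi
```

Running the same two commands against each key and certificate found in the two zips should make it unambiguous which chain belongs to which key.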
Let's get started. I don't know who updated the links, but many thanks. Step one: what did we finish during the past iteration?

Deprecation of the on-call Datadog monitors. That was a PagerDuty/Datadog issue — thanks Hervé for that part. It appears we were using a deprecated feature instead of the new one which integrates Datadog with PagerDuty. So it was only a matter of removing the on-call annotation from the Terraform definitions that manage the Datadog monitors. I can guarantee that we saw issues on the update center during the past days and we received the emails, so that means it worked perfectly.

Access to the Jenkins VPN for Vincent. That access wasn't required in the end, because what he needed was to trigger releases of the remoting component; it turns out it's not required for that, and Mark took care of everything that was missing. Still, since Vincent is one of the main maintainers of the Kubernetes plugin, it's quite useful to have him able to access our infrastructure, especially if we need to test our Kubernetes elements, because the VPN gives him access to ci.jenkins.io.

Certificate expired for pkg.jenkins.io. It's one part of the outage I mentioned earlier. Re-enabling Puppet last Friday on that virtual machine — where it hadn't run for two years — broke the certificates. There was a quick fix, but it took a few hours to get to it. So that's one of the three outages on that machine. Many thanks to the contributor who notified us.

Google domain verification. On that one I saw a message from Gavin, but I didn't follow up since it's closed. The domain stories.jenkins.io has been verified. There was an issue with the CNAME: you cannot have both a CNAME and a TXT record, if I understand correctly. I don't know which method Gavin used in the end.
And I confirmed that I received access to the Google console, and I can grant access. Whoever is interested or should have access, raise your hand. Yep, Tim. OK, so Tim and I, and I will add Mark as a backup. So for that one, it's closed. For the person taking notes: can I ask you to add this to the to-do of the next milestone — "add access to the Google console for Mark and Tim" — please, so I don't forget?

Update center unstable. Stéphane and I worked on that; I try to always work with someone so as not to be the only person fixing an incident. Stéphane, thanks for the help around the Apache configuration. We had some head-scratching about how that piece of software actually works. The answer is: we still don't know, but we tried to guess, and we guessed right, because the machine seems fine and we don't have alerts anymore. So not only did we fix the issue, we improved on the situation of the past four months, where we regularly had alerts on updates.jenkins.io. The main part is keeping enough worker processes that, after five or six hours, are still able to serve requests without ending up in a weird state. It's a matter of balancing the keepalive connections, the number of threads, the number of processes, the number of simultaneous connections, and so on. But now it seems we have some margin.

Just a note: that virtual machine seems to be pinned to an EC2 hypervisor which is in a poor state — we see a lot of CPU steal on that machine. And apart from changing to the instance size just above or below, we cannot change the instance family or move to another hypervisor. I assume that machine is really, really old: it's using an EC2 instance family from the 2013 generation, so we cannot benefit from the latest network, and we are not able to migrate it. The only solution is:
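The balancing act described above — keepalive, threads, processes, simultaneous connections — lives in the Apache MPM configuration. A hedged sketch of the kind of knobs involved (these are real event-MPM directives, but the values are illustrative, not the ones deployed on updates.jenkins.io):

```apache
# Illustrative event-MPM tuning; values are NOT the production ones.
<IfModule mpm_event_module>
    ServerLimit              8      # max child processes
    ThreadsPerChild          64     # worker threads per process
    MaxRequestWorkers        512    # = ServerLimit * ThreadsPerChild
    MaxConnectionsPerChild   10000  # recycle workers to avoid "weird states"
</IfModule>
KeepAlive            On
KeepAliveTimeout     5              # a short timeout frees workers faster
MaxKeepAliveRequests 100
```

The trade-off is that more simultaneous connections need more threads, while a long KeepAliveTimeout keeps idle clients pinned to workers, which is one way a server degrades after a few hours of traffic.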
to create a snapshot, start a brand-new machine, provision it, and move everything. And if we have to do that, it's better to do it on Oracle.

And finally: the issue about the GitHub org email. Can you explain, Hervé? I'm not sure I followed. — So, there is a work-in-progress issue where we were asked to verify our domain, and to get the domain approved I had to remove the mailing-list email address, because it's in the googlegroups.com domain, so I can't verify it. I put this email in the GitHub org profile instead, so it's still accessible; that public email was just for displaying on top of the org main page. — Nice. Clear. — Yeah, it was something I had wanted to do for a while, so it was an opportunity to do it. It can probably still be improved, but it's a first pass. — OK, seems clear to me. Any questions? OK.

Let's go to the work in progress. Approval of the jenkins.io domain on the GitHub organization. I understand it's done for jenkins-infra; now it has to be done on jenkinsci, is that correct? — No, it's done on both jenkins-infra and jenkinsci. I've pinged Daniel via the security team to ask if they want to do the same on the jenkins-cert organization. — OK. So I assume we keep it assigned to you, Hervé, until the answer. Should we stop tracking it? — Yeah, we can stop tracking it. — We stop tracking it, so I can clear the milestone and the assignee, and if there is an answer you will be notified. OK. Thanks a lot.

Crawler build is failing on trusted.ci. That's the main priority subject in the short term: we need to generate a new certificate for the update center, and make sure the update-center JSON is working too. Stéphane, can I ask you to add a comment on the issue saying that the JSON files are failing as well? — Yes. — So we can aggregate comments if needed. And of course, Stéphane and I, we keep working on that one.
Now it's the top priority, and we need help from Daniel or Olivier. Is there any question on that part? OK, let's continue.

Deprecating TMG in favor of Docker. It's work in progress. I have some tests which are failing, but it will be fixed quickly, and then I will do the other part, which is removing the references in the pipeline library and some other repos. — OK. Your comment where you propose to create a separate task: no problem with that, it's a good idea, so we can have completed and delivered tasks. So, almost there — is my understanding correct? OK, I've added it to the next milestone, assigned to you. Any objection? OK.

Migrate updates.jenkins.io to another cloud. That's what I mentioned earlier. I didn't have time to work on that one; I hope I'll be able to do it after the update-center certificate work. The goal is to spawn a temporary first machine on Oracle and start adding it to the existing flow, but without serving data to end users for now. I'll make sure that each time the update-center JSON generation is done, it uploads the files not only to the current machine but also to the new Oracle machine. There might be a proposal on that issue — I haven't written it yet — to spawn two machines to have a kind of HA. Until now, each time we operated on the Apache server and had to restart it to apply a setting, we were breaking all the Jenkins installations that were trying to get updates at that moment. So the goal is a poor person's HA: one data volume mounted on both virtual machines, each virtual machine being only a tiny one running Apache. That will let us spread the load between two machines, and when we have to do maintenance, we don't break the update center for everyone. Best server ever. I don't know if you have any question about that topic. One, two, three... OK, so we continue working on that one.

Import and manage Oracle Cloud resources.
Given that it's been at least two iterations that we've been carrying both issues, I propose to not put the import/manage one on the next milestone. If anyone has time and is willing to do it, go ahead, but I don't think I will have the time to get to this one. So I propose we delay it at least one iteration. — Sounds good.

OK, auto-notify people based on service routing rules. Should we keep it for the next iteration? — Yeah, I don't have time, but I would like to keep it on. — OK.

jenkins-infra packer images: build our own Windows images. Are we almost there? — Yeah, almost there. I now have the tagging working for the Windows image; I had some problems with the credential manager installed by default on Windows. I also have work in progress to be able to build multiple images in parallel. Currently there is nothing preventing a conflict when multiple images are built in parallel, each of them taking the last tag and generating an exception, so I'll add a parameter to put the image name in the tag. — OK, nice work, because it wasn't an easy one. That tooling is one beast, Windows is another, and Git is another. Apache, at least, is easy: you know from the start that you won't understand it, while with Windows you think you will understand it and in fact you don't. Thanks a lot. Any question in that area? Shall we keep it on the next iteration? OK, cool.

— I have one quick question on something else. I raised an issue on the rate limiting on ci.jenkins.io for the jenkinsci Docker builds. Not sure if anyone has ideas on other things we can do there. — Yes. Here we are waiting for Docker Inc. to promote our organization from "organization" to "team", because they already promoted us to the open-source program. We thought that would change the rate limit, but in fact no: the images we produce under the organization are not counted in the API rate limits.
So if you pull an image from jenkinsci or jenkins-infra, it doesn't count; that action helps us secure ci.jenkins.io. However, for the Docker image builds we pull official images, which are counted in the API rate limits — that's the biggest bandwidth they have to pay for, hence the limit. I had a discussion with them two weeks ago; they said they would have to wait until after the long weekend, because most of the team is in Europe and they were all out of office last week, but they are going to grant us that, so it should be OK for us. Right now we have around 200 requests per six-hour time window, and we are rate limited, which means that each time Dependabot opens a few pull requests, we are done. — Yeah, I tried: I followed the instructions in their documentation to get the current rate limit and came back with nothing. The only thing I got back was a rate-limit source header; there was no rate-limit-remaining or anything. — I don't know for today, but this weekend — Friday or Saturday — we had a first batch when Basil opened all the dependency pull requests on the Docker image. I checked and I saw that it was at zero, and we were rate limited for a few hours. That's why I left a comment — that's the second or third time I've done it — and they confirmed on their side.

So yeah, the alternative is to think about bringing in a cache. The thing is that it will be a service, and we need to restrict that service to avoid it being spammed or used for mining bitcoins. It has to be IP-restricted to the IP ranges of both Azure and EC2 where we spawn the virtual machines doing the Docker builds. So we would need to spawn a registry in pull-through caching mode. — Yeah. OK, I'm adding that to the next milestone. I will ask Hervé — he's back from holiday today. Another Hervé, not this one: Hervé from Docker, because Hervé's manager is on holiday.
I'm adding it and taking it, since I'm the person in contact; I will ping them. We should have something, but if we don't have anything from them by the end of the week, we'll try the Docker registry cache. — Thanks. Thanks for raising it. Added it to the next milestone.

Mirrors: I can close it. We did everything; there isn't any MirrorBrain stuff left, and the machine is OK. I will just double-check that I didn't forget anything, but it should be closeable. So I keep it on the current milestone and it will be closed. Any question or doubt on that issue? We haven't heard of any issue on the mirror part from end users — I don't know if you did. — I saw one last week, I think, but I told them and they corrected it; it was on their side, they weren't able to handle the 308 status code. — Oh yes. Was it the OpenIndiana package distribution? — Correct. — Right. Some Linux distributions package Jenkins themselves, and when they try to download the WAR from the mirrors, sometimes they have a tool that doesn't follow the redirection. Most of the time they get the full link from get.jenkins.io, and ideally they can use the archives for that, since it's not something done often. — OK, might be good to add a warning or some documentation somewhere.

Mirror in Singapore. We sent them the information, so now there is a runbook explaining all the elements. I need you folks to check the runbook... no, sorry, my bad: the runbook is already merged. But as Tim proposed, this should be public documentation, so we should add that part to the public pages — that way we don't have to pull information out of private documentation to send to people, and there will be a public link for the future. I've opened a separate issue for that. Now, I don't know if they answered that email; I didn't see any answer last Friday. — I haven't seen one either. — OK, I'm watching that; there is no action expected on our side.
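The 308 problem is easy to reproduce: get.jenkins.io answers a download request with a `308 Permanent Redirect` toward a mirror, and any client that does not follow redirects ends up with no file. A sketch (needs network access; the stable-WAR path is used purely as an example URL):

```shell
URL="https://get.jenkins.io/war-stable/latest/jenkins.war"

# Without -L: the client stops at the redirect response itself.
FIRST=$(curl -s -o /dev/null -w '%{http_code}' -I "$URL")
echo "without -L: HTTP $FIRST"

# With -L: curl follows the redirect chain to an actual mirror.
FINAL=$(curl -s -o /dev/null -w '%{http_code}' -I -L "$URL")
echo "with    -L: HTTP $FINAL"
```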
They have to run the actions on their side and send us the URL for their Jenkins mirror. So I propose we clear the milestone, wait for them, and I will ping them in two weeks. — You also suggested adding an issue type on the help desk for mirrors. — Oh yeah. Are there many mirror requests like that? — Very few. — Then I'm not sure an issue type just for this is worth it; it would clutter — well, not clutter, but just for that, I'm not sure. — Is there anything we can do in the upcoming milestone? I've added it to infra-team-sync-next, just in case someone has time. I'm clearing the milestone for the mirror itself; anyone willing to pick up the issue is welcome.

DigitalOcean. I need to set up an appointment with them to discuss. The goal, as explained last week, is to prepare different scenarios of sponsorship needs; based on that they will renew or not, and we will negotiate. They seem quite positive about continuing with us — that's good news. Now we just have to do it, so I'm adding you and me on this one; we just need to book the appointment. I'm adding it to the next milestone. — Looks good. — Any question on this one? OK.

Permanent redirects for stories. Isn't that one finished? — Oh no. We have a redirector service created by Stéphane on the Kubernetes cluster, which works very well; it has been updated to keep the context path. However, it seems Mark and Alyssa didn't move the CNAME to our infra, so we still need to configure the redirection. It partially works — it's improved — but we should take over the domain name from GoDaddy. — Yeah. The link currently doesn't redirect at all. — Only in HTTP, and it redirects you to the root. — Yeah, I saw the same. I'm removing the milestone and putting it in team-sync-next, because Mark is out of office and he's clearly the only person who can act on this one.
Stéphane, I'm removing you from this one and adding Mark. Sounds good to everyone? Perfect. I'll try to contact Alyssa in private this week, just in case.

Finally, ci.jenkins.io. That's an old issue, from the outage a month and a half ago. I haven't heard any feedback from anyone on the post-mortem, so I'm going to proceed this week to publish the post-mortem and close the issue — unless someone disagrees, but then I need that feedback. — I said it was great; that was my feedback. — Oh, I missed it then, sorry. So nothing more to do here.

Milestones: infra-team-sync-next. Are there any issues in infra-team-sync-next that you want to bring up and maybe prioritize? — Yep. I've been asked to bring up the upgrade to Kubernetes 1.22, but I'm not quite sure it's a priority. It can be done a little bit every day rather than in one big block. — Yeah. So, proposal for this week: I'm adding it to the next milestone. Finish the upgrade of kubectl in the Dockerfile — that should have been done? — Nope, it's not done. — Let me rephrase: is it in production yet? — No, it's not. — So my proposal is: finish upgrading kubectl in the image, because it's required — I'm sure you won't start the cluster upgrade without it — and then you can start to prepare: read the changelogs and prepare the upgrade. Is that OK? And next week we will discuss the timeline, depending on the outage of the week. — Can you just repeat what you expect for the Kubernetes upgrade itself? — Nothing yet; you just have to update the issue. — Good. — Looks good. I'm adding it to the next milestone then. Are there other issues you think you could work on next, or should work on instead of another? — I'd prefer to finish the ones in progress. — Yep, no problem.

That's all for me then. I'm going to finish updating the notes before publishing them.
Is there something else you want to say before I stop the recording? OK, let me stop the recording — stop screen sharing first, then stop recording.