Okay, welcome everyone to the Jenkins infrastructure public team meeting. Today we are Tuesday, 10th of May, 2022. As usual, I'm adding the collaborative notes that will be published on community.jenkins.io and on GitHub. This session is recorded as well. I've just added, on the Zoom link and on IRC, the link to the collaborative notes. Today we have Damien Duportal (myself), Hervé Le Meur, Mark Waite and Stéphane Merle.

Let's start with the announcements. First of all, last week the weekly release failed to be released because the credential on Azure had expired. That has been fixed. Thanks, Mark, for re-triggering the release a second time; the release was successful a few hours after. For information, that credential is now under the management of infrastructure as code, so everyone here should be able, and has the rights, to update it in the future. However, we have created a new issue on the helpdesk to think and work on something that will watch the expiration of different kinds of credentials, from certificates to API tokens. That's not the subject here, but we have that issue, so if you have ideas or if you want to review it, the link has been added to the notes.

We have a weekly today. I wanted to check the status, Mark. I did check the status: it was stopped by a restart of the Jenkins controller and was waiting to resume. There's some bug in the declarative pipeline that's causing it not to correctly resume, so I just stopped it and restarted it. It was about 10 minutes ago that it started, so roughly two hours from now it will have completed the build phase and start the packaging phase. Okay, I really need to either fix that problem or stop merging pull requests on the Jenkins LDS token incubation... I don't think either of those is the... yeah. I think there's certainly a bug somewhere, and it would be nice to fix that bug, because these should be resumable, right? They should be durable, but at least it's easy to restart. May I ask one of you to open an issue while we're speaking about that? Or I will do it at the end, but we need an issue to track that element. Is a helpdesk issue okay, or what would you like? Yes. Okay, I'll do it. Great. Okay, so we'll monitor it. Is there anything else about the past or current weekly, Mark, or the others? No, I expect it to proceed; it looks fine. I've got to review the changelog and be sure that the changelog accurately describes all the changes, but that's just the usual work. Okay.

Next announcement: next week there will be a security advisory. So please don't merge anything. I'm looking at you, Damien Duportal. Yes, I'm speaking to myself: do not merge anything on the Puppet side that could impact trusted.ci, just because it might be used for some components. But since this is mostly plugins, we should not have anything to do on trusted.ci or ci.jenkins.io, and the security team will take care, as usual, of installing the released plugins on ci.jenkins.io and restarting it. We just have to be careful not to change anything. The weekly should proceed and be treated like every week. Is there any other question on that advisory? Okay. I will take care of sending a reminder in IRC, early morning US time, during that day, just to remind everyone. Other announcements? Nope.

Okay, so let's start with the work done during the past weekly iteration.
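For the notes, on the credential-expiration watcher idea above: a minimal Terraform sketch, assuming the hashicorp/azuread and hashicorp/time providers, of one way to rotate an Azure service principal secret on a schedule instead of letting a manually created credential expire unnoticed. The application reference and the rotation period are illustrative, not our actual setup.

```hcl
# Sketch only: rotate a service principal secret every 30 days via IaC.
resource "time_rotating" "release_credential" {
  rotation_days = 30 # illustrative period
}

resource "azuread_application_password" "release_credential" {
  application_object_id = azuread_application.release.object_id # hypothetical app
  display_name          = "weekly-release"

  # Re-creating this resource whenever time_rotating advances rotates the secret.
  rotate_when_changed = {
    rotation = time_rotating.release_credential.id
  }
}
```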
We had an issue with repo.jenkins.io that has been fixed; that was an issue on the Google Cloud side. That was a great opportunity for Stéphane to test the work on the Datadog metrics that we'll speak about later. It showed us that some of our current Datadog monitors are not working as expected: they are always green. So Stéphane is working on a fix. I don't have anything else, apart from the fact that we are not able to have status.jenkins.io automatically updated when there is an issue on the JFrog side. We still have to rely on the good faith of a contributor; that time it was Tim, so thanks for that. That issue was not caused by us but by JFrog, and it took a few hours before their platform was back. So right now no action item, but yes, there is clearly room for improvement in automating the status.jenkins.io website. Let's get started by fixing the monitoring for repo.jenkins.io, and then we can take a decision once we have monitoring that works as expected.

We have one fixed issue that I won't treat here, as it wasn't in the correct issue tracker. All the issues around the infra reports and the repository permissions updater have been fixed. No issue, unless I'm mistaken. So that's fixed. Don't hesitate to interrupt me if I forgot something.

We had the LTS release last week, which has been applied to all the Jenkins controllers that we manage that are using the LTS line: ci.jenkins.io, trusted.ci, cert.ci and release.ci. No issue whatsoever; almost everything was done automatically once the issue was opened by Alex. So, good for automated systems.

We granted Basil access to some parts of the infrastructure. The goal is to give him enough access and privileges so he can help us diagnose the issues on ci.jenkins.io from the recent past outages. We ran a post-mortem that I still have to publish. Basil will need access to EKS, the Amazon cluster, in order to be able to diagnose, so we have to check with him if he can access the AWS CloudBees account, which is the actual account running all our machines. We'll take care of that part; that's why I haven't created a public issue, because it's the CloudBees account, so it's another process than the usual one. Outside of this, Basil confirmed that he has access to everything and has everything required.

The issue about the Azure service principal: you have all the details in it, and now we can use the infrastructure as code to rotate the Azure credential, or at least see where it is. Thanks, Tim. On the initial issue about tracking expirations, it sounds like we might have really nice solutions to avoid credentials at all in certain cases.

And finally, there was... I'm not sure about the missing permission, I forgot... okay, Tim, you fixed that one: Alex had to be allowed into the correct group to access release.ci to perform the LTS release last Wednesday. Thanks, Tim, for fixing that. Seems like Keycloak was terribly slow, which was horrible. Yeah, still not sure if it's related to LDAP, which is OOM-killed a lot. Okay, did we create an issue? Yeah, it's on Keycloak. Okay, thanks, Tim, for handling that, and for having the patience to handle the frustration created by a slow Keycloak; that can be terrible.

So these are the main tasks that we completely closed and can now forget about. Now the work in progress, unless I forgot any closed tasks. Nope, okay.

Blue Ocean: replacing Blue Ocean in the default display URL. We tried to merge it after asking questions on the ticket and by email; thanks to everyone who answered.
But as reported by some users, when you enable the classic UI, it seems that it's not using the redirect endpoint on the Jenkins controller, which means that it changes the permalinks on the GitHub checks it generates. Which means you cannot ask a user to set their preference to the classic UI or Blue Ocean, and you cannot have something that is guaranteed to work in the future. So we rolled back the change, as outlined, and I haven't checked further. It seems like it's something related to the display URL plugin; is that correct, Tim? Yeah, it looks like a bug or a missing feature: the redirect was not implemented for the classic UI. Well, if you force the setting with the system property, it doesn't seem to redirect. So that means it's exiting our scope as infra people. I'm not sure what the usual process is; we might want to open an issue on the plugin tracker, right? I would say, yeah, open an issue with the plugin and then either block or close this ticket until that's resolved. Okay. I propose to ask Kalle Olavi Niemitalo, because that person clearly seems to have a clear understanding of what the problem is, so I will rely on them when opening an issue on the plugin tracker. I propose that we move that ticket outside of the milestone until something new comes. Sounds okay for everyone? Because I don't see any other action that we could take. Yes.

Next one: there are still two open tasks about ci.jenkins.io. There were two outages two weeks ago. We ran a post-mortem last Friday with Basil; I have to finish the notes and publish them. Basil has everything that he needs except the EKS cluster, and he's working around it. The summary is that there isn't any action expected from the infra team; Basil will open issues on the helpdesk when he needs something, so we have to wait for him. The idea is that there are different clues, but he's working on trying to reproduce and better apprehend the issues, so he can share some reproducible tests with the plugin or Jenkins core parts that are creating issues. We identified that the root cause that triggered all the bugs and unwanted behaviors was related to the Docker API rate limit. So the only thing we can do on our side is knowing that if we are again API rate limited on container pulls on ci.jenkins.io, then we only have to wait and let the controller deal with the build queue; we just have to be patient. So we are waiting for Basil, no ETA, because he has a lot of other stuff to do. The thing is, if he needs something, we have to provide it as quickly as possible to unblock him. That's the only takeaway. Thanks to the work of Stéphane, who split the accounts for push and pull two weeks ago, we should delay hitting the threshold of the API rate limit. I got news from Docker that Jenkins is now part of their open source program. It's DockerCon this week, so they have in mind that we need that; they should be able to add our accounts with an almost unlimited API rate limit in the upcoming days, or eventually weeks. Worst case end of May, they guaranteed.

Datadog: there are two subjects. The first one is adding a new monitoring probe for repo.jenkins-ci.org. That's Stéphane on that, work in progress. We thought it could be easy to add, but in fact we uncovered that all the Datadog synthetic tests that we have defined through Terraform are always answering that everything is working as expected. The synthetics were created with the cheapest probe. With the browser type. Type, sorry.
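For the notes, a minimal sketch of what one of the recreated probes could look like in Terraform, assuming the DataDog/datadog provider: an HTTP-type synthetic test with an explicit assertion, so it can actually fail. The URL, location and frequency are illustrative, not our actual configuration.

```hcl
# Sketch: an HTTP synthetic test that fails when repo.jenkins-ci.org
# does not answer 200, instead of a probe that is always green.
resource "datadog_synthetics_test" "repo_jenkins_ci" {
  name      = "repo.jenkins-ci.org availability" # illustrative name
  type      = "api"
  subtype   = "http"
  status    = "live"
  message   = "repo.jenkins-ci.org is not answering as expected."
  locations = ["aws:eu-central-1"] # illustrative location

  request_definition {
    method = "GET"
    url    = "https://repo.jenkins-ci.org"
  }

  # The assertion is the part that matters: without a real assertion,
  # the probe reports success no matter what.
  assertion {
    type     = "statusCode"
    operator = "is"
    target   = "200"
  }

  options_list {
    tick_every = 300 # run every 5 minutes
  }
}
```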
So we have to recreate these probes, change them to the type that Stéphane selected and tested on a single one. And we might also have a campaign of migrating some of the probes that are run by the Datadog container on our Kubernetes cluster back to the API. I'm still not sure why we run such tests, but we have some work around cleaning up the Datadog probes before being able to have something working.

Deprecation of phone calls: Stéphane, I assume you didn't have time to spend on that one. No, sorry. Is it okay if someone else takes it, since it will start to be quite urgent? With pleasure. Is there someone volunteering on that one? Yeah, no problem. Okay, you want to take it, Hervé? Yeah. Cool, thanks. I'm available if you need any indication or precision on that part. Just know that both of you, Stéphane and Hervé, will have to work on the same Terraform configuration, so that might have some impacts. It can be fun.

Migrating rating.jenkins.io to Azure: the shift to Kubernetes went very well, great job Stéphane. The only two items left to close that issue: we cleaned the virtual machine on AWS but still have to clean the old managed database on Amazon, and the documentation updates. We are almost there, at 90%, but there are still some cleanup tasks before being able to finish. Is it okay for you, Stéphane, to keep that task for this iteration? Yes, please. Oh, sorry. Thanks.

We still have the mirror in Singapore request. Hervé and I are working to finish collecting what we need. We failed to find documentation about what to tell and ask the people proposing a mirror: how much space do they need, what do they need, where are the instructions for them to start mirroring? I'm sure there have been emails or knowledge, but we weren't able to find something in the runbooks. We only found the blog post from Tyler saying that the infra team will give all the recommendations by mail. You just have to check the jenkins-infra mailing list; it's in there, in previous posts by Olivier. That's the idea. So thanks for confirming, Tim, that we are going in the correct direction. The status of that task for the upcoming iteration is: Hervé and I are going to search the mailing lists and repositories, we are going to write a runbook with the information, and then we can contact Fred back on the initial email to tell them: okay, you need that, you have to do that, and let's proceed. Yeah, I mean, you should generally just be able to basically copy Olivier's last response to the last request. That's confirmed, thanks a lot.

Hervé's work in progress for building our own Windows images, the Windows Docker images, on infra.ci. It's mainly two areas. First, providing a Windows virtual machine image with all the tooling, with the same feature set as we have on Linux, to be able to build the Dockerfiles and publish them; that's on the Packer image templates. The second part is the pipeline library that we are using: we need feature parity, so Hervé is working on that part, which is mainly being able to execute the same commands between PowerShell and Linux Bash.
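For the notes, a rough sketch of what a Windows builder in the Packer image templates could look like, assuming the azure-arm builder and Packer's HCL2 format; the image SKU, names and the provisioning commands are illustrative placeholders, not the actual template.

```hcl
# Sketch: a Windows Server image built with the azure-arm Packer builder,
# provisioned over WinRM with PowerShell (the Linux templates use Bash).
source "azure-arm" "windows_agent" {
  os_type         = "Windows"
  image_publisher = "MicrosoftWindowsServer"
  image_offer     = "WindowsServer"
  image_sku       = "2019-Datacenter-with-Containers" # illustrative SKU
  location        = "East US"
  vm_size         = "Standard_D4s_v3"

  communicator   = "winrm"
  winrm_use_ssl  = true
  winrm_insecure = true
  winrm_username = "packer"
  # Azure credentials (client_id, client_secret, subscription_id...) omitted.

  managed_image_name                = "windows-agent"
  managed_image_resource_group_name = "packer-images" # hypothetical group
}

build {
  sources = ["source.azure-arm.windows_agent"]

  # Same tooling as the Linux agents, installed with PowerShell instead of Bash.
  provisioner "powershell" {
    inline = [
      "choco install -y git docker-cli", # illustrative tool list
    ]
  }
}
```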
mirrors.jenkins.io: yesterday we switched the main consumers from the mirrors.jenkins.io domain to get.jenkins.io. We also changed to HTTPS. I haven't seen an issue until now, but either I might have missed it or it was silent. The goal is to track all the usages that we could have of the former mirrors.jenkins.io domain. I'm working on a blog post that I would expect to be published this week, if it's okay, to tell that mirrors.jenkins.io will be forced to HTTPS and to explain that we will consolidate all the mirror infrastructure using the new one. Everything has been tested on Kubernetes: if you point your DNS for mirrors.jenkins.io to the IP of the public AKS cluster, it works very well. The goal is then being able to remove all the MirrorBrain stuff. So you're keeping the old domain, not just turning it off? No, because we also serve the update center on that machine, and it's used for packaging. I mean mirrors.jenkins.io, are you keeping that domain? Oh, yes; that should be a CNAME pointing to the mirrorbits service, just to avoid breaking installations or proxies for some users. All right. I highly doubt... well, I don't know how much of it would actually transparently upgrade. The access logs of Apache were showing a lot of IPs that were still using it from time to time. Were they using it over HTTP or HTTPS, though? Over HTTP, because there isn't any HTTPS on the mirror. No, good point. I haven't checked, because if you check it over HTTPS you see, I think, the fallback or the pkg page, so it might answer HTTP 200 while being misleading. Got to check. Yeah, I don't know; it's probably not much effort to keep the domain going. Yeah, I mean, mirrors.jenkins.io pointing to the mirrors still makes sense in terms of naming convention, and it's a few lines; it's not that big of an effort. Yeah, I assume there's some comment or something saying that this is just a compatibility domain; nothing in our infrastructure points to it. Exactly.

I've tried different things during the weekend, because it seems like it is getting worse and worse in terms of response time for the update center, and we need to do something about that; that will be the priority for the coming weeks as well. Still, I did something during the weekend: there were three PostgreSQL instances installed and running on that machine in parallel, and there were a lot of Apache error logs. It happened that the change I did did not improve anything; we are still spammed by the alerts. But at least now we only have one fully working instance of MirrorBrain. Most of the issues are related to that error message about the scoreboard of the Apache server, because Apache, due to MirrorBrain, is configured with MPM event, which is a set of dynamic workers to handle the incoming requests. And it seems that the setup for Apache is not correctly done on that instance. Since Olivier and I decreased the size of the machine (we went from 16 CPUs to 8 a few months ago), I bet that with that change we didn't correctly update the Apache server fine-tuning, which results in a lot of hung workers that fill the queue, and so Apache starts to answer slower and slower. Most of the time a reload or a restart of Apache helps, but still, it's an issue. That's an old way of configuring Apache, and there are issues with Apache 2.4 with the MPM event configuration. So the goal is to try to deprecate MirrorBrain as soon as possible so we can go back to MPM prefork, where Apache works, like Nginx, with a static set of child processes, always the same set; it's more deterministic and you can pin them per core, so it's clearly easier to scale. So right now, still issues in that area.
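For the notes, the compatibility CNAME mentioned above could indeed be only a few lines of Terraform, assuming the zone is managed with the azurerm provider; the resource group name is illustrative.

```hcl
# Sketch: keep mirrors.jenkins.io alive as a compatibility alias
# pointing to the new mirrorbits-based service on get.jenkins.io.
resource "azurerm_dns_cname_record" "mirrors" {
  name                = "mirrors"
  zone_name           = "jenkins.io"
  resource_group_name = "dns" # hypothetical resource group
  ttl                 = 3600
  record              = "get.jenkins.io"
}
```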
Based on that, we have: migrate updates.jenkins.io to Oracle Cloud. The idea, and the consensus, is to start spawning an instance of updates.jenkins.io on the Oracle Cloud system, and then start updating all the delivery and sync scripts for the releases, so they can start publishing on both AWS and the new Oracle machine at first. That machine would not be publicly used right away, and as soon as we have the same feature set on both machines, we can switch the DNS to that new machine, in a few weeks. That will also allow us to decrease the AWS bill by $3K, so that's an important one. Hervé, you mentioned that you were interested in starting to work on the Terraform area. Damien, I think the helpdesk ticket may have an extra two or three letters in it; it's updates.jenkins.io, right? Oh, yes, correct. I just changed it. So many things. Yeah, okay. But it is updates.jenkins.io that's being changed. So the first step right now is to prepare the foundation: we need an Oracle Terraform project, and it would be better to start by importing the two existing virtual machines, but that second part is optional, we can do it in parallel. We need a foundational Terraform project to manage it as code, and then we'll have to start creating a virtual machine with Terraform and integrating it with Puppet. And you plan to use round-robin DNS to use both of them for a while? That could be an idea, yeah. Okay. That's a nice idea. So I will update that ticket and link back with Oracle. So, are you interested in starting the project and working on that one? And if you're interested, do you think you can start working on it this week? Yeah. So that one, we keep it for the next iteration.

The new monitoring probes: work in progress, as we said. That's in your area, Stéphane.

The repo.jenkins-ci.org migration for JFrog: I propose that we deprioritize this one for now, because we don't have the bandwidth and it's not top priority. Is that good for everyone? Agreed for me.

Install rngd on all the Jenkins controllers. That's a minor task, but still a good exercise for Stéphane to get autonomous on Puppet, so that's a low-priority task. That's the first step of what Basil reported: the goal is to install that package to improve the behavior of random number generation. The first step doesn't cost anything, and it's an improvement to have that package installed on all the virtual machines managed by Puppet. That won't fix the whole issue: we still need to check how it works and integrates with the Docker containers on the virtual machines, and the same question applies on the Kubernetes area, because it depends on the underlying host operating system and on how the container engine uses /dev/random and /dev/urandom. But first things first: installing the package on the virtual machines is a good first step. So, work in progress. Is it okay, Stéphane, to keep working on that one this week, or is it too much given the amount of tasks? I found some information that I want to discuss with you in that area. Okay, so you will need a sync afterwards. Okay, so we keep it on the next milestone.
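For the notes, back on the updates.jenkins.io migration above: the round-robin idea could look like this in Terraform, again assuming the zone lives in the azurerm provider; both IP addresses are placeholders.

```hcl
# Sketch: round-robin between the existing AWS machine and the new
# Oracle Cloud machine while both serve the same content.
resource "azurerm_dns_a_record" "updates" {
  name                = "updates"
  zone_name           = "jenkins.io"
  resource_group_name = "dns" # hypothetical resource group
  ttl                 = 300   # short TTL to ease the final switch
  records = [
    "192.0.2.10", # placeholder: current AWS machine
    "192.0.2.20", # placeholder: new Oracle Cloud machine
  ]
}
```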
Finally, the DigitalOcean sponsorship. I've stopped the cluster, because it's my payment card that is configured on DigitalOcean and we were going to run out of credits at last week's rate. The cluster is not deleted; it's disabled, and I only removed the node pool of beefy machines that cost a lot. Now we have almost $150 of credits left, which will allow us to stay eight to nine months in that state. So the next step now is to contact DigitalOcean, tell them that we are running out of credits and that it was a nice sponsorship, then ask if they want to continue the sponsorship, because we need more credits, and whether they are willing to increase it, given we are planning to use it more and more given the reliability of the service. I don't know if anyone is interested in maintaining that relationship; by default that will fall back to the infrastructure officer, e.g. myself. I don't mind, but if anyone is willing to do it, no problem for me as well. Yeah, I'll draft the mail and ask your opinion about it. Nice. Okay, these were the things. I will finish the update at the end of the meeting.

I wanted to check with you the infra-team-sync-next issues, the milestone where we put a kind of back burner. But we have to get back to the press mail issue, because we had an answer from Kohsuke; it's not a back burner anymore. Kohsuke answered us that he has the Mailgun account, so he's going to send it to either Mark or me, encrypted. As soon as Mark or I have the access, we will update the issue and put it on the current milestone. But I propose we react to Kohsuke, event-based. Yeah, you're right, because he might send the email this week or next week. I mean, he's a busy person, so I don't mind, but I don't want to put an ETA on us in that area. But correct, thanks; that's a good thing to remember. If we have access to Mailgun, that should help us to see the list of existing items and to create an alias right away.

I haven't heard from the Linux Foundation about the program manager officer that should help us to get started on hosting email, so I'm not sure who to contact, Mark. Is this something where I should send an email to the Jenkins board to escalate? So your question is: who do we ask for help, at the Linux Foundation, sorry, at the CDF? The Linux Foundation told me to contact the PMO of the CDF, to ask them for our email to be hosted. Okay, so what we probably need to do is contact Andrew Grimberg. He's actually at the Linux Foundation, not the CDF, but he can kind of guide us on the next steps. So let me look up his contact information and I'll put it in the notes. And I will take care of that, on the email part.

We have a new ticket that I want to put on top. Hervé, as I mentioned last week, created an issue to specify how to have two Azure Terraform projects: one for the Azure network, including the DNS and the virtual networks, and the other for the Azure resources, like what we worked on with the ci.jenkins.io virtual machine and other pieces of infrastructure. The goal is that by splitting into two projects, we should limit the risk of breaking down the entire infrastructure, like Olivier and Tyler had issues with in the past; that's the reason why they stopped automating the Azure infrastructure management with Terraform. Most of the time it's DNS, and sometimes it's the virtual network, hence the need to have two separated projects, so we can upgrade the infrastructure as code on the Azure area with confidence, whatever kind of contributor we have, whatever their skills, and we can do it quite often without being slowed down by fear. So that will be a nice improvement.
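For the notes, a minimal sketch of how the split could be wired together, assuming the azurerm backend: the resources project reads the network project's outputs through a terraform_remote_state data source, so DNS and virtual networks cannot be touched from the resources side. The storage account, resource group and output names are illustrative.

```hcl
# Sketch: in the "resources" project, consume outputs exposed by the
# separate "network" project instead of managing the network directly.
data "terraform_remote_state" "network" {
  backend = "azurerm"
  config = {
    resource_group_name  = "terraform-states"   # hypothetical
    storage_account_name = "jenkinsinfrastates" # hypothetical
    container_name       = "tfstate"
    key                  = "azure-network.tfstate"
  }
}

# A VM's NIC attaches to a subnet owned by the network project.
resource "azurerm_network_interface" "ci_controller" {
  name                = "ci-controller-nic"
  location            = "East US 2"
  resource_group_name = "ci-jenkins-io" # hypothetical

  ip_configuration {
    name                          = "internal"
    subnet_id                     = data.terraform_remote_state.network.outputs.ci_subnet_id
    private_ip_address_allocation = "Dynamic"
  }
}
```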
There are already a lot of tasks, so are you okay if we keep that one on top of the back burner, and we start once you have worked on the Oracle parts? Which means: if you are quick enough this week and you are able to fulfill all the tasks, don't hesitate to take that task additionally, but only if you were able to finish the current milestone. Is that okay for everyone? I don't see other important tasks to pick from the back burner right now. Do you see some? We have the Keycloak horrific performance. Oh yes. And there is one asked by Alex: manage new Crowdin projects through the helpdesk. All right, yes. The idea, if I understand correctly, is to use the helpdesk repository also for the Crowdin projects. I don't see any issue; Alex already opened the pull request with the issue template, which is really nice. Thanks, Alex, for that. Is there any reason for not accepting this request? Do you see any? No. I was wondering whether it could also be included in RPU requests, since it's related to plugin management more than to a helpdesk or infra issue, but since it has to be added to existing plugins, yeah, it's a good place for these requests for now; later I think it should be included in the repository-permissions-updater template. I had another question about Crowdin: will it also mean that the current translation plugin will be deprecated or uninstalled? Not really, because they are separate and independent things, so you can still use the translation plugin if you wish. I think Crowdin is much easier, but what we hope is that Crowdin will become the de facto preferred way of doing translations, because it's so much easier for translators and for proofreaders. So, did that answer your question, Hervé? Yeah, yeah, it answers it. I'll open an issue on that plugin too; I've noticed some issues with it, like it's completely unreadable when you are in dark mode, or something like that, but it's minor. Ah, on the translation plugin, okay. Yeah, okay. And for Crowdin, I just have some suggestions pending on the request: I'd like to add "localization" in the title, "Crowdin localization project" or something, so it's a little bit clearer what it's for when you don't know what Crowdin is. May I ask you to take the issue, then? So you can discuss with Alex and get that in. Sounds good for you? Yeah. Many thanks, Hervé.

I don't see other team sync topics. Yeah, the rest are bonus. We have the updatecli separation, at some moment in time: being able to have a folder with specific multibranch jobs, or a GitHub organization scanning job, on infra.ci that will only handle the updatecli pipelines. As per the suggestion of Daniel Beck, we could use (I forgot the name of that plugin) the one where you define one Jenkinsfile, and every repository with a marker file will use that Jenkinsfile. That's an intermediate between GitHub organization scanning, where you need to define a Jenkinsfile each time you want a repository to be built, and a full shared library. The marker file only enables or disables the build, without having to specify a Jenkinsfile each time. That would avoid requiring another Jenkinsfile in each case, a Jenkinsfile.updatecli, et cetera. That would avoid the complex pipelines such as the one we have today, where as soon as you use a declarative matrix, you have to wait for updatecli to perform its diff and apply before being able to really do things; and if updatecli fails, it fails your pipeline. So it's kind of complicated, and for being able to manage different pipelines, the only way is creating different jobs. So if anyone is interested and has time, we can work on this, but it's a nice-to-have; it's not the priority.
That's all for me. I don't have other topics. The Blue Ocean link? Blue Ocean... but I don't know, from an earlier conversation. Oh, ci.blueocean.io. Yeah, I have no idea. So, we have that instance. It came from a discussion when Damien announced in Gitter the open source program for the Jenkins Docker organization, and one user noticed that the Blue Ocean image on the jenkinsci Docker Hub user was published a day ago, and then an hour ago. So we were surprised and looked into it. It could be a CloudBees instance, and indeed Olivier Lamy was using it. That instance was working a few hours ago and it was mentioning DevOptics in the left menu, which confirms it's a CloudBees-managed something. So it's not under our area. However, for the Docker image for Blue Ocean, we might need your help, Mark, in that area: I remember that image also being used in the jenkins.io documentation at some point in time. Not anymore. Yes, about a year, a year and a half ago, we removed it; we switched to a Docker-in-Docker technique and it still works just fine. So that Docker image (well, I hope it's still updated) is not relevant to the Jenkins docs anymore. Okay, so that means we should be able to disable the job that builds that image, and then, if CloudBees or someone else needs that instance, they will have to use their own Docker Hub account instead of this one. That will also help with the work around the UI in the Jenkins area. So yeah, we'll try to take care of it: we are waiting for feedback from people inside CloudBees, and based on that we'll create an issue on the helpdesk to see if there are actions on our side in that area. Thanks, Hervé, good reminder.

That's all for me. Do you have other topics? No? Many thanks everyone, good job, and see you next week.