I was doing something yesterday and it failed, so... okay, recording is started. Perfect. Hello everyone, welcome to the Jenkins Infrastructure weekly online public meeting. We are the 22nd of February 2022. Today is... is it a palindrome? You can read the date from left to right or right to left. Hey, nice. Okay, so today we have Mark, Stefan, Tim and Hervé with us. We're missing... Stefan? I'm here. Nice. First announcement: I saw a notification about the release. I don't know if it was finished for today's weekly release, 2.336. Tim or Mark, do you know or not? No problem. I haven't seen anything, so I'm guessing. I have seen that the releases are published to Artifactory. I have not checked that the jobs have completed successfully, and I haven't done the checklist, so I don't know, and I haven't done any investigation on the Docker image or the Docker tagging problem that was reported by someone with regard to JDK 17. Okay, currently checking. We have an automated job that rebuilds the image, and it's visible on Jenkins. I can check it separately, Damien, we don't need to take the time here. If the weekly has not completed, I'll chase it later after I get my problems resolved. Okay, that's the fix for today. But at least on jenkins.io it's advertised as released, so we only have to do the last checks. Just to note, the last weekly introduced a lot of GUI changes — by changes I include new features and graphical bugs that should have been fixed this week. So thanks to anyone reporting these issues and fixing them. Do you have any other announcements? Tim, could you merge the changelog for 2.336 on jenkins.io? I think we already approved it; the docs office hours looked at it multiple times. Looks good. Yep, doing it now. Thanks. Yeah, I see the changelog was started for the next one two hours ago. Cool, I missed that one. I must have... actually no, I'm just way behind, I think.
So as a reminder, the point for us as the infrastructure team is also to ensure that the tooling we provide to the release team is working and stable over time. So it sounds like last week happened without any hiccups, and this week also, with the automation. So for once we have two weeklies straight without any error on the tooling part. Let's see next time we change something in the system. We just need some IRC notifications when the build starts and finishes. Correct. Good point. Good point. Okay, I don't have any other announcements, so I propose we go straight to it. First of all, what did we do? Congrats, Hervé, for the work on DigitalOcean. We are now using DigitalOcean machines in production. The Kubernetes cluster has been added to ci.jenkins.io transparently. There are some jobs that ran or are currently running. We haven't strictly checked the metrics, but we could, because it's reported on Datadog as well. But we saw some activity — almost 30 jobs on the DigitalOcean cluster alone. I don't know if I can share a screenshot here. No. Don't worry, add the screenshot afterwards to the meeting notes. Yeah, I'll publish it on IRC for now. So we will have to start checking, after one week of full usage, the costs that we consume on the DOKS cluster, right? Because we have added auto-scaling. Yeah. Currently, with the node pools we have, there is a minimum of $350 per month, and it can go up to $3,000 if the autoscaled pool goes up to 10 nodes. Okay, so we might have to be careful on that part and maybe reach out quite quickly to DigitalOcean themselves. So yeah, for the costs: to be checked after one week — after one week, how much did we consume of our credits? I also added labels on these Kubernetes agents this morning, so they have the provider name in them if we want to target a specific provider in our test jobs. So you can specify the corresponding label, the DOKS one for instance. Okay, thanks a lot for that.
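For developers who want to pin a test job to one provider, those new labels can be used directly in a Declarative Pipeline. A minimal sketch — the label string here is an assumption, since the exact names weren't captured in the recording:

```groovy
// Hypothetical Jenkinsfile fragment: 'doks' stands in for whatever label
// was actually attached to the DigitalOcean Kubernetes agents.
pipeline {
    agent {
        // Run only on the DigitalOcean-provided Kubernetes agents
        label 'doks'
    }
    stages {
        stage('build') {
            steps {
                sh 'mvn -B -ntp verify'
            }
        }
    }
}
```

Opting out is just as simple: a job that never references the provider-specific label keeps running on any available agent.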
That's nice if you want to avoid that cluster, or target it for a specific test. Hervé, did you write an email to the developers mailing list? Okay, so that's on the list. Let's say: cost — measure the cost consumed. And also, do you mind updating the ci.jenkins.io documentation? I will give you the link — the page where we list all the agents and labels available for developers — to add the two new Kubernetes labels that you added. So they know they can opt in or opt out if they start to see issues, and they will be autonomous to continue working if we don't respond immediately. Is DigitalOcean already on the sponsors page on jenkins.io? If not, maybe... So first, the doc for developers. Yeah. Yeah. Okay. And then, yes, we have DigitalOcean; we need to improve and nurture the sponsorship. We need to check, as you said, whether they are on the sponsors page. I don't know — Mark or Tim, do you know, or even Hervé? I'm asking; I didn't check. I thought we had them added, but it's an easy check. I can take the action separately, or others can check it. Okay. And there is a blog post to start on that topic on jenkins.io. I can't see them — I can't see DigitalOcean on the sponsorship page. Okay. So Hervé, do you mind driving that as well, or just doing the DigitalOcean part, or asking someone else directly? The goal is not for you to do all of it if you can't or don't want to; it's just for you to drive it and ensure that someone does it. Tim's right, we've got Datadog, Discourse, Fastly — that kind of entry for DigitalOcean. Yep. Okay. Nope, they should be there. So it's definitely on the to-do list, right? Yes. Cool. Are there any questions or concerns on the DigitalOcean side? Cool. Next step: the other major topic, particularly targeted at infrastructure costs, is the Azure AKS cluster. That's the top priority for now, at least from my point of view, because: first, cost; second, security.
We now have generic, reusable Terraform tooling that we share across at least three different Terraform projects: DigitalOcean, Datadog and AWS. So the next step now is to re-bootstrap the Azure Terraform project that used to be defined as code. It was last run before I joined the team, so at least a year and a half ago. Since then, the configuration has been changed only manually on Azure, so it has drifted from the config as code, even though everyone tried their best to keep it updated — thanks, everyone, for that effort. So my proposal — it's a proposal, so you can say no, you can stop me — is to start from an empty repository again. We can always use Git to consult the old knowledge if we need to. We start from scratch and add Terraform resources — so the new private cluster to be created, then migrate everything — and we either import existing resources or create new ones in that project with modern tooling, without interfering with the legacy state. Does that make sense? Is there anything that could cause a problem? That sounds good. The main thing that would be good to get sorted, from my point of view, will be the DNS, which would just be an import. The upcoming resources I see in the short term are the new private cluster — instead of the temporary one we have right now — the DNS, the reference to the CAA DNS record, and the new MySQL database. That's not a priority, but it's a good exercise to get started with. That's the status on Azure. On the area of keeping our dependencies up to date — a never-ending topic, of course — thanks, Stefan, for the effort on keeping all the HashiCorp dependencies updated. You drove that change. So Terraform, Terraform modules, Terraform providers, and Packer — all these HashiCorp tools are kept up to date thanks to Updatecli and Stefan's work and tracking.
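Re-bootstrapping from an empty repository and importing the DNS instead of recreating it, as proposed above, would follow the standard Terraform workflow. A sketch with placeholder identifiers — the real subscription and resource-group names are not in the transcript:

```shell
# Fresh project: initialize providers and state backend
terraform init

# Bring the existing Azure DNS zone under the new state without touching
# the legacy state (all IDs below are placeholders):
terraform import azurerm_dns_zone.jenkins_io \
  "/subscriptions/<subscription-id>/resourceGroups/<dns-rg>/providers/Microsoft.Network/dnszones/jenkins.io"

# Review the drift between the imported state and the HCL before applying:
terraform plan
```

The same import-or-create decision then repeats resource by resource, which is what keeps the new project isolated from the old, drifted state.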
There is some work around Golang when we need Golang on some Terraform or OpenVPN repository, for instance. Those haven't been updating for a few weeks or months, so Stefan is currently working on that part. It's not as easy as it looks — but almost there. We plan to do some contributions to Updatecli, because either we are slowed down by, let's say, unwanted behavior, or we want to improve our ability to define an infrastructure resource in one repository — with Terraform, for example a security group for AWS — and then get the real name of that resource, as tracked by Terraform, updated automatically in another repository, for instance the Jenkins configuration as code, or to update documentation automatically: you add a new label on Jenkins, you want it to be automatically updated in the README, because this documentation is important for our users. So, great job. Yeah, let's go ahead. Is there any question — are things clear, or are there things I could have forgotten on the dependency topic? Okay, continuing then. There has been a request from the security team around the account app and Keycloak. It's not the first time this request has been made to the Jenkins infra team; it has been around for months, even years. That's the project of migrating the account app features into a Keycloak installation. The account app is an application that helps a user create and manage their account on our LDAP, the centralized authentication for ci.jenkins.io, for Jira, and all the other tools. Keycloak is the same, except that Keycloak allows us to have a public front end and an admin back end for administration. It provides way more features, and it's not a homemade application like the account app, so the risks are, let's say, a bit lower using it. And that project was never finished; there were some blockers, and I will need help from the people who worked on it in the past. I understand that the work Daniel did on the matrix-auth 3.0 Jenkins plugin removed one of the last blockers.
I don't know if there were other blockers. I've listed here, on the pure infrastructure side, the three main blockers — or almost blockers — steps that we need to fix in order to fully switch away from the account app and dump that project forever. Not that it wasn't useful, but Keycloak is safer as of today and for the future. So I don't know if there are other elements that we did not track on that list — things that you would remember, Mark, Tim, or the others? No, it was just the matrix-auth thing, and the group naming thing, or something where you could possibly create a user with the same name as a group. The Keycloak experiment is from a long time ago. Other tools have been studied, like Dex. Do we have to go with Keycloak — still go with Keycloak? I don't know, because there are some others with some... I don't know. Dex wasn't providing RBAC; Dex was only about identities. But that was some time ago and maybe the situation has changed since, I don't know. Good question. We'll have to ask... and Olivier. So just a note on that: if we are going with another tool, there is a security assessment to be made, I think, as might have been done for Keycloak already — I don't know. One thing, though: we have clarified with Wadeck the fact that the security of the infrastructure itself is the responsibility of the infra officer. It was a long-running subject between Olivier and Daniel when they were respectively the security and infrastructure officers. I don't know if the board had time to validate that change proposal, I don't remember; for me it was okay, and for Wadeck too. That means that the account app / Keycloak topic sits in the middle of these two areas. So that's why we should ask them, but we might expect that it's also our responsibility to define it. So we have to ask them what the requirements are from their point of view — we need them to act as consultants, not as doers, on this. Sounds good for everyone? Is everyone agreeing? Yes, I agree.
Thanks, Mark. So, I don't know if anyone is willing to work on that topic. Let's keep it — by default I plan to keep it with the infrastructure officer, since it's my brand-new security responsibility. But we need to finish the private Azure AKS thing first, because we need a private cluster to run Keycloak on. Another small request that anyone could take: we have a Jenkins controller named cert.ci, managed by Puppet like trusted.ci and ci.jenkins.io. It's only partially managed, meaning we manage the Docker image and the virtual machine, but we don't manage the Jenkins configuration. That instance is dedicated to the security team. Wadeck has asked us on other channels if they could have Windows agents. So the goal is to add the same cloud configuration via JCasC and enable it on that instance, so they would have the same ephemeral virtual machine agents as we have on the other Puppet-managed instances. That's the requirement. They are okay to add Jenkins configuration as code to configure agents, so if they're okay with that, we only have to proceed. It would be managed by adding Hiera data for that specific machine, which should work with the same image and a similar config, with some exceptions. I've listed what is needed; two elements are as code. We might need to ask them to insert a credential, because credentials are manually managed. So I don't know if anyone would want to try this one. Do you think I'd be able to? Yes. So I would, if you agree. Yes. Is there anyone against? So: I agree with Stefan doing it. Yes. Plus one for me as well. That sounds great. Yeah. The only thing I'd say is that generally your default route doesn't include that instance; you need to modify your config on the VPN, or locally on your machine. Oh, yeah. In order to get access to cert.ci, right? Yeah. I mean, you can just add the route to it locally, or you can change the VPN to push the route to it. It's the same thing as cert — sorry, sorry, trusted.ci.
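Enabling the same agent cloud on cert.ci through per-node Hiera data might look roughly like the fragment below. Every key name here is an assumption — the real parameters live in the jenkins-infra Puppet profiles, and this only illustrates the shape of the change:

```yaml
# Hypothetical hieradata/nodes/cert.ci.jenkins.io.yaml fragment;
# key names are illustrative, not the actual profile parameters.
profile::jenkinscontroller::jcasc_enabled: true
profile::jenkinscontroller::clouds:
  azure_vm_agents:
    labels: ['windows', 'azure']   # same ephemeral Windows VM agents as elsewhere
    credentials_id: 'azure-sp'     # credential still inserted manually, as noted
```

The point of going through Hiera is that cert.ci stays on the same Puppet profile as the other controllers, with only the node-level data differing.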
You need an SSH tunnel to reach it. No, you don't need an SSH tunnel, you just need a route added to it. Okay, thanks. Yep, let me write this down. I'm sure I'll remember where the VPN config is — is it in the VPN repository? Yes. Don't worry, we will double-check this one. That's a good reminder, because we would have forgotten otherwise. I think the runbook might have the definition, I'm sure — yeah, at least once. Cool. I have access to cert.ci by default, so worst case I can always compare other settings to mine, so we can find what is missing. Yeah, you have access — I think you're one of the few. Okay, thanks. So, Stefan, thanks — that will be your next task. We don't have a help desk issue for this one, because cert.ci is, let's say, an area that can get sensitive very quickly. That's why we tend to keep talking internally. I don't think that's sensitive; that should really have a ticket. Yes, if Wadeck is okay with that, I'm all for help desk. So your first mission, Stefan, is to ask Wadeck. Convince. Convince, yeah — it's just whether they are okay for us to discuss it publicly in this meeting. So I agree with Tim that it should be on help desk, but we never know, so better to ask first — better safe than sorry in that area. But yeah, you're correct, Tim, I agree with you, it should be public: easier to track and to delegate. Yeah. I think the cert.ci routing might have been added for most people anyway — I think Olivier synced the config to be the same in all the files on the 30th of December 2021. Yes. So it's Puppet-applied. Olivier was able to negotiate with Daniel to enable it again, because it was required to run the updates. So it's updated, at least for the core Docker image — that's why it should be easy. I just mean I think he made all the routing the same for everyone at the end of December. Okay, so that should be okay. I think he just applied his config to everyone. Okay — I didn't hear the beginning of your sentence, maybe.
There is also an ongoing issue from security, open for a few weeks, around disabling anti-spam for the CERT team — I've put the link, that's on help desk. Just a reminder, Mark: around Christmas, I think — or in January, a few weeks ago — you blocked (thanks a lot) an IP at the iptables level on the ci.jenkins.io virtual machine, because it was spamming ci.jenkins.io; it was some Linux scanner. Can you remind us, or does someone remember, what has been done since? I was out, so I don't know. I've done nothing since then. It's embarrassing to admit, but I've done nothing at all; that IP is still blocked, and we just need to take an action — probably, by now, unblock it, watch that it's not an issue, and get rid of that single iptables hack that I put in. I assume you can just restart iptables and it will be gone, unless you modified it on disk. Yeah. I guess it may already be gone then — great, even better. Yeah, iptables just does it in memory while it's running, by default; you have to modify files to get it to save. I love the manual blocker. I'm taking this one, if it's okay for everyone. Yeah, that would be great. Forgive me for not taking further action, but yes, I'm more than happy to have somebody else undo my damage. The other thing — back when you mentioned spam and the CERT issue, because it's a similar thing — is that our dev mailing list is still going into spam, even after the change that Mark made a few weeks back on the Google Groups, right? Yeah. I see the same thing for Jenkins users and several others; I just have to monitor my spam folder. Yeah, I'm doing that too, but it's annoying. I remember Hervé shared something with me. My email-fu is close to zero — I have no idea how this spam stuff works, so I will defer to someone who knows, or I need to go educate myself. It's really complicated. Unless you're Google, right? Yeah, but it's from Google — Google to Google, that's what's amazing.
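Undoing the one-off iptables block discussed above is a two-command affair. A sketch, assuming the rule was a plain DROP in the INPUT chain and using a documentation IP as a placeholder for the real one:

```shell
# Find the manual block (line numbers make deletion easier):
iptables -L INPUT -n --line-numbers

# Remove it, either by repeating the exact rule spec...
iptables -D INPUT -s 203.0.113.10 -j DROP
# ...or by its line number from the listing above:
iptables -D INPUT 3

# As noted in the meeting, iptables rules live only in memory unless
# saved to disk, so a reboot or service restart would also clear it.
```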
I don't think we would be able to do anything about it. Okay. We reconfigured it — it's Kevin who showed us the "sent by the mailing list" or "sent by the user" setting. It's apparently configured to be sent by the mailing list, so on that front I don't see what we can do. Do we have DKIM or SPF records for Google? The setting available is whether you send the email from the mailing list or from the user — that's about it; I'm not sure there are more parameters. Do we have a help desk issue, with people complaining, or, let's say, elements I could use to educate myself? There is, but it was closed because we thought it was fixed — and it's definitely not fixed. So can you share the link on IRC or in the meeting notes? If we see that again... can you share the knowledge and elements on that issue, so anyone could take it or start on it? It's not a big priority, but it's really annoying for most users facing it, so if anyone with email knowledge could help, that would be great. It was marked fixed, but with no technical elements. All the stuff I see on Google is about spam going to Google Groups, not the other way around. And an upcoming topic — what you said about IRC notifications. Let's say the next big thing, once the top priorities are done, will be enabling custom IRC channels used for notifications — maybe multiple, depending on the topic, I still don't know. That's another part where I don't know anything, but Hervé and Tim, you seem to have worked on the jenkins-infra notification IRC channel. I don't understand why, but I can't debug it, since it's running on trusted. In the jenkins-infra Puppet code, the bot is configured to post to and join the new channel I've created. I can't debug it; I can't log on to the machine to see its logs, since it's running elsewhere. We discussed moving the infra reports from trusted.ci to infra.ci — I think this one could be moved too. Okay, so infra.ci. So let me finish on the notification topic — just three elements. We have Puppet notifications.
So, as I said earlier, we have to work on that — I propose we work on it together, Hervé. It's because we have the Puppet master virtual machine that controls that the Puppet manifests are applied, and it receives a message when a Puppet agent applies its manifest on a given VM. That machine is almost not managed as code; it's kind of hidden somewhere. This is also, I understand from the recent feedback, the machine responsible for connecting to IRC and sending messages. So that's where we would have to interact — maybe moving the Puppet bot off of there. I don't know how it works; we will have to diagnose it, at least the two of us. But about notifications, we also have infra.ci and release.ci notifications — the notifications from those two Jenkins instances. infra.ci could be on the same channel as Puppet. But the goal will be: when one of our Terraform, Kubernetes-management, Docker-image or Packer jobs fails to build on the main branch — the principal branch of the repository — we should receive a notification, at least. I don't mind about pull requests, that would be too noisy, but at least the main branch of the main repository, so we know it's failing and we have to fix it. As Stefan discovered earlier today, OpenVPN hadn't been able to update its dependencies for a few weeks, for instance. It could be worse — it could be Kubernetes management. So, yeah — sorry? No, I was thinking we could also have a specific channel in our private Slack, complementary to the IRC notifications. Yeah — IRC first. And as Tim and Daniel suggested during the past days and weeks, we might also want Jenkins release notifications: a channel that would receive messages from release.ci, or maybe the existing Jenkins release channel, I don't know. But when a weekly release starts, it sends a first message, and when it's finished, it sends another message, at least. And the same for LTS. So that would be part of the release process: when a release starts and finishes.
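A main-branch-only failure notification, as described above, could be sketched in a Declarative Pipeline. This assumes the ircbot/instant-messaging plugin's `ircNotify` step is available on the controller; the stage contents are placeholders:

```groovy
// Hypothetical fragment for infra.ci jobs (Terraform, Packer, Docker
// images, Kubernetes management): only the principal branch pings IRC.
pipeline {
    agent any
    stages {
        stage('apply') {
            steps {
                sh 'make deploy'   // placeholder for the real job steps
            }
        }
    }
    post {
        failure {
            script {
                // Pull requests stay silent to avoid noise; only the
                // main branch reports a broken build to IRC.
                if (env.BRANCH_NAME == 'main') {
                    ircNotify()   // step provided by the IRC plugin, if installed
                }
            }
        }
    }
}
```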
I propose we start with these three and see how it helps us. And as you say, a very important one is the infra reports, to be migrated from trusted.ci to infra.ci. This involves switching to a GitHub App — and I will add that there is a help desk issue for that. The initial trigger was the cost and availability of agents on trusted.ci. Also, there have been issues reported by Raul Arabaolaza, one of our contributors, where the GitHub credential used for the infra reports is lacking some permissions on some plugins, resulting in the correspondence between plugins and maintainers on plugins.jenkins.io not being correct in some cases. It's the GitHub user jenkinsadmin which was used to retrieve this data, and since the jenkinsadmin user's permissions were reduced to the minimum in December, I think this data has been incomplete since then. Exactly. So the goal would be to switch to a GitHub App dedicated only to that. Tim gave Hervé and me at least the GitHub App administration right on jenkinsci, so we can create and manage the application. We need administrator rights to install it or update it in production on the jenkinsci organization, but that will allow us to generate credentials only for that purpose, strictly scoped, and not depending on a GitHub user account. So it's really better in terms of security: any admin can change it, but still, it's a GitHub App that can be scoped pretty tightly. That GitHub App would not be able to read private repository content, for instance — only the plugin repositories. Okay, so that's a lot. I took it on myself to add, on line 92, an option to handle the case of spamming on the controller: to use fail2ban to automatically block IPs, if you want. You can erase that or not. That's interesting — I've never used fail2ban, but you already have. I heard it was easy to manage, right? Yeah, absolutely, you just have to... sorry, go ahead.
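The dedicated GitHub App discussed above would surface in Jenkins as a tightly scoped credential. A configuration-as-code sketch — the IDs and environment variable are invented, while the `gitHubApp` credential type itself comes from the github-branch-source plugin:

```yaml
# Hypothetical JCasC fragment for the infra-reports GitHub App.
credentials:
  system:
    domainCredentials:
      - credentials:
          - gitHubApp:
              id: "infra-reports-github-app"   # placeholder credential ID
              appID: "123456"                  # placeholder App number
              privateKey: "${INFRA_REPORTS_APP_KEY}"
              owner: "jenkinsci"               # tokens restricted to this org
```

Unlike a user account's token, the tokens minted from this credential carry only the permissions granted to the App installation, which is what makes the "can't read private repositories" restriction enforceable.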
I think I also mentioned CrowdSec — an IP blacklist, crowd-sourced — but I don't know how easily it could be implemented; I think it needs an agent. I know fail2ban, which is quite easy, but I don't know about CrowdSec. fail2ban is independent from CrowdSec; CrowdSec is collecting the banned IPs, fail2ban-style, from everyone. Oh, you're merging all the information... Yeah, and it's crowd-sourced, so you benefit from it. Okay, please correct my misunderstanding: I thought that fail2ban was a tool that, given a list of IP addresses to allow or deny, blocked connections automatically at the system level. No — fail2ban is watching your log files and taking an action depending on what you tell it to do. For example, if an IP fails three or four times to log in over SSH, it will be banned; but we can write whatever rule we want for fail2ban to ban an IP. For example, if we have the ci.jenkins.io logs of a spammer — meaning that an IP tried to do something three, four, five times and we got that in the logs — we can tell fail2ban: okay, you have to ban that IP. And it will first ban it for, like, one hour; the second time for two, then eight, and so on, depending on how often it comes back. Okay, so from the logs, if correctly configured — by checking the Apache logs — could it do the same as what Mark did manually at the iptables level? Exactly. Okay, so I understand correctly. So fail2ban and crowd-sharing? Nope, it's not sharing any information with others. Unlike CrowdSec — CrowdSec shares the IPs that show clearly bad behavior, and they are kicked right away. There is no sharing with fail2ban; it's just using the current logs on the system. Yeah, and I'm not sure how easy CrowdSec is to implement — it might need an agent and so on. fail2ban should be more than enough. There is something around crowd-sharing that we'd need to make sure of: that the sharing of the IPs is compliant with any applicable rules.
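The behavior described above — watch a log, ban after N hits, escalate the ban on repeat offenses — maps directly onto a fail2ban jail. A sketch with illustrative paths and thresholds; the custom filter for the ci.jenkins.io spammer would still need to be written:

```ini
; Hypothetical /etc/fail2ban/jail.local fragment.
[sshd]
enabled  = true
maxretry = 4                        ; ban after 4 failed SSH logins

[ci-spammer]
enabled  = true
port     = http,https
filter   = ci-spammer               ; custom filter matching the spam requests
logpath  = /var/log/apache2/access.log
maxretry = 5                        ; 5 hits within findtime triggers a ban
findtime = 10m
bantime  = 1h                       ; first ban: one hour
bantime.increment = true            ; repeat offenders get longer bans
```

With a jail like this, the manual iptables block becomes unnecessary: fail2ban inserts and removes the firewall rules itself, based only on the local logs.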
I'm sure you can configure it to only allow inbound, so you retrieve the external IPs and you don't give yours. Yeah, but that's kind of bad if you use the system without contributing back, you know. Okay, interesting — I didn't know; I only knew fail2ban for SSH. It's installed on most of the cloud servers and it works. Yeah, that's because it's the default way of working, but you've got tons of ways of using it. Okay, so I will check that one. Thanks. Welcome. I think that's all. Do you have other topics, or things you want to add or clarify? If it's working... there is an update about the mirrors not updating — I'm not sure it's only the Alibaba mirror. Let me find the issue. Oh, that could be the whole mirror system not updating. That one might be important to check first, then. Okay, thanks, I missed that one. Yes, it's a help desk issue — this one. Okay, thanks a lot. So we've got to check this one. Sorry, one old topic: the 2.336 release is not complete, at least in the sense that I can't see the Linux installers yet. So I've got to do some more research — nothing, no actions required from others yet, but when I attempt to do an install of the weekly on my test machines, the weekly, as far as I can tell, is not visible yet. It's visible there, and that's a good sign, so it just needs more investigation — I'll do the research later. Okay, just to be sure: if it's available here, that's weird, because it means the Linux repositories should have been updated already with their index. So that might be something weird in the deployment process. Thanks. Don't hesitate if you need help or additional tests. Thank you. Cool. I don't have any other topics — do you? Okay, folks, you deserve a coffee or tea or whatever. Thank you. Have a nice day. Thank you, everyone.