...clouds. Good. Hey, hello everyone. Welcome to the weekly infrastructure meeting for the Jenkins public infrastructure. It is the second of November; I hope everyone had a nice weekend.

First of all, an announcement: we will have a new LTS upgrade this Thursday, for November. So please refrain from merging and deploying changes at least from Tuesday until Thursday morning, US time. Ideally we should also avoid Wednesday, but we never know. That LTS release will also be synchronized with a weekly release, of course with the security fixes involved. There should be no functional changes in this weekly release, if I understand correctly; any functional changes merged during the past week will land in two weeks, in the next weekly. On that topic, we still need to find a way to disable weekly releases, because apparently the weekly release was... That will be the first item, actually. It's not an announcement, which is why I wanted to finish the announcements first, if you don't mind.

Are there other announcements for this week? Not that I'm aware of. Maybe one related to the election: the deadline to register for the election is on Sunday. Next Monday I'll send the invitations to vote, and we'll communicate about the candidates. That's all for me; it's not directly related to the Jenkins infrastructure, but it's still an announcement. Thanks. Cool.

Okay, so we can start with the items, unless someone has a last announcement. One, two, three. Okay.

So now, the first item. We discovered (not sure if it's new, because I remember we have seen this behavior before) that when the release team was working on the security release a few days ago and disabled the weekly job, or any multi-branch job, on release.ci and infra.ci, it didn't stick: we have a job that regularly applies the configuration as code, and each time the configuration-as-code and Job DSL setup is reapplied, all the multi-branch jobs are re-enabled. I was not able to find a solution during the past fifteen minutes of searching, so if you know how to do that, or if you are better than me at searching the Job DSL documentation or at understanding Job DSL, please let us know. It seems it is not possible to disable a multi-branch job or a GitHub organization through a Job DSL directive. There is a disabled directive for freestyle, Maven, and plain pipeline jobs, but I didn't find an obvious one for multi-branch: disabling changes the state of the job, but that state is not exposed as a configuration item. So we must use the UI, which is quite annoying, because I had to disable the Kubernetes management job for now, and it should stay disabled until Thursday, because it keeps trying to run new releases again.

Could you just change the pipeline instead, maybe? If a security marker file exists, skip the job, skip the weekly. I'm not sure I understand. Yeah, I didn't understand either, sorry. So, maybe change the pipeline: if a variable says the next release is a security release, or something like that, then just have the pipeline skip itself. That would be a short-term setup that would let us keep the jobs enabled. Yeah, and rather than people doing it manually, they send a PR to the pipeline. Daniel told me it should be okay and that there shouldn't be a new one; I assume that once you have built something, the script won't run it again, or something like that. There is something there to double-check, but yeah, that's a good idea. We could use it until we are able to disable the job through Job DSL. I got a message from a Jenkins contributor who told me that maybe it's simply not implemented in Job DSL.
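To make Tim's suggestion concrete, here is a minimal sketch of such a guard at the top of a scripted Jenkinsfile. The SECURITY_RELEASE_IN_PROGRESS variable is hypothetical; the real marker (an environment variable, a file in the repository, or something else) is still to be decided:

```groovy
// Sketch of a self-skipping weekly pipeline (scripted syntax).
// SECURITY_RELEASE_IN_PROGRESS is a hypothetical marker that the
// release team would set while a security release is being prepared.
if (env.SECURITY_RELEASE_IN_PROGRESS == 'true') {
    currentBuild.result = 'NOT_BUILT'
    currentBuild.description = 'Skipped: security release in progress'
    return // ends the scripted pipeline before any packaging work
}

node('packaging') { // hypothetical agent label
    // ... the existing weekly packaging stages ...
}
```

The point is that the skip lives in the pipeline code itself, so nobody has to touch job state in the UI, and the configuration-as-code reapply can no longer undo it.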
Longer term, exposing that flag should be a contribution to Job DSL: since multi-branch pipelines inherit from the standard job structure, it should be easy to reuse the existing disable support; it's just not exposed. And the reapply is already changing the state of jobs on the Jenkins instances, so there is already some code at play there. So, something to check; a hedged sketch of a possible workaround is at the end of this topic.

In the meantime, be careful when you apply changes on the charts repository: they might be skipped, unless you carefully re-enable the job, let your changes run once they are merged on the master branch, disable the job again, and ensure the weekly job is disabled as well. So please communicate beforehand if you want to do that kind of change, so we can make sure we don't slow down the security team.

And what about just commenting out the part in the jenkins-infra charts repository that configures release.ci, so we are sure we don't modify that configuration at all? Well, I'm not sure whether that would purely and simply delete the job; I'm not sure about the behavior of Job DSL there. No, I mean directly: we have a helmfile listing all the applications we want on the community cluster. If we just comment out the release, the one that configures release.ci, it will not try to configure that one. That's what I mean by commenting it out. Yeah, okay, and we are still able to configure the rest. We would, yes. Yeah, that's a temporary fix. Okay, who is willing to open a pull request for that one? It's an easy one. I can definitely do it right now. Not necessarily right now; it's just to be sure someone leads the subject. That will help. Thanks, Olivier.

Okay, so that confirms a question Hervé and I had last week: our helmfile-based system is not able to uninstall a Helm chart, so we need to do it manually when we want to clean something up. Yes, I think so. Okay, can you go back to the notes? Okay. So with the two temporary fixes that Tim and Olivier proposed, we should be safe to keep working and to ensure the next security release goes well. I'll take care of communicating that to the security team, if that's okay with you, once we have implemented at least the first one. Any other comments or questions, anything not clear on that topic? Sounds good to me.
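As for the Job DSL limitation itself, here is an untested idea: a configure block can patch the generated XML directly, so if branch-api persists a top-level <disabled> element in the multibranch project's config.xml the way plain job types do (an assumption worth verifying against a real config.xml), something like this might work as a stopgap:

```groovy
// Job DSL sketch: force-disable a multibranch job through the raw XML,
// since multibranchPipelineJob {} exposes no disabled() method today.
// Assumption to verify: branch-api stores the disabled flag as a
// top-level <disabled> element in the project's config.xml.
multibranchPipelineJob('weekly-packaging') { // hypothetical job name
    branchSources {
        // ... existing branch source configuration ...
    }
    configure { project ->
        (project / 'disabled').setValue('true')
    }
}
```

Even if that works, it would only be a stopgap until disabled() is properly exposed for multibranch jobs upstream.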
So I'm moving to the next item: the plugin site. Last week we had a DNS issue during the team meeting; that issue has been solved by switching from Alpine to a Debian base. It's not because Alpine is buggy: the base Docker image used for the back end of the plugin site is the official JDK one, and they dropped support for Alpine two years ago, so we were running a really, really old Alpine image. Instead of building our own, we switched to the latest official tag, which is Debian-based, and that fixed the issue.

So, was it related to the Let's Encrypt root certificate change? Absolutely not. It was only DNS, and absolutely not DNSSEC, so no link with Let's Encrypt at all. The symptom was that, randomly, some domains were not receiving a DNS answer. While checking the CoreDNS logs in the cluster, you saw the requests piling up on the CoreDNS agent on the worker where the pod was running. That is characteristic of an old Alpine version unable to deal with the DNS system inside Kubernetes; that was Alpine 3.9, which is really old. So the logs of the application were saying 'I cannot resolve ci.jenkins.io': it was able to resolve other subdomains of jenkins.io, but, randomly, not ci.jenkins.io. And if you restarted the pod, it might switch to another domain at random, which is a smell that you have an issue at the low DNS level.

There will be some work on using Updatecli, which should help keep track of these old images, because we had not updated those dependencies in years; we need to add Updatecli and the related setup there. We have an external contribution from a recent CloudBees employee who started working on that Docker image specifically, and another external contributor who is also willing to work on it. So let's see. Any question on that topic?

So, is there a general way we can check for outdated images like that, as they arrive in the cluster or as they're running in it? I'm just thinking in terms of health and monitoring: is it something worth considering, looking at the base OS version, that kind of thing? I think you need a security tool that analyzes base images, because in our case we simply didn't know. It's the same with the official Jenkins image: if you run 'docker pull jenkins', you get a very old Jenkins version, and unless you use a security tool that tells you it's an old version, or you know the context of that Docker image, it's really hard to manually detect that you are using an old image. Yeah, we have that with the container security products that run in our clusters and scan all the running images. Yep. And for behaviors that are critical (in terms of security; this one was not security, sorry, it was performance), Falco has a set of rules, and we have a Falco installation on the clusters, with rules you can apply that check for well-known behaviors. There are two links in the notes. We can do it, but it's a matter of time to spend on that part; a minimal starting point is sketched at the end of this topic. If someone is interested in contributing in that domain, improving the security of our infrastructure is definitely one of those areas where we need help, because we don't necessarily have the time to work on it. Thank you.

Just for context, I think that with the work the Linux Foundation is doing on LFX Security v2, we may be able to detect that kind of issue. I'm not sure, but I think they can analyze the profile of each repository. Yeah, my understanding is that LFX Security checks the images before they are deployed in the cluster; they don't run real-time analysis on the cluster like Falco or other container-security tools. We need both: something on the supply chain during the build of the images, and something at runtime, because you don't know whether all the container images were checked by LFX Security. If one of the repositories doesn't follow the standard procedure, or if you use an external image, you cannot be sure. So it depends. They were talking about an admission controller that works a bit like Falco, but there is still some analysis work to do there.
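Coming back to the earlier question about auditing what is actually running: nothing like that is in place today, but a first step could be as small as dumping the list of unique images running in the cluster, to eyeball it or feed it to a scanner. A rough sketch in plain Groovy, assuming kubectl access and using the standard jsonpath recipe from the Kubernetes documentation:

```groovy
// Rough sketch: list the unique container images currently running in
// the cluster, to spot old base images (like the Alpine 3.9 one above).
// Assumes kubectl is installed and configured on this machine.
def cmd = ['kubectl', 'get', 'pods', '--all-namespaces',
           '-o', 'jsonpath={..image}']
def images = cmd.execute().text
        .split(/\s+/)
        .findAll { it }   // drop empty tokens
        .toUnique()
        .sort()
images.each { println it }
```

A proper answer would still be a scanner or an admission controller, as discussed above; this only makes the inventory visible.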
Yeah, that's all for me on that topic. Something else? Okay, let's go ahead. Just a note on the AWS costs, which is one of the main priorities these days. We started some work on using spot instances on EC2, and the experiment worked really well: we were able to cut costs by a factor of four during the experiment, which is really impressive. And there are a lot of features of the EC2 plugin for Jenkins that are not documented. So pull requests are incoming, and the final pull request applying that configuration long-term on our instance is coming. It works flawlessly, even with Windows machines, so that's really a quick win. The next step will be to do the same with the worker pool of the EKS cluster. And I noticed that the Azure VM Agents plugin also has settings around spot instances on Azure, so that might be worth enabling, but I haven't checked; I don't know if Tim or someone else has already played with it.

I tested it, and it works. Someone sent a pull request for it, and I think I rewrote that pull request, but yeah, it works. You get a spot VM and it bids at the max price, so you should never get evicted; you just might get it cheaper. And Tim, my understanding was that when we bid a price, we're actually charged the current price, not the bid price. Is that correct, have I understood right? Yeah. Basically you shouldn't lose out on anything, because we're bidding at the regular price but we'll take anything less than that, so you should always get something.

And if we get evicted, Damien, will we see that as build failures? I mean, we get a sixty-second notice or something like that when we're about to be evicted, but that won't be long enough to finish most Jenkins plugin builds. So, the build will fail, because the agent will be killed; Jenkins will be aware of it and will note in the agent logs that it has been evicted. So I assume the build failure should be able to tell us that, at least for EC2. However, Jesse recently suggested there could be an improvement: I saw an issue mentioned on the public tracker about being able to retry builds when the cause of the failure was that the agent died, or was evicted, or whatever, depending on the implementation. So I understand that, as of today, we will see a build failure if we have an eviction. The EC2 Fleet plugin has that feature, and someone sent a PR to port it to the EC2 plugin, but I don't think it was ever merged. Okay, interesting. So that means we might see some build failures on that front. Most of the builds on ci.jenkins.io that have a retry will be retried, so even if it's annoying, like for the ATH where it might be an issue, we'll have to see in practice; a sketch of that retry pattern follows after this discussion.

The strategy I proposed there is to bid a bit more than the current price. Right now, for the instance type we use, the spot price has been around $0.16 per hour, so by bidding at $0.20, a level that has never been reached, you are sure you always get a machine. Because right now, if you bid a bit more on EC2, they automatically put you on what they call a spot block, where you say you want a spot price with a usage guarantee for one, two, or six hours. That behavior has been deprecated (it keeps working until next year for the customers using it), but what I understood from EC2 support is that if you bid a bit above the market price, you won't be evicted, because for now they assume a six-hour block by default. So we should benefit from this behavior until next summer, based on what they wrote in the support exchange. However, the risk is not zero, and we might still see some failed builds, so we have to be careful.
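On the retry point raised above: the generic retry step can already wrap the node allocation today; the caveat, and what the tracker issue asks to improve, is that it retries on any failure rather than only when the agent disappeared. A minimal scripted-pipeline sketch, where the 'spot-linux' label and the Maven call are hypothetical placeholders:

```groovy
// Sketch: re-run the whole body if the build fails, e.g. because a
// spot agent was evicted mid-build. Note that this plain retry does
// not distinguish an evicted agent from a genuine test failure; that
// smarter, cause-aware retry is what the tracker issue asks for.
retry(2) {
    node('spot-linux') { // hypothetical label for spot-backed agents
        checkout scm
        sh 'mvn -B verify' // placeholder build step
    }
}
```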
And do we have a way to detect such evictions? Do we have a way to measure whether this is improving or degrading the situation? Because as far as I know, we removed the Datadog agent, so we don't. And we discussed working with Elastic to get OpenTelemetry in place, but we have not done it, so maybe it would be nice to reprioritize that. Right now we are blind: if that thing happens, we rely only on the users being mad at us, and being mad enough to reach the threshold of alerting us, which might not be the best experience ever for them, to say the least. What do you think about communicating to the developer mailing list that we made this change, and telling developers: if you see a build failure due to an agent dying, please contact us, because it might be related to that change. We want to be sure, because we are not able to monitor it carefully right now. And that means we should also start on the OpenTelemetry usage as well. I think we should find some way to start working on OpenTelemetry, because having to rely on people crying is not really reliable. Yeah, relying on mad users was only a joke; we still need to communicate that information, because as users of the platform, people need to know. I don't want people thinking they did something wrong and that their build failed because of it, which is the worst case. But you're correct, we should think about going back to the OpenTelemetry work.

No more questions on that part? So the priorities will be spot instances for the EC2 VMs, then spot instances for EKS, and then Azure, because Azure is not the priority in terms of keeping the costs at bay. And Damien, how long will it take for us to decide whether this has had a positive impact on infrastructure cost? Is that something we can measure relatively quickly, or does it take a month to know whether it was worth doing? Usually a week is the right time slot, because there are a lot of builds during the weekend: we have a peak each week, on Saturdays or Sundays. And the billing console lets us see the unblended costs daily, so these days I try to check weekly to see the impact. It's been two weeks since we made the change with the labels, and we already see a decreasing trend; even the peaks are smaller, the two last ones. So we should need one or two more weeks before we can state things. Excellent, thank you. Thanks very much. That's all for AWS, unless someone has a question. One, two, three. Okay.

The next topic is the work in progress on the wiki. Hervé is working hard on putting up a static version of the wiki content, in the form of a static web server inside the Kubernetes cluster. Static means all the HTML exports are served as-is. That's currently being tested, and there is an issue tracking it. Hervé is autonomous on that work in any case, but we try to work in pairs from time to time; anyone interested, say hello. The idea is to have a static image that's quite simple, and that should be enough for the dead links to not be dead anymore, once it is deployed and the DNS has been moved to the AKS cluster. I put links to the two repositories related to the wiki. The first one is confluence-data: it contains the Confluence pages exported to HTML. That's the thing Kevin was mentioning with 'you broke my link'; I don't know how to put that in the comments. So the idea is just to have a Docker image (you can play with that Git repository), just a Docker image with nginx, and that's it. And then, Damien, I think you already mentioned public-charts, which is a new Git repository.
Hervé is also working there, to create a Helm chart that would allow us to deploy the exported data. You gave me the perfect transition to the next topic. Yeah, exactly, that's why. So if you want to contribute or play, just go to those two Git repositories; there isn't a lot of content at this stage, so it should be pretty easy to understand. I recommend starting by commenting on the Jira issue I just added, INFRA-3092, to synchronize and say you want to work on that part before sending code, just to be sure no one is caught off guard.

So the next topic is around the cleanups. We have different areas where we need to remove dead code, unused dependencies, unused things. In the area of the charts, there are already some issues on cleaning up the helmfile and the location of the helmfiles. And so Hervé started a public-charts repository, which is aimed at providing Helm charts to people other than the Jenkins infrastructure team. That means the development methodology, the testing methodology, and the deployment and release lifecycle on that repository, with versions and releases, should be, let's say, more strict and more clean, because it is aimed not only at us but at others, compared to the charts we have today, where sometimes we want a chart that is always at the latest version with just a few objects. The goal is also to split the logic between a helmfile and a chart: a chart is a piece of software that you should be able to install anywhere, while the helmfile is the definition of what we install on which cluster of our infrastructure. That's the reason for this cleanup. It has been started already with public-charts, with the wiki chart.

During the cleanup last week, we also uninstalled something that was a mirror running inside AKS, meant as a demonstration of how to install a mirror if you have a Kubernetes cluster at home. Instead of hosting that mirror, which is unused and costs us some dollars on AKS, the goal is to provide the associated Helm chart in public-charts in the future, so we can advertise it and show it as a resource.

Still in the cleanups, there are some incoming Puppet pull requests which should delete everything related to Kubernetes in Puppet, because Puppet has not been managing Kubernetes for a few years now, and there are still some manifests and things around. The reason is that if we want to start working on using the OSUOSL machines as a K3s cluster providing workloads for ci.jenkins.io, the risk is to accidentally mix different kinds of Kubernetes code. So let's start by removing the code that is not used anymore, see if it breaks something inadvertently (it should not), and then we can start to work properly on the new setup.

Another thing we want to clean up is the Confluence resources. Yes, Confluence and Jira resources as well. And finally, we have the cleanup of issues in Jira. That's a weekly work in progress with Hervé, Olivier and Hai, where we try, from time to time, to clean up old issues that no longer make sense, close the incoming issues, or migrate them when they were created accidentally in our Jira tracker. As we said, I think it was two weeks ago, the user experience when opening an issue is terrible. That's a topic we should bring to the advocacy team, because there is nothing that helps users, when they create an issue, to know whether they should choose the INFRA project, the JENKINS project, or something else in Jira. There isn't really any help.
So that's why we end up with so many people opening issues related to a plugin, or to their own Jenkins installation, when they should be redirected to the community forum, to IRC, or maybe to the JENKINS project. So, is the work that Gavin Mogan did on the plugin site for 'report an issue' already a beginning towards that? When I click 'report an issue' on the plugin site now, it takes me to a page that presents three choices: do I want to create a bug, an enhancement, or a security issue? Is that the kind of concept you're looking for, or something different? That could be a great help. I don't know how much we can rely on Jira behaviors, because for instance we realized that if some Google search lands you on a Jira issue inside the INFRA project, then INFRA is set as the default project the next time you create an issue. So maybe some setup on Jira is needed; I don't know how much of Gavin's work would help overcome that kind of behavior. Yeah, as far as I could tell, he predefines the project as part of the URL he's submitting (an illustrative example follows at the end of this topic), but it means you start outside Jira, so it won't help in the case you describe, where you're already in Jira. That's a good one. So that should help, but yeah, it's a topic I would want to bring to advocacy. We don't have enough issues to justify spending too much time on it as part of the infra team, but communicating the problem can be a great help for the user experience; on the advocacy side, or for the general community experience, it could be considered a priority. Thanks, I didn't know we had that for the plugin site, that's cool. Oh, and you've got the redirect embedded in it, very good; okay, it wasn't clear to me how it worked, I just know it works really well. A nice one, okay. Yeah, so that's all for the cleanups, unless someone has a question or something to add.
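For reference, on Gavin's prefilled links: as far as we can tell, the trick is Jira's standard CreateIssueDetails endpoint, which accepts the project and issue type as URL parameters, so a link can drop the reporter directly into the right project. An illustrative URL, where the pid and issuetype values are made up:

```
https://issues.jenkins.io/secure/CreateIssueDetails!init.jspa?pid=10706&issuetype=1&summary=Short%20description%20here
```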
Okay, let's jump to the next topic. Olivier, that one is for you: DigitalOcean. Last week we were not able to see the credits; you contacted DigitalOcean and had a response from Lauren. What's the status? That's right. She confirmed that we received the credits, but we still don't see them in our user interface, so she's looking at it; she raised it internally on the DigitalOcean side. So we're just waiting. So let's wait, and keep that item until we are sure we got the credits. Everyone should have received the email about the extension, because I did as well. Yeah, there are multiple people in copy of that discussion, and just this morning we exchanged multiple emails, so it seems like things are moving. Okay.

So the next topic is Packer images. Two minor items there. First, there has been some work last week on Updatecli to ease how we change a value in a file, because that was not really user-friendly. It was working, but it was creating ghost pull requests: pull requests that said 'update JDK8' when in fact they were changing something else on another line. With that change, the tooling should be better. So, minor. And then, as the second minor element, Mark and Hai started to work in pair on building the Docker images for the infrastructure agents of ci.jenkins.io as part of the Packer builds, to be sure we have exactly the same thing whether you run a container or a virtual machine as an agent on ci.jenkins.io or trusted.ci. That's not high-priority work; the goal is to pair on it so I'm not the only one who masters the Packer process, and to do it from time to time.

So the next one is the Kubernetes 1.20 upgrades. Since the DigitalOcean work is delayed, because we don't have the credits yet, I propose we put back the upgrade to Kubernetes 1.20 as an important topic to treat in the upcoming weeks, because vanilla Kubernetes 1.19 is now end of life: it is supported for only a few more months on both Azure and Amazon, where we have Kubernetes clusters running. The goal will be to start upgrading to Kubernetes 1.20, and we would be interested in someone leading that subject in the upcoming weeks. Just to add more context on that topic: we have created some HackMD templates (on hackmd.io) that we use when we do an upgrade. So if you want to look at what we did in the past, you can go to jenkins-infra/documentation and, under documentation/maintenance/kubernetes, you will see the past upgrades. The idea is to do the same: we plan in advance all the things that may change, and then we just document the procedure, so it gets easier and easier. Before upgrading the cluster, there are a few things we double-check. Just to be clear, leading the subject does not mean you are doing everything: you can lead the subject and ask for someone to work on the implementation; ideally we should do it in pairs. It's just to see if anyone is interested. Yeah, why not. Yep, okay. So that's a good exercise: we'll see if you can create the documents based on the permissions you have, and just start the process. There is no deadline on this; you don't have to start right after the end of this meeting. Let's say we could challenge ourselves to upgrade before the end of November, as a nice-to-have, and if we have to delay for whatever reason, we can completely delay to December. Let's say the goal is before Christmas, but ideally November. Is that okay for everyone?
It seems good, yeah. Let's talk about the dates once we start the document and have a first view of what we need for the upgrade. Completely. So let's say Hervé leads the subject, and I'll be there as your backup if you need anything. I volunteer to help on that topic too. Okay. That's all... Oh, I saw someone added 'JDK update in Docker images'. Yeah, that was me. I saw that Red Hat issued a security advisory for the JDK that included a need to update to JDK 8u312 and JDK 11.0.13. So we've updated the Docker images, but their first exercise in doing a release will be with the 2.319 and 2.303.3 security releases. So one of my worries was that it would be an awkward thing if the Docker build process failed and caused Daniel or Wadeck problems in the security release on Thursday. Is there something I should be doing, or we should be doing, to further assess the risk? I've already checked builds on each of those architectures interactively; I did not attempt to publish an image, though. Beyond that, I just don't see how you could test it. Okay. There is no easy way that comes to my mind at this stage either. The first thing is that we can feel a lot safer if we start using these versions on the infrastructure for the build JDKs, on the agents of ci.jenkins.io and trusted.ci; we need to be synchronized on both, to avoid building something on ci.jenkins.io that then uses a different JDK when it is released on trusted.ci. I was less concerned there, because I've already updated the agents on my test cluster to use those versions, and I found no problem with either 8u312 or 11.0.13, so that part I wasn't as concerned about. And I was hesitant to say let's change trusted.ci or ci.jenkins.io right now, because we're just before the security release; I was almost of the mind of 'let's leave those alone', so that we don't risk disrupting the security release for other reasons. Okay, so that means two things. We need to validate with the security team, telling them for instance that we have used it for a week on the infrastructure, to help them assess that it should be safe. And the risk, if we stay conservative, is that we might need to create a new LTS release, because last time we had to change something on the Docker image that was not directly Jenkins itself, the conclusion on the dev mailing list was: if you want to change even an environment variable on the official image, you need to create a new Jenkins version. Yes, because it won't push an image if there is already one for that version. Right, good point. Okay, so worth a discussion with the security team? Yes, worth it, because they need to understand that if we have to rebuild a new LTS release next week, that might be some time to spend. So yeah, that's worth discussing with them right now, as soon as possible. Okay, so since I'm the guilty party who proposed that pull request, let me start that conversation with them, and assure that they're aware that, yes, 8u312 did have security things fixed in it. I don't think any of them affected Jenkins, so I don't think there's any real threat to Jenkins, but that way they're aware. Okay, yep. And don't hesitate to communicate to them that most of our Kubernetes agents are already using the same Eclipse Temurin base image as the official Jenkins controller. Okay; so many of the agents have already switched to using the newest releases, 11.0.13 or 8u312. Great, yeah, thanks.
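One cheap way to confirm what the agents actually run, before telling the security team it is safe, would be a throwaway scripted pipeline that prints the JDK version on each label of interest. A sketch, with 'linux' and 'windows' as placeholder labels:

```groovy
// Throwaway verification pipeline: print the JDK actually installed on
// each agent label we care about ('linux'/'windows' are placeholders).
// java -version writes to stderr, hence the 2>&1 redirect.
['linux', 'windows'].each { label ->
    node(label) {
        if (isUnix()) {
            sh 'java -version 2>&1'
        } else {
            bat 'java -version 2>&1'
        }
    }
}
```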
I need to double-check that I'm not talking nonsense and that it's not my imagination, but we merged the Updatecli update last weekend and it has been deployed, so let me check. I'm not sure it's available for trusted.ci; that should be easy, though. For PowerPC, I don't know if they were able to release a ppc64le build for JDK8, because the previous JDK8 version had one, which is not the case for this one. So we need to check that, but since we don't have an official Jenkins PowerPC image, that should be okay. Right, and we don't actually deliver any PowerPC Docker images anyway. Okay, so no worries: PowerPC is just not a risk for us right now in that sense. Okay. So if you lead the subject, Mark, don't hesitate to ask for backup on that one as well, since you have a lot of tasks; we are all aware of it and we have already started working on it, so we can help the security team if they want. Great, thank you. Thanks for reporting that. It sounds like we covered every topic, and we are slightly over time. Yep, thanks everybody for your time. Thanks everybody, bye bye, and see you on IRC.