Hi everybody, the recording is on now, and welcome to this new Jenkins infrastructure meeting. We have a few things to cover, because the week has been pretty interesting. Basically, everything started last week when we had an outage with the AKS cluster. The cluster had been in a broken state for weeks, and I tried to upgrade it to see if that would help. What it led to was that all nodes were disconnected from the load balancer, so it was not possible to use the cluster anymore. We investigated several solutions and none of them worked, so we decided to delete the cluster and restore everything, which was also interesting in multiple ways, because we discovered some issues. One of them was related to the LDAP database. What we discovered is that around February, the volume that stored the backups was switched to read-only mode, so we had no backups for that database since February, and as a result we weren't able to restore all the users. The first focus was to restore all the services on that cluster; now we are investigating the remaining issues with the LDAP users. So first, do you have any questions regarding the AKS cluster? Anything you want to add?

Yeah, it would be interesting to know more about what we monitor on the cluster, and especially how to better contribute to monitoring in the future. It can be discussed later if there are guidelines I need to read first.

I think the cluster was running, but whenever any change was attempted, there were issues. We hadn't been able to run the deployment job for the last couple of weeks because it kept failing, and then I think Olivier was trying to update the ingress records in preparation for the next version, and that was not working properly, from what I understand.

Yeah, that's basically what happened. The cluster is monitored, and every service running on the cluster is monitored. One thing we could have monitored better is that database. But looking at what happened with the cluster, I'm not sure we could have handled it in a better way, except by having more clusters to spread the risk. Otherwise, it was a weird issue. Something I also tried was to open a ticket with Azure, but there were two problems. First, it would have taken more than a few hours to solve the issue. Second, we don't have support with Azure anymore; if we want technical support, you have to pay for it. That's also something we discovered, and I'm not sure we really need it. I think we just had to clean up everything. That cluster had been running for about a year. And from a monitoring point of view, I don't think we could have caught this. No, it's not at the control-plane level that we could have seen it; it's on the Microsoft side, so we don't have access to what was broken.

Yeah, and what if you used other backup services, for example connecting to Velero directly instead of relying on Microsoft?

Just to come back to the cluster: I'm not talking about the backups here. The Kubernetes service we are using is called AKS, and it's a cluster managed by Azure. Basically, they manage the masters, and the only thing we control is the agents.
And so, when we want to use an updated version of Kubernetes, we just ask Microsoft to update to the next version we want, and Microsoft manages turning the machines down and so on. So we don't have visibility there. What happened here is that we had a lot of timeout issues between the agents and the master, so it was not possible to, let's say, SSH into the machines, look at the logs, see what was happening with etcd, or whatever. It was just a black box to us. That's why I tried to upgrade, hoping that bumping the version, even a minor version, would restart the nodes on the Azure side. That's one of the things we tried. I was also hoping that Azure support would help us in this case, but that was not the case. That's why we decided to just delete everything: at that point, it was the quickest solution to our problem. I did not realize at the time that backups were no longer being taken.

The way the backup works is this: every day, a cron job dumps the database to an Azure File storage, and each time we stop the container, we also generate a backup to that Azure storage. That Azure storage is replicated across multiple regions, so there is no reason the backups would be gone. What happened here is that we had mounted that Azure File storage in read-only mode instead of read-write. Maybe something we need here is a monitoring job that just says: your latest backup is too old, older than one day or one week, whatever. That is something we could have seen earlier. Something to keep in mind is that the LDAP service has been running on Kubernetes for three years. Multiple times we had to move the container to another cluster because we upgraded clusters, and it was almost always transparent: the service was down for a few seconds and we were able to back up and restore very quickly. This was really the first time we had such issues with the LDAP service. So yeah, I have started preparing a retrospective for the database.

Yeah, and I guess setting up some kind of monitoring on all of the critical data backups would be quite high on our list.

Yep, a few things went wrong, but we can definitely prevent this in the future. Otherwise, there are multiple things that I still have to report in the retrospective for the Kubernetes outage. We had a bunch of issues. For example, we generate an Azure public IP that we can reuse and assign to multiple machines, something we have used for a while. One of the issues we had was that we weren't able to reuse the public IP we had generated in the past on the new Kubernetes cluster, because things changed on the Azure side. So we had to delete that public IP, generate a new one, and assign it to the load balancer. Those kinds of issues were really small, but they implied, for example, DNS changes, so we had to wait for hours before the DNS was fully propagated, and so on. I have a list of the things I changed; most of the changes have been pushed to the jenkins-infra/charts repository, but we still have to do the retrospective once everything is totally fixed regarding that outage. So I propose to move to the next topic, which is the LDAP database issues.
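A minimal sketch of the backup-age monitoring job discussed above, assuming the Azure File share is mounted at a hypothetical path such as /mnt/backups/ldap and that the dumps are LDIF files; the threshold and naming are illustrative, not the actual infrastructure configuration:

```python
#!/usr/bin/env python3
"""Alert when the newest LDAP dump in the mounted backup share is too old.

Sketch only: the mount point, file pattern, and threshold are assumptions,
not the real Jenkins infrastructure layout.
"""
import sys
import time
from pathlib import Path

BACKUP_DIR = Path("/mnt/backups/ldap")  # hypothetical mount of the Azure File share
MAX_AGE_SECONDS = 24 * 3600             # alert if the newest dump is older than a day


def newest_backup_age(directory: Path) -> float:
    """Return the age in seconds of the most recent dump, or infinity if none exist."""
    dumps = list(directory.glob("*.ldif*"))
    if not dumps:
        return float("inf")  # no backup at all counts as infinitely old
    newest = max(dump.stat().st_mtime for dump in dumps)
    return time.time() - newest


if __name__ == "__main__":
    age = newest_backup_age(BACKUP_DIR)
    if age > MAX_AGE_SECONDS:
        print(f"ALERT: newest LDAP backup is {age / 3600:.1f}h old")
        sys.exit(1)  # non-zero exit so a cron wrapper or monitoring system can alert
    print(f"OK: newest LDAP backup is {age / 3600:.1f}h old")
```

Run from a cron job, this would have flagged the read-only mount within a day, since no fresh dump would have appeared on the share.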
So, yeah. Before we leave the topic of backups: are there any other services where we should check short-term that we've got a good backup and may not have?

The LDAP database is the only stateful application storing data on the cluster, so that was the only one at risk there. Every other service has been reviewed, and they are all working now.

And the Jira and Confluence backups, and the MySQL backups, where do they go at the moment?

All the services running on VMs, like Jira and Confluence, are just backed up on the machine itself. So if we lose the machine, we lose the backup. We don't have any backup policy for those other services. That is something we should improve.

Okay, that feels like a separate topic we ought to flag: it's worth considering whether we should put something off-machine as a backup destination for the Jira issues database. Confluence I'm less worried about, but Jira, yes. Another one that would seriously need to be backed up is the Jenkins packages.

Yeah, the thing is, we have quite a lot of services where we can retrieve some of the data, like we just did for Jira. But because we don't have a defined backup policy at the moment, we are still at risk on the other services. It really depends on the service.

It should be quite cheap to back it up to Azure or AWS, I would assume. If you never access the data, you normally only pay for transfer, and if you never transfer it, you just need to set it up. So I don't think it's really expensive.

Yeah, about Jira and Confluence, would it be possible to make a one-time snapshot as a backup? Rather than setting up a whole pipeline, at least we would have a recovery point.

We do have backups for Jira, taken I believe every week, but Olivier's point is that they're stored on the same machine.

Yes, it's on the same machine. That's the problem.

That's why I'm asking whether we could have a snapshot. For example, for the wiki, we can just keep one snapshot forever, because we don't expect any changes to happen on the wiki now; that is the one we cannot lose. And for Jira, a snapshot would still be useful in general, so that if anything happens, we at least have one snapshot of a more or less recent version with the historic data, and we can think about what to do next from there.

Yeah, I think we just have to find a place, maybe provision a storage account on, let's say, Azure or Amazon, whatever, and put the backups there. That is what we are going to do; I will file a ticket for that.

And I guess for the Jenkins packages it's a bit less trivial, but for those we at least have mirrors.

The packages shouldn't be too bad, because most of them are in Azure storage already: the deb ones, the Red Hat ones, and so on. That's what I mean when I say we don't have a real backup policy, but most of the data is already duplicated in multiple locations. You mentioned the packages: you're right, most of them are stored on Azure, but, for example, we don't have the very old versions, from before we started uploading there.
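A minimal sketch of the one-time cold snapshot discussed above, assuming a hypothetical tarball produced beforehand (for example a Jira export) and a hypothetical container name; it uses the azure-storage-blob v12 client, with credentials supplied via an environment variable:

```python
#!/usr/bin/env python3
"""One-off cold backup: upload a snapshot tarball to Azure Blob Storage.

Sketch under assumptions: the archive path, container name, and blob naming
scheme are placeholders, not the actual Jenkins infrastructure layout.
"""
import os
from datetime import date
from pathlib import Path

from azure.storage.blob import BlobServiceClient

ARCHIVE = Path("/var/backups/jira-snapshot.tar.gz")  # hypothetical dump produced beforehand
CONTAINER = "cold-backups"                           # hypothetical container name


def upload_snapshot(archive: Path) -> None:
    # Credentials come from the environment so nothing sensitive lives in the script.
    service = BlobServiceClient.from_connection_string(
        os.environ["AZURE_STORAGE_CONNECTION_STRING"])
    blob_name = f"jira/{date.today().isoformat()}-{archive.name}"
    blob = service.get_blob_client(container=CONTAINER, blob=blob_name)
    with archive.open("rb") as data:
        # overwrite=False: refuse to clobber an existing snapshot of the same name.
        blob.upload_blob(data, overwrite=False)
    print(f"uploaded {archive} as {blob_name}")


if __name__ == "__main__":
    upload_snapshot(ARCHIVE)
```

Because the blob store is a different failure domain from the VM, this gives exactly the recovery point asked for above even if the machine is lost.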
So basically, for the release lines that are not used anymore, and the same for some packages — that's what I mean when I say most of the data can be retrieved in some way, but it really depends on the service and the data. For example, the mirrors do not mirror everything: I think the limit is 100 gigabytes, and otherwise content is removed from the mirrors on a regular basis. There is just one mirror, archives.jenkins-ci.org, that contains everything. We don't have a backup policy for it; if we lose that machine, it's not the end of the world, but we should have one.

So in principle, if we wanted to do a cold backup — basically just a historical snapshot that we put somewhere inexpensive — what would the effort be? Just an rsync?

For a first pass, it would just be running rsync commands on the different machines, that's it. If we just want to take one snapshot today, to be sure, it's quick. If we want to put a script in place and do it on a regular basis, then you have to work on the script, and on the monitoring as well, to be sure that the backups are actually taken, and also that you are able to restore them and that you don't have corrupted data. That's a different story.

Yeah, I think a one-time snapshot for Confluence and for the packaging is more than enough. Jira is a bit less trivial, but that's a much bigger story, so just having a snapshot, at least for now, would be great. For Jira we have some sensitive data, let's say the security project and other things, that we would not want to lose, just in case. For the rest, losing the data would still not be great, but it would be a good opportunity to reconsider how we do things.

Thank you. So let's go back to the LDAP database and what we did here. Something you have to understand is that we use LDAP as the source of identity, but multiple services also synchronize the users into their own service and keep a local copy of the user accounts. What happened here is that we lost the LDAP database, so someone could create a new account under a deleted user's name; because the account gets registered in the LDAP database, and because they created that user account, they could then access the services that use it. That was the risk here. Initially we reviewed all the administrators, like me or Lake, and there the risk was low. But we realized it was a different story for plugin maintainers, because we have hundreds of plugin maintainers. So what we did is we fetched the list of users from repo.jenkins-ci.org, compared that list with the Jira database, retrieved the user names, email addresses, and all the information we could, and then recreated the different users in the LDAP database. That's the current state. Now we are looking at how to restore the people who had created an account but do not have admin access. I checked just before the meeting: around 9,000 users were removed from the LDAP database, and now we have to bring them back. For that, we have to write a script. That's kind of the current state at the moment.
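A minimal sketch of the kind of restore script mentioned above, using the ldap3 library; the server address, admin DN, base DN, and attribute schema are illustrative assumptions, not the actual Jenkins LDAP layout, and the CSV is assumed to hold the data recovered from Artifactory and Jira:

```python
#!/usr/bin/env python3
"""Recreate missing user entries in LDAP from a recovered user list.

Sketch only: host, DNs, and schema below are hypothetical placeholders.
"""
import csv
import os

from ldap3 import ALL, Connection, Server

BASE_DN = "ou=people,dc=example,dc=org"  # hypothetical base DN


def restore_users(csv_path: str) -> None:
    server = Server("ldaps://ldap.example.org", get_info=ALL)
    conn = Connection(server,
                      user="cn=admin,dc=example,dc=org",
                      password=os.environ["LDAP_ADMIN_PASSWORD"],
                      auto_bind=True)
    with open(csv_path, newline="") as fh:
        # Expects columns: uid, cn, mail — the fields recovered for each user.
        for row in csv.DictReader(fh):
            dn = f"cn={row['cn']},{BASE_DN}"
            ok = conn.add(dn,
                          object_class=["inetOrgPerson"],
                          attributes={"cn": row["cn"],
                                      "uid": row["uid"],
                                      "sn": row["cn"],
                                      "mail": row["mail"]})
            print(("added" if ok else f"failed: {conn.result}"), dn)
    conn.unbind()


if __name__ == "__main__":
    restore_users("recovered_users.csv")
```

Recreated entries would still need a password reset flow, since password hashes are not recoverable from the secondary data sources.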
So it's a safe assumption that releases stay blocked until next week, at least, right?

I think that's a fair assumption. I blocked account registration, so nobody can register a new account now. Basically, I blocked it until tomorrow, and I hope to be done with restoring all the users in the LDAP database by tomorrow.

Just wanted to add: we have some super users, for example Kiki or Jessica, who should be able to release components now with the current permissions setup. And probably we could start adding some contributors to the allowed list, so that we can partially restore the release permissions for the accounts that need them. I don't think it's the end of the world if we delay restoring the general release permissions, as long as we have a partial workaround.

Yeah. And otherwise, if someone really needs to release something, there are still options to release a plugin. Right now it's just easier to block everything for everybody, so we are sure the situation doesn't change while we're fixing it.

We have identified, I think, 50 maintainer accounts that we no longer have or that have been recreated. So what we can do is remove the permissions for these accounts in the permission files in repository-permissions-updater, and just restore the old behavior, because we know which accounts are potentially compromised or should not have access, and everyone else is fine. Basically, you revert my patch after we make these people no longer maintainers of the components they registered. Then we can go through them, communicate with them using the email addresses we have, for example, in Jira, recreate their accounts, and tell them they need to request a new password; that's about it. Or we just start by resetting the email addresses on those accounts and resetting the passwords, which is basically the same thing. But the vast majority of accounts are fine, and we know they're fine because they existed before February. So I don't think we need to reintroduce the super user problem that we had with Kate and Jesse, which I think I even got rid of.

Okay, if we don't reintroduce it, that's fine. We just need to provide a way, or a partial way. So we have options. Okay, great.

Is there any other question regarding the accounts? Then I guess we can move to the last big topic, the work being done on the automated release. Obviously, I did not have time to work on this over the last week, but just before the outage happened, I merged a major PR so that we can do stable and security releases directly from the release environment. For the stable line, I think it's ready, but before releasing stable from there, I would like to be sure we can use the security one. Right now, for the security releases, I'm looking at a way to test that the full process works correctly. That's the current state. But I think this is more a topic for Daniel.

Sorry, what was the question?

It's not a question; I was just saying that we have to sit together to figure out how to test it and validate that the process works for you.

In principle, what we can do, once the arguments are introduced that make the entire thing configurable, is set up an environment where we would release a weekly release as if it were a security update, and see whether that works.
Or we just create new repos for Maven and for Git and pretend there's a security update happening; we can do either of these things.

So basically, this is something we could test now. Right now we have two jobs: the one that does the release, using the Maven release plugin, and the one that packages everything. We can also promote artifacts at the end of the packaging. And everything is parametrized now, so we just have to sit down together to test the workflow and see if it's working.

Well, it sounds like, if we have time, we could release the next weekly from the security flow: do a fake security release, which is really just a real weekly release.

Isn't the problem that a weekly security release requires releasing from the old release infrastructure? So we cannot use the new release environment, let's say, for that, right?

No, this is really for testing the security flow. What Tim is suggesting is that instead of creating a weekly release on Monday, we create a security release based on the weekly content. We just fetch the data from the Jenkins core master branch; we do not introduce security fixes or anything. We just test it as if it were a security release.

And as a kindness to me, I'd appreciate it if we did that release on Tuesday rather than Monday. But if we must do it on Monday, we'll figure out a way.

We can do it on Tuesday. There was just some confusion on my side, because I missed the email where we said we were going to do the release on Tuesday. The cron was initially set for Monday, and the change was not taken into account yesterday. So we did the release yesterday, but that will not happen again. We can do it on Tuesday.

I think we'd prefer that as well. So I guess we're set: we can do the next weekly with the security workflow. We also don't need to solve the issue with release permissions then.

Yeah. Another thing I added in the release environment is a folder called components, and under components we can now release the remoting component, so we can use the code signing certificate for remoting. The next remoting release will happen from that environment. And I know people were interested in using the code signing certificate to sign components. So that's one more addition I made to the release environment.

Does that also cover important security fixes, or would they still need to be released as before?

What do you mean? Are you talking about the core release or the remoting release?

Remoting. But remoting is a component we deliver as part of Jenkins core, and there have been security fixes in Jenkins core that were actually in remoting. So how would those be handled?

That's a good question. I have to check. I don't think it's ready for that, no.

I mean, every component that goes into Jenkins core could be part of a core security update, and it would need to be staged; then we'd figure out how to include it in core through the POM or whatever, and then we stage core as a security update. So this gets annoying pretty quickly.

Basically, for the remoting component I reused most of what we do for Jenkins core, except that I removed some parts. So I don't expect it to be big work here.
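Since everything is parametrized now, the dry-run security release described above amounts to triggering the pipeline with test values. A minimal sketch using the python-jenkins client; the controller URL, job name, and parameter names are hypothetical stand-ins for whatever the actual release pipeline expects:

```python
#!/usr/bin/env python3
"""Trigger the parametrized release job against throwaway test repos.

Sketch only: job name and parameters are hypothetical placeholders for the
real pipeline's configuration, not its actual interface.
"""
import os

import jenkins  # pip install python-jenkins

server = jenkins.Jenkins("https://release.example.org",  # hypothetical controller URL
                         username=os.environ["JENKINS_USER"],
                         password=os.environ["JENKINS_TOKEN"])

# Point the security workflow at the weekly content: the core master branch,
# plus throwaway Git and Maven repositories so nothing real is published.
server.build_job("core/security-release", parameters={
    "GIT_REPOSITORY": "https://github.com/example/jenkins-test.git",  # throwaway clone
    "GIT_BRANCH": "master",
    "MAVEN_REPOSITORY": "test-staging",                               # throwaway Maven repo
})
print("release job queued")
```

The point of the throwaway repos is that a failed or accidental run stages artifacts nobody consumes, which is exactly what a fake security release needs.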
It was more a proof of concept to see if we can release remoting that way. If all we need is to introduce staging environments, that's not a big deal.

So, last question on automated core releases. I just want to mention that we've got two issues on the weekly at the moment with the release process. One is that Windows is broken due to a Microsoft security update, and we need to rebuild our images, but I think we're having issues with that. The other is that if the packaging fails, it seems we get the metadata uploaded but not the packages uploaded to Azure, so both the deb and the Red Hat lines start failing even though they shouldn't.

Regarding the issue with the deb packages not being fully published, that is weird to me right now, because of the way the packaging works: the deb build happens at the same time as the Windows one, and each is published independently, so if for some reason one of them is broken, the others should still finish their full steps. For example, the deb should still be published. There is another thing that needs to be improved in the release process: at the end of the packaging process, we synchronize the different mirrors by triggering a script, but that script is also triggered by a cron job, and it does a lot of things, not only synchronizing the mirrors. But right now, even if the Windows packaging is broken, it should still publish the deb and the Red Hat packages and so on.

So the metadata is getting updated, but the packages aren't getting uploaded to Azure, and I don't know why without being able to see what's on that machine and what's going on. From the Jenkins job, you just get 404s back from the Azure Blob Storage URL. This has been painful, because for the last two weeks the releases failed because of Windows, which means we've had a day or two where users have been complaining that they can't download Jenkins. I should be able to push an updated Windows image today, and then I'm going to look at the idea of a separate VS tools image, separate from the normal JNLP (inbound) agent image. Hopefully that will make it easier to update the inbound agent separately from the VS tools.

So yeah, some quick updates on the Windows packaging. The packaging has been failing since we upgraded the cluster, and the reason is that we were using an old version of Windows on the old cluster; since the upgrade, we now have an up-to-date version of Windows on the nodes. There was a security issue with the VS tools on old versions of Windows, and in order to fix that security issue, a breaking change was introduced, so the old VS tools do not work with the new Windows nodes. That's one issue. The other issue, which is related to Windows but also to the infrastructure, is that when we started using Windows nodes for the release process, we had one big image containing JNLP and the VS tools, and we also had to put specific infrastructure labels on those specific nodes. Now that we have upgraded every Jenkins instance to use the Jenkins chart from jenkins-infra/charts, we are putting a lot of logic into that specific Windows container just to make it work in our infrastructure, and this does not scale. It's also difficult to test on other machines, because we have to make a lot of specific configuration changes in order to test the Windows packaging. That's why we really need a new container for Windows.
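A minimal sketch of a check for the metadata/binary mismatch described above: it issues HEAD requests against the package URLs the metadata advertises (the URLs below are hypothetical examples, not an actual inventory) and reports any that come back 404:

```python
#!/usr/bin/env python3
"""Report package URLs that the metadata advertises but the blob store lacks.

Sketch only: the URLs are hypothetical examples of published artifacts.
"""
import urllib.error
import urllib.request

URLS = [
    "https://pkg.jenkins.io/debian/direct/jenkins_2.300_all.deb",  # hypothetical
    "https://pkg.jenkins.io/redhat/jenkins-2.300-1.1.noarch.rpm",  # hypothetical
]


def exists(url: str) -> bool:
    """HEAD the URL; True if the artifact is actually served."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return 200 <= response.status < 300
    except urllib.error.HTTPError:
        return False  # 404 (or any error status) means the binary is missing


missing = [url for url in URLS if not exists(url)]
for url in missing:
    print("MISSING:", url)
raise SystemExit(1 if missing else 0)
```

Run as a post-publication step, this would catch the "metadata updated, binaries absent" state before users start reporting broken downloads.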
We are running out of time, so is there any last topic you want to discuss here? Otherwise, we can switch back to async.

I don't know if you saw the news regarding the plugin site: nice improvements. We can now see the issues and the release information for every plugin. I don't know if we are planning to also pull in issues coming from GitHub Issues, or is that supposed to work already? There is a feature request for it. I reported it as an enhancement, because we de facto have components, like Jenkins Configuration as Code, using GitHub Issues, and right now the site points to Jira, which is, well, not that relevant.

I'm not sure what still needs to be implemented for that. I believe we need some update center metadata; there was a pull request from Tim for that, and after that we will need to apply some magic to get it populated. But yeah, the rendering is already there; I don't think the issues and the changelogs would be rendered much differently. Let's see. I've commented that it would be useful if changelog files were recognized, especially when the releases would otherwise be empty, and there is an issue filed for that as well. So we will get to the plugin site, and there are GitHub issues tracking it.

Okay, nice. I will post it in the chat or elsewhere. But yeah, it's a really great improvement.

So let's call it for the week otherwise. I propose to stop the meeting here and go back to our async work. Going once, going twice, going three times. Bye bye, thanks for your time, and see you later. Bye.