We have started. Olivier, your meeting. Hi everybody, welcome to this new Jenkins infrastructure meeting. We have a few incidents to report, so we definitely have a lot to cover today. First, let me share the notes for today. As last week, we are going to use HackMD. Here are the notes; if you cannot edit them, feel free to request access.

So let's start. The first thing I want to talk about is the incident that happened yesterday on get.jenkins.io. As a reminder, get.jenkins.io is a mirror engine: it redirects every request to download an artifact from Jenkins to the mirror closest to your location, which means that from Europe you are redirected to European mirrors, and so on. What happened, around 5 PM UTC, is that for some reason the network storage, the Azure File storage mounted into that mirrorbits service, stopped responding. We got errors saying "quota exceeded". And because we could not communicate with the network file storage, mirrorbits had no idea which file could be distributed to which mirror. That is basically what happened. The current fallback, the way it was configured, is that if get.jenkins.io is not working, it falls back to a service that was using the same network file storage. So basically we just sent way too many requests to the Azure File storage, which was problematic at the time.

It took us around two to three hours to understand the issue. The good thing is that it was the same issue that affected us back in November 2020, so we had a rough idea of what to do and where to look. Several people were involved in that outage. The first step was to redirect the traffic to another machine, named pkg.jenkins.io, which has every file: that machine has the same content as what is located on the Azure File storage. The idea was just to redirect the traffic to a different machine, so we could bring the service back on track until we understood what happened. That was done, and yesterday evening, Europe time, everything was fine.

Yes, sorry, so the redirect: was that a redirect of get.jenkins.io traffic? Was it a DNS change to switch it? Yeah, exactly. We didn't fix get.jenkins.io; we just redirected traffic to a location that was working. The temporary solution was fine, but that machine is not able to handle the load that happens at peak hours, so it was definitely just a temporary solution until we understood what was happening.

On my side, what I did yesterday was to open a connection using PowerShell with the Azure account, to list every open file handle on that Azure File storage. There is a hard limit of 2,000, and what I identified is that in one session we had almost filled that limit: around 1,998 open handles. I'm not sure exactly when, Damien, because we investigated together this afternoon; around 4 PM UTC, I think. Yes, 4 PM UTC is when we started to see the first peaks. And once the Azure File storage was not working correctly, we saw a lot of side issues: CPU usage went crazy inside the nodes, memory increased, and obviously, because the service was not able to answer requests, the number of requests increased as well. So we can clearly identify huge peaks that happened during that time. Until now we could not identify the root cause; I mean, who basically opened those files? Was it an issue on Azure? Was it an issue on our side?
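For reference, here is a minimal sketch of that handle-listing step, assuming the azure-storage-file-share Python SDK; the connection string and share name are hypothetical placeholders, and the actual investigation was done with PowerShell commands, so treat this as an illustrative equivalent only:

```python
# Sketch: count and inspect open handles on an Azure file share.
# Share name and connection string are hypothetical placeholders,
# not the real infrastructure values.
import os

from azure.storage.fileshare import ShareClient

HANDLE_LIMIT = 2000  # hard limit mentioned during the incident

share = ShareClient.from_connection_string(
    conn_str=os.environ["AZURE_STORAGE_CONNECTION_STRING"],
    share_name="mirrorbits-data",  # hypothetical share name
)

# Walk every open handle starting from the share root.
root = share.get_directory_client("")
handles = list(root.list_handles(recursive=True))

print(f"{len(handles)} open handles out of {HANDLE_LIMIT}")
for h in handles[:10]:  # print a sample to spot the origin (path, client IP)
    print(h.path, h.client_ip)
```

Listing the client IPs this way is what surfaced the single Kubernetes node IP mentioned below.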
Did we have one process that just opened every connection at once? Yeah, that is not something we could identify right now. We have identified a private IP, but we are missing some observability data to be able to know, because the IP is one of the Kubernetes nodes. We are missing information to conclude: did we cause the Azure file storage issue by making too many requests, or was it an Azure issue that caused the requests to pile up and create a race condition in whatever application, keeping the file handles open? We cannot conclude one way or the other; we are missing data. So yeah, that was the main issue. Any questions on this? No?

Staying on that specific issue, I think what was really nice to notice, based on our learnings from November: first, we put a status page in place in November, so we could communicate about this issue, and people were able to quickly open the incident on the jenkins-infra/status GitHub repository. So we were able to communicate about the incident when it started and when it closed. That was the first thing.

Some people asked why the monitoring and the status page did not show that get.jenkins.io was down. The root cause is the way the container works: the service starts and reads the Azure File storage content mounted into a directory. The service was working, but could not read the data in a specific directory. Our health check only monitors the root of get.jenkins.io, so it told us the service was up and running; but when we tried to access a specific file, obviously that was not working. So we have to improve the monitoring to check a specific file, for example /TIME or whatever file we choose. Ideally that file needs to be small, so we don't put pressure on the network. But we have to improve the monitoring.

Something we saw in November, and we saw the same pattern this time, was that some requests were passing and others not. We were able to download some specific files, but others were returning 503 errors. In November we had the same issue, where we could access every file except those under the plugins directory; I didn't understand why, and at that time the problem resolved by itself. We saw the same issue yesterday: some requests were passing, others not; get.jenkins.io was up and running, but... yeah, that's one of the things. So what we'll do is improve the monitoring to check a specific file, but this will only help us detect the issue; and in this case we didn't get the monitoring notification.

The second thing that we monitor, that we checked: for a while now, we have been monitoring that we can download the latest Jenkins version from pkg.jenkins.io, and we have a monitoring check for that. How it works: we query the Jenkins update site to see what the latest weekly version is, we retrieve that version, and then we query get.jenkins.io. If that check is failing for 30 minutes, it triggers alerts and we are notified through PagerDuty. What happened here is that the service was not failing consistently for two hours: some requests were passing and others not. When we look at the monitoring dashboard, we clearly see that half of the requests were working and the others not.
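A minimal sketch of the deeper health check discussed above: probe a small known file instead of only the site root, and verify the latest weekly is actually downloadable. The canary path and URLs are assumptions to adapt, not the existing check:

```python
# Sketch of a deeper health check for get.jenkins.io.
# The canary file path and URLs are assumptions to adapt.
import requests

CANARY = "https://get.jenkins.io/TIME"  # hypothetical small file to probe


def canary_ok() -> bool:
    # A 200 on "/" is not enough: during the incident "/" answered
    # while individual files returned 503.
    r = requests.get(CANARY, timeout=10, allow_redirects=True)
    return r.status_code == 200


def latest_weekly_ok() -> bool:
    # Ask the update site for the latest weekly core version...
    meta = requests.get(
        "https://updates.jenkins.io/current/update-center.actual.json",
        timeout=10,
    ).json()
    version = meta["core"]["version"]
    # ...then verify the corresponding war is served by get.jenkins.io.
    war = f"https://get.jenkins.io/war/{version}/jenkins.war"
    r = requests.head(war, timeout=10, allow_redirects=True)
    return r.status_code == 200


if __name__ == "__main__":
    print("canary:", canary_ok(), "latest weekly:", latest_weekly_ok())
```

Because the failure was flapping, a single probe like this would still need several consecutive failures (or a lower failure-rate threshold) before alerting.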
So that's why we didn't get notified by the monitoring. We had a look at a few things here, but we could not really... I mean, yeah, that was a tough issue to detect.

Another thing that we have to improve here: I cleaned up the open connections. I just had to connect to the Azure console and run a bunch of commands. Obviously in my case that was easy, because the commands I ran back in November were still in my PowerShell history, so I just re-executed the same commands. But I should put that documentation in a runbook, so that next time, if it happens again, someone else can do the same. This is something we are working on with Damien, to be sure that someone else can run the same commands again in the future. But yeah, any questions?

I also put here the kind of accesses that are done on that network storage, just to give you an idea. You can mount the same Azure File storage into multiple containers: you can write from multiple containers, you can read from multiple containers, but you still have a limit on the number of open files you can have at the same time. And just to give you an idea: we have monitoring checks that test whether some specific locations in the container can be accessed; we have an Apache that can return content; we have mirrorbits, which scans on a regular basis from every container; and we have the Datadog monitoring. So it's really difficult to get a clear understanding of where those file accesses came from. But we are still investigating.

The next topic that I want to mention, the next outage... I mean, it was not really an outage. Any questions before we move on? No?

Another issue that happened last week: we wanted to improve the way we deliver the jenkins.io website, to rely only on Helm charts, and we faced an interesting challenge here. We have branch protection on the Git repository that contains the jenkins.io website, so we always use pull requests to introduce changes. And because the new workflow implies committing directly to the branch, we could not identify a way to keep the branch protection but only allow a specific bot to modify that specific file. We had a bunch of discussions here, and I'm really open to suggestions. One of them was to just remove the branch protection, so we simply allow the right person to commit directly to master. There were also suggestions about keeping the branch protection but automatically opening a pull request and automatically closing that pull request.

So I have a proposal here, because there were at least six different ways of implementing that workflow that came up during the meeting, which means we don't have a consensus right now. That could be a nice ground to start writing an IEP, an infrastructure enhancement proposal, which is the same kind of thing as a JEP; it's something we haven't done for a long time. The goal would be to state the goals, list the solutions with their pros and cons, and then discuss based on that, just to be sure everything is clear for everyone, and then have a specific meeting and a decision. Then we can jump to implementation, because it appears that the two attempts at implementing it were each missing a tiny part, where we discovered that maybe we have to reach a consensus or act somewhere.
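For the cleanup runbook mentioned above, here is a minimal sketch of the "close the leaked handles" step, again assuming the azure-storage-file-share Python SDK with a hypothetical share name; the real cleanup was done from the Azure console with PowerShell, so this is only an illustrative equivalent:

```python
# Sketch of the cleanup step: force-close every open handle on the share.
# Share name and connection string are hypothetical placeholders.
import os

from azure.storage.fileshare import ShareClient

share = ShareClient.from_connection_string(
    conn_str=os.environ["AZURE_STORAGE_CONNECTION_STRING"],
    share_name="mirrorbits-data",  # hypothetical share name
)

root = share.get_directory_client("")

# Close every handle under the share root, recursively; this frees the
# quota so mirrorbits can read the share again.
result = root.close_all_handles(recursive=True)
print(f"closed {result['closed_handles_count']} handles, "
      f"failed on {result['failed_handles_count']}")
```

Whatever form the runbook takes (PowerShell or Python), the point is that anyone on the team can run it without having the November history at hand.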
So that's my proposal here: that we start a specific discussion with a written process first, where we list the solutions, just to avoid the risk of a meeting where we might not understand or see all the parts. What do you think?

I would be really happy to work on that document, because four years ago I worked on the same thing for the current implementation, the current way to deploy things on our websites, and every other website like the javadoc and the plugin site and so on. In four years a lot of things have evolved, so I would be really glad to re-evaluate the assumptions that were made four years ago and propose something different.

It was mentioned that maybe IEPs could be merged into JEPs. I don't really care; the goal is that we get started on writing the proposal. If we see that we can do that exercise as a community team for one or two important topics, then we can raise the discussion of whether we should go to JEPs. But the goal is to learn to walk before running here. That's the point of the proposal: staying on the IEP process, which hasn't been updated for a long time, and seeing how we behave as a community team. Then we can see if we have to merge into JEPs, where I would say the writers and readers of JEPs are more at ease with that process than we are today.

That sounds like a good idea. And I would even go one step further: since we are again putting a lot of documentation into the documentation git repository, I'm just wondering if we could put the IEP documents there as well, so we regroup the IEP documents, the meeting notes, outages, maintenance documents, and so on in one location.

Olivier, I'm not sure I understood the last thing that you said, could you repeat? So the idea was to place the IEPs in a different location than the existing IEP repository? Yeah. Closer to the... okay. So right now we have a git repository named "documentation" that was created a long time ago, and the idea was to have public documentation where we document everything related to infrastructure. Last week we had a discussion in this meeting about whether we should put the ACME documents on the jenkins.io website or in a different location, and we agreed that we would push them to a different git repository, which is the documentation one. And since we collect the notes for the meetings, the upgrade plans, and everything else, including runbooks, there, I was just suggesting to move the IEP documents into the same git repository, so we just have a bigger repository with more content. That sounds great to me. Okay, thanks for the clarification. I'm going to move them to jenkins-infra/documentation. Sounds good.

Those were the two most important topics that I wanted to talk about today. Damien, you were mentioning that you wanted to bring up the WebSocket topic and the issues we have with release.ci and infra.ci, so I guess this is the right time.

Yep. As I mentioned before the recording, and I'm mentioning it now for everyone: yesterday, during the incident, while we were investigating in the Azure console, we saw an alert on Kubernetes, which was only a warning, when we ran the diagnose-and-troubleshoot tool integrated in the Azure UI. That warning was about the fact that the Kubernetes control plane of our AKS cluster named publick8s is... throttled. That means we are making too many requests at times. I was having a bad moment with the language; "throttled" is the correct word, with a T. So we are throttled.
So that means that some of our requests are queued before being sent, to avoid putting a big workload on the API control plane. These requests come from different sources: mostly our helmfile process, which takes care of the GitOps operations on the Kubernetes cluster, but also all the Jenkins instances that are spawning pods. When a pipeline is running inside a pod template, it runs WebSocket commands from a Kubernetes client inside Jenkins, to run the kubectl exec command in charge of the pipeline steps sh or bat, depending on Linux or Windows. So these requests are also sent directly to the Kubernetes API control plane. We don't know who the culprit is, but this would explain the WebSocket issues we see, because WebSocket is not only used between Jenkins and its agents, but also between Jenkins and the containers that are not the default JNLP container of a given pod when the pipeline runs. For each sh step there is a kubectl exec, which involves one WebSocket. Most of the errors we saw were related to this and are correlated in time with the peaks. So we will have to push this further, maybe by monitoring the amount of requests or the peaks of builds; I don't know which direction to go from there. That's also something we saw yesterday, and we need more data to prove that it's related to that.

What monitoring can we get out of Azure? Are we able to see when we're exceeding the thresholds or the quotas on AKS? Yeah, there is some monitoring currently integrated in Azure, so we could maybe start with this point. I don't know how to extract that information continuously from Jenkins, though. There are ways to have a Groovy script that will print the instantaneous usage of the current open connections from Jenkins. However, from the graphs it looks like Azure is able to determine the user agent of each request, because I saw some requests that were coming from my own web browser, with the operating system, the browser, and the kind of client. So maybe that could help, because I suppose the user agent of the Kubernetes client in Jenkins is different from a web browser or from a helmfile apply coming from a Go binary.

What would also be nice is to identify all the potential limits that we have when using AKS, because that's a common issue in Azure: whatever service you use, you have limits, like the number of CPUs you can deploy in one region and stuff like that, or the number of files you can open on the Azure File storage at the same time. So maybe we have a limit that we need to identify. And it could also be worth checking with the Jenkins and Jenkins X communities, and maybe Jenkins users or any other variation, because I'm sure we are not alone. I mean, we don't make that many requests. So is it because Azure AKS is having some issues? Is it because of the Kubernetes version? Do we have other users with the same issues on this kind of Azure Kubernetes cluster? Because it's not uncommon, and we don't do anything exotic: we run pipelines on pods, and a pod template might have two or sometimes three containers. That's not a lot. So even though there are technical limitations, it's also a matter of user experience in Jenkins when using the Kubernetes plugin.

Yeah, I'm wondering... I know that on Jenkins X there's a recommendation to tune down Kubernetes External Secrets polling, because that can be quite heavy in terms of listing secrets and looking for changes.
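To illustrate the mechanism described above (each sh or bat step in a pod translates into a kubectl-exec style call carried over a WebSocket to the API server), here is a minimal sketch using the official Python Kubernetes client; the pod, namespace, and container names are hypothetical placeholders:

```python
# Sketch: what the Kubernetes plugin effectively does for every sh/bat
# step: an exec call against the API server, carried over a WebSocket.
# Pod/namespace/container names are hypothetical placeholders.
from kubernetes import client, config
from kubernetes.stream import stream

config.load_kube_config()  # or load_incluster_config() when running in-cluster
v1 = client.CoreV1Api()

# One "sh 'uname -a'" step roughly maps to one exec like this one,
# i.e. one more WebSocket hitting the (throttled) control plane.
output = stream(
    v1.connect_get_namespaced_pod_exec,
    name="maven-agent-1234",     # hypothetical agent pod
    namespace="jenkins-agents",  # hypothetical namespace
    container="maven",           # a non-default (non-JNLP) container
    command=["/bin/sh", "-c", "uname -a"],
    stderr=True, stdin=False, stdout=True, tty=False,
)
print(output)
```

Multiply that by every step of every parallel build, plus the regular helmfile applies, and the control-plane request volume adds up quickly.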
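And for the "what monitoring can we get out of Azure" question, a possible starting point is pulling the AKS platform metrics with the azure-monitor-query SDK. In this sketch the resource ID is a placeholder and the metric name is an assumption to verify against what the cluster actually exposes in the Azure portal:

```python
# Sketch: query AKS platform metrics to look for control-plane pressure.
# Resource ID is a placeholder; the metric name is an assumption to verify.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient

AKS_RESOURCE_ID = (
    "/subscriptions/<sub>/resourceGroups/<rg>"
    "/providers/Microsoft.ContainerService/managedClusters/publick8s"
)

metrics_client = MetricsQueryClient(DefaultAzureCredential())
response = metrics_client.query_resource(
    AKS_RESOURCE_ID,
    metric_names=["apiserver_current_inflight_requests"],  # assumed metric name
    timespan=timedelta(hours=24),
    granularity=timedelta(minutes=5),
    aggregations=["Maximum"],
)

# Print one point per 5-minute bucket so peaks can be lined up with builds.
for metric in response.metrics:
    for series in metric.timeseries:
        for point in series.data:
            print(point.timestamp, point.maximum)
```

The per-user-agent breakdown mentioned above would still come from the Azure portal or diagnostic logs; this only covers the raw control-plane counters.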
But I don't think we're running that on the infra cluster. My guess is that it's probably all related to the fact that we run a full deploy, a full helmfile sync, so often. Maybe it'd be nice to reach out to our Elastic friends, because I know they work on Jenkins observability. Yeah. So right now the first step will be, I'd say, checking the existing monitoring on Azure, seeing the breakdown between the different clients, and seeing which one is emitting a bunch of peaks; and then from there we could go further. It's interesting to investigate, and Damien, maybe you will have to look at it, you will get something. Anyone interested in pairing with me is welcome, from the team or from outside. Yep, all right, I can help you. Sounds great.

One quick add-on: in the action points I added the runbooks, and I'm going to polish the writing. The goal is to write runbooks about the elements that could have helped us yesterday during the incident. We tried as much as possible not to wake up Olivier, who already had the knowledge from the previous incident. We partly failed and partly succeeded, because the team was able to provide a fallback solution. So I've identified at least two procedures that must be written at all costs. First, how to fall back, what the team and Mark did yesterday: how to have a temporary fallback to be sure that users can still download files for one full day, slower, without the mirror capability, but they can still download. Second, how to identify and fix the Azure File storage issue. For the identification, we were able to follow Microsoft's online procedure, but we were missing some points about the PowerShell commands that Olivier mentioned earlier. This should be a runbook, and by runbook I mean a no-brainer: only the main dots, since we can still draw the lines between the dots ourselves during an incident; but there were some missing elements that could have helped, and maybe could have allowed fixing the issue without bothering Olivier. Now we all have the knowledge and understanding of what happened, and since it already happened in November, this is the second time, which means we will have another issue with the Azure File storage. So if we have a runbook for that, it's a no-brainer: anyone from the team can fix it, and that should also shrink the duration of the issue.

Sounds perfect. And again, we'll put those two runbooks in the jenkins-infra documentation repository we were discussing.

Is it worth one more action item, to investigate or suggest or discuss ways to detect that style of failure? The failure mode was rather strange in terms of flapping: you know, it was on again, off again, some here, some there. I'm not sure that failure detection is ultimately possible for that kind of failure, but is it worth it?

Yeah, to be honest, detecting the issue that happened yesterday is definitely challenging because, as you said, it was flapping. When we look at the monitoring, sometimes the checks were passing, sometimes not; and it's even worse: we have 400 gigabytes of files, and some files were accessible, others not. In November those files were under the plugins directory; this time it was not necessarily the same.

What do you think, Olivier: I don't know if it's technically possible, but the number of open file handles on the Azure File storage was a pretty good indicator that something went wrong, because it's the quota we reached. And once the quota was reached, everything broke.
So do you think it's possible to have a routine that at least measures the number of open file handles on that specific volume? The limit is, let's say, 2,000; so if it reaches some threshold below that, let's say 1,600, then we get an alert that says: okay, maybe you should check, along with the runbook, with the process. A human still has to operate, but the human would not have to go through "is the issue on the Kubernetes side?"; we would know that the issue is specifically on the Azure File storage.

First things first: everything is possible, depending on how much time we want to invest in it. Practically, I'm not sure it's worthwhile... The checks that are published to Datadog are written in Python; I don't have the documentation at hand, but we could probably use the Python SDK for that. However, I don't think that information is available as-is in Datadog, so we would definitely have to collect it and publish it to Datadog ourselves.

Would you be okay, Olivier... I think there's some interest there. If I ask Datadog: do they already have a built-in monitor somewhere that would check for Azure File storage issues? Because I would expect this to be a common thing they've already implemented, and all we would need to do is find out what they did. We definitely have storage information: storage, file counts... I have to check. I mean, we have information like ingress, egress, and stuff like that, and we have the latency, but I don't think we have that particular information... Yeah, I have to look at it. Okay. Thanks.

So, we are running out of time, and I propose to finish the meeting here. But before we do, I just want to highlight that, because of the new workflow, we are going to have one document per meeting. I put the link to next week's meeting at the end of this document, so if you have anything you want to put on the agenda, feel free to add it there. And we'll use that next document next week. Thanks for your time. Have a great day. Bye-bye.
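As a follow-up note on the custom check idea discussed above, here is a minimal sketch of collecting the open-handle count ourselves and publishing it to Datadog so a monitor can alert well before the quota; it uses the datadog Python library with DogStatsD, and the share name, metric name, and 1,600 threshold are illustrative assumptions:

```python
# Sketch: collect the open-handle count on the Azure file share and push
# it to Datadog as a custom gauge, so a monitor can alert well before the
# 2,000 handle quota. Names and thresholds are illustrative assumptions.
import os

from azure.storage.fileshare import ShareClient
from datadog import initialize, statsd

initialize(statsd_host="localhost", statsd_port=8125)  # local DogStatsD agent

share = ShareClient.from_connection_string(
    conn_str=os.environ["AZURE_STORAGE_CONNECTION_STRING"],
    share_name="mirrorbits-data",  # hypothetical share name
)

open_handles = sum(
    1 for _ in share.get_directory_client("").list_handles(recursive=True)
)

# A Datadog monitor on this gauge (warning around 1,600, for instance)
# would point straight at the file storage instead of the Kubernetes side.
statsd.gauge("azure.fileshare.open_handles", open_handles,
             tags=["share:mirrorbits-data"])
print(f"open handles: {open_handles}")
```

Run on a schedule (cron or an existing check runner), this covers the "collect it and publish it to Datadog ourselves" option; whether Datadog already exposes an equivalent built-in Azure Files metric remains to be checked, as noted above.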