Hi everybody, welcome to this new Jenkins infrastructure meeting. Today on the agenda we'll talk about post mortems. In fact, looking at the agenda we have quite a lot of small topics to cover, so I think it will be better if we take them one by one. The first thing before we start: we published a new weekly release today and everything went great. That's great considering last week's issue, which I'll quickly mention with the post mortem. So let's take the first topic, which is post mortems. Sorry, Damien, are you okay with the action we had from last week, to apply what you learned from the weekly to the LTS, is that settled? Yes, I waited for today's release to validate it, and we have applied this on the LTS; it was validated with today's release. So we just have to merge the PR. That's really great, and thanks for taking that initiative, because this is typically the kind of small issue that can affect us in LTS releases. So the first topic I want to briefly cover: over the last week we had two issues. Last week we had an issue with the weekly release, as Damien briefly mentioned. What happened was a recent upgrade of the Kubernetes cluster; the cluster does not handle paths in Windows containers the same way. That's what Damien troubleshot and identified last week, so the fix was quite easy. If you want to understand a little bit more, and that's where I want to get to, Damien wrote a really nice post mortem. If you have access to HackMD: we have now started writing post mortem documents, and this is the one about the weekly packaging issues. We were able to release every package except the Windows one, because of the path issue. Damien explained everything here, and that's really nice. That document is now available, and we then pushed the document to GitHub, in jenkins-infra/documentation.
We started pushing every document that we write on HackMD to the Git repository; we really consider the Git repository the source of truth. So you now have a post-mortem directory listing the different post mortems. If you're curious about the Windows issue specifically, it's explained here. As I said, we had the path issue with Windows, and during the investigation Damien triggered the wrong job a second time, which generated a second weekly release with the same contents. We had to update the documentation on our side. If you have some input: Damien suggested some improvements that we could make in the short, medium and long term, so feel free to participate here. We usually leave about one week after the outage so people can provide input, because it's easier to consider it then; once the document has been published on the Git repository, we only go back to it for specific needs. It's interesting to note that Damien is proposing a specific template for the document, where he specifies the chronology, the impact of the issue, a bunch of technical elements, what went wrong, and finally the different improvements he made. I think it's a really nice template as is. We had another issue over the weekend, and Mark reused the same pattern for his document, which is again available in the post-mortem directory; that one was related to ci.jenkins.io. Just a quick note for Mark: we can use the tag information here directly. You can specify a tag, in this case "postmortem", which helps us identify documents directly. So the directories we have are post mortems, meetings, maintenance and runbooks. We also agreed with Damien to use the "state" tag to specify whether we are still open to input or not.
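The tag-and-state convention just described, a type tag plus an open/closed state that gates when a document gets archived to Git, can be sketched roughly as follows. The tag names mirror the directories mentioned above, but the document structure and the helper function are hypothetical illustrations, not an actual HackMD feature:

```python
# Sketch of the tag/state workflow described in the meeting.
# Tag names follow the repository directories; everything else is made up.
ALLOWED_TAGS = {"postmortem", "meeting", "maintenance", "runbook"}
ALLOWED_STATES = {"open", "closed"}  # "open" = still collecting feedback

def ready_to_archive(doc):
    """A document is versioned and pushed to Git once its state is closed."""
    if doc["tag"] not in ALLOWED_TAGS:
        raise ValueError(f"unknown tag: {doc['tag']}")
    if doc["state"] not in ALLOWED_STATES:
        raise ValueError(f"unknown state: {doc['state']}")
    return doc["state"] == "closed"

doc = {"title": "Weekly packaging issue", "tag": "postmortem", "state": "closed"}
print(ready_to_archive(doc))  # → True
```

The point of the sketch is the one-way flow: a document stays "open" while feedback is collected, then is closed, versioned, and never edited again.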
So the idea is to collect all the feedback that we can get. Once we consider the incident closed and we don't accept any more feedback, we mark the document as closed and archive it for the future. And the final thing: once we consider a document closed, we version it and push it directly to the Git repository. We really consider the Git repository the source of truth: once we are ready with a document, we publish it, and then we don't go back to it. We have some improvements to make on this Git repository to simplify the visualization of its content, because it's only Markdown or AsciiDoc, but so far I'm quite happy with the amount of documentation we've been able to put there. For example, the meeting notes are published there each time after a meeting. And Olivier, I think I may have initially placed this in the wrong location; did you or Damien put it here for me, or did I actually get it in the right location? So, you wrote the document; the only thing I did was add the tags. I set the state of the document to closed, because we are not waiting for any input anymore. Then I committed the change, so I versioned it in Git and pushed it to the Git repository. That's the only thing I did. Another interesting feature that I like in HackMD is the ability to run linters. For instance, you see the red button: it automatically runs a Markdown linter for Markdown, but it also runs, what's the name again, the tool that corrects text. It can also fix small wording issues and things like that. Grammarly, or is it Grammarly? No, it's not Grammarly. I'm using it on my machine, I just can't remember the name. But yeah, that's interesting too.
You have a small button at the bottom of the screen where you can enable spell check, and it tells you whether a word is correct or not. Yeah. So, just to come back to the post mortems: I really like the fact that you both created a post mortem document first, so it was easy to identify what went wrong, and also that you created an incident on the status page. When we have an incident on the status page, we don't really have to wait for a PR review, because the goal is to announce the incident as quickly as possible. That's what you did over the weekend and last week, so that was really great. Any questions on those two post mortems? Sounds good. The next topic is Freenode. I will not explain the current situation between Freenode, Libera and all the other IRC networks. On the Jenkins project we started to migrate from Freenode to Libera, so we created the equivalent channels and so on. From the Jenkins infra point of view, I think we are almost ready to stop using the channels on Freenode. The last element that had not yet been migrated was the Puppet notifications; I looked at the Puppet server just before the meeting, restarted it, and now notifications are sent to Libera. So I think we can all leave Freenode. I don't think it's the Jenkins infrastructure team's responsibility to communicate about the migration from Freenode to Libera, but on our side we created the IRC channels and the infra bot, and most of us are now on Libera.Chat. So I think we are officially ready, and since even Element provides a bridge to Libera, everybody can move. At least I'm ready to disconnect from Freenode; that was the last change I was waiting for. Any questions before we move to the Discourse topic? Okay, the next topic is about Discourse, as we mentioned during the last Jenkins infrastructure meeting.
The company behind Discourse is now sponsoring the Jenkins project, and we are currently experimenting with a new platform at community.jenkins.io. If you are familiar with Discourse, we are really looking for feedback. We don't want to communicate too broadly right now, because we are still experimenting with, first, the authentication mechanism, meaning how people can authenticate on the service, and second, the way we organize categories. Regarding authentication, we started a discussion on Discourse, so you can now join it there. The main question we are wondering about is: do we rely on Discourse to handle the accounts, so that when someone creates an account, their username, email address and so on are all stored in Discourse? Or do we rely on a third-party SSO, like the one provided by the Linux Foundation? If we rely just on Discourse, we can enable, from Discourse, integration with GitHub, Google, Twitter, LinkedIn and Facebook. So the question is, if we decide to just rely on Discourse, which social networks do we want to integrate with? I don't want to share my own opinion in this meeting, but what I want to say is: if you have some insights and want to participate in the discussion, feel free to join us on community.jenkins.io and provide some arguments about which ones we should choose and which ones we shouldn't. So that's one main area. The second area is the categories, the way we organize things; at the moment we have a draft. If you look at the categories, we have "Using Jenkins" with subcategories, discussions around the Jenkins community with subcategories as well, different ways to contribute, and providing feedback on the site. At the moment most of the discussion is in the site feedback category, because it's the beginning and we are trying to understand how it works.
So if you want to provide some feedback on the categories, feel free to join us and discuss. And if you don't want to participate from Discourse, we still have a discussion happening on the dev mailing list. Any questions? No? Then we can continue. The next topic is about Azure and Azure costs. I looked at the previous invoice and we are still above $10K. I spent a little time trying to understand where the cost comes from. We definitely have around $6,000 spent on ci.jenkins.io, which is quite a lot, mainly in virtual machines and container instances. So that's where most of the cost is going. Otherwise, the costs in all the other areas did not decrease. I was also surprised to see that the cost of the Kubernetes cluster we run on Azure did not decrease. I thought it would, because we stopped running the mirrors on that cluster, so I was expecting a smaller network bandwidth cost, but apparently that's not the case; I guess there are other services generating traffic on the cluster. So I should spend more time understanding how we can save money on that account, but it seems to me that the biggest cost is definitely coming from ci.jenkins.io, so we should make better use of that service. Does that mean we need to set upper bounds on the Azure Container Instances? I don't think it's possible to put limits on the Azure Container Instances; I don't think there is an option for that. Something I think we should prioritize and work on is configuring the AKS cluster on ci.jenkins.io with a fixed amount of nodes. Maybe we would be able to better control the cost by relying on the Kubernetes cluster for our containers as a replacement for Azure Container Instances.
That's one way to better control the cost. The thing is, we cannot decrease the amount of workload on that instance: we have a certain amount of plugin builds, and the more plugins, core changes and releases we have, the more builds we will have. It's more about how we can ensure that we see an effect on the cost, because right now we're still blind; on ci.jenkins.io we don't have a dashboard that helps us. I think the information is present in the system, given the amount of metrics and dashboards we have between the cloud dashboards, our Grafana, and so on. But we need a way to measure, to know what ci.jenkins.io directly costs on Azure, broken down by virtual machines, ACI, and so on; we can try things, but otherwise we are completely blind. That's why we have an issue. Using static resources means we don't benefit from the auto-scaling, so we might be able to control some costs at the price of queuing the jobs, meaning the quality of service will be lower for end users. But maybe we can try experimenting with scaling the machines up instead of scaling horizontally, because sometimes having bigger machines means you don't have to pull as many Docker images, you can benefit from caches, and so on. There are different levers here. I'm not really sure which one to pull, and being able to measure, to have a clear measurement showing whether something we applied actually changed something, is the part I'm still not sure about. We shared that concern with the people from Elastic during their demonstration last week, so that could be an interesting thing in terms of observability, because they have instrumentation at the Jenkins level. The thing is that infrastructure-level metrics alone are hard to use for this topic; we need a business view, which is, from a Jenkins controller point of view, how much memory has been consumed on each machine, and so on.
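The "business view" being asked for here, cost broken down per service rather than raw infrastructure metrics, could start as something as simple as grouping usage records by service. The records below are made-up numbers for illustration, not real Azure data; in practice they would come from a cost export, which, as noted below, only reflects changes days or weeks later:

```python
from collections import defaultdict

# Hypothetical usage records standing in for an Azure cost export.
records = [
    {"service": "Virtual Machines", "resource": "vm-highmem-1", "cost": 1800.0},
    {"service": "Container Instances", "resource": "aci-agent-42", "cost": 950.0},
    {"service": "Virtual Machines", "resource": "vm-standard-3", "cost": 1200.0},
    {"service": "Bandwidth", "resource": "k8s-cluster-egress", "cost": 400.0},
]

def cost_by_service(records):
    """Total cost per service, largest first, so the biggest lever is obvious."""
    totals = defaultdict(float)
    for r in records:
        totals[r["service"]] += r["cost"]
    return dict(sorted(totals.items(), key=lambda kv: -kv[1]))

print(cost_by_service(records))
```

Even a rough breakdown like this would show whether an experiment (fixed AKS nodes, bigger machines) actually moved the dominant line item.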
As Damien mentioned, it's quite difficult to identify how to reduce the cost, because it takes several days, and definitely several weeks, to see the changes in the Azure portal; that's just how it works. So it's definitely a difficult topic. The next topic on the list is release.ci, but I think that was about the post mortem, so we can remove that topic. We also covered the outage with the Azure container instances. Mark, maybe you want to provide more insights regarding that outage? I just had to roll back several plugins, and Tim Jacomb said that he'll take a look at it. He wasn't able to reproduce it, apparently, so it needs more investigation into the root cause. My solution was to roll back five plugins, and that rollback was successful. Okay, thank you. The next topic, observability, can be removed because we briefly talked about it with the Azure costs. We haven't worked on it with Elastic yet; we still have to organize our meeting with them. It's still on the agenda and still planned, but we have to work on that. The last topic I briefly want to cover is making archives.jenkins.io available over an rsync connection. This was a prerequisite for our mirror infrastructure. Just to give some background on archives: archives.jenkins.io is a mirror that contains every artifact generated by the Jenkins project since the beginning, and that includes Hudson artifacts. We couldn't use it from get.jenkins.io because, in order to add it as a mirror, mirrorbits needs either an FTP connection or an rsync connection to collect file metadata, like when the file was changed, its checksum, and so on. So we had to temporarily remove archives.jenkins.io from the mirror infrastructure. And now the archive is available over rsync. At the moment it's only reachable from specific IPs.
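A minimal sketch of the kind of per-file metadata mirrorbits needs (modification time plus a checksum), which is why a plain HTTP endpoint wasn't enough and archives.jenkins.io had to expose rsync or FTP. The file here is a throwaway temp file, and the exact fields and hash algorithm mirrorbits tracks may differ:

```python
import hashlib
import os
import tempfile

def file_metadata(path):
    """Modification time and checksum, the metadata a mirror sync collects."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return {"mtime": os.path.getmtime(path), "sha256": digest.hexdigest()}

# Stand-in for an artifact file; the suffix is purely illustrative.
with tempfile.NamedTemporaryFile(delete=False, suffix=".hpi") as f:
    f.write(b"artifact-bytes")
    path = f.name
meta = file_metadata(path)
print(meta["sha256"][:12])
os.unlink(path)
```

rsync exposes exactly this information (timestamps, sizes, and optionally checksums) for a whole tree in one listing, which HTTP alone does not.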
So we have a list of connections that we allow. Basically, those connections come from pkg.origin.jenkins.io and get.jenkins.io, and that's it. In the future, if third-party mirrors want to mirror every file, they will also be able to get the files from that specific mirror; that would be helpful for increasing the number of mirrors. What does that mean? For instance, if you now look at the output of get.jenkins.io, it's here. This is something that started working just before the meeting, so it's not totally ready yet, but you can see that the mirror is now listed, though it's considered as down; I have to investigate why. The idea is that this mirror should have a very low priority and only be used when no other mirror is available. An example: say you go to get.jenkins.io and want to download a package, but a very old one, for instance the hudson_1.300 version; I'm pretty sure it's unlikely that you want that specific version, and at the moment it's only available from archives.jenkins.io. Not that you would want to download Hudson, but the mirrors maintained by the OSUOSL network can only contain 100 gigabytes of data, which is around one year of files, which means we delete files older than around a year. And third-party mirrors have different approaches to fetching data from the OSUOSL network.
Either they keep a strict copy of the OSUOSL network, or they just keep downloading files; so sometimes you have files that are, let's say, two years old but only available from specific mirrors, because those mirrors have no policy of deleting old files, whereas other mirrors only let you download newer versions. That's why it was important to have archives.jenkins.io available for the older files. Any questions? In the past there was a concern that archives.jenkins.io would not be able to handle the load if we fell back to it; has that concern been resolved? So, it won't be able to handle the full load, definitely, because the traffic that goes to get.jenkins.io is on the order of terabytes per day, and a single machine cannot handle that load. We still have that concern for archives, and I really see archives as a way to download old plugins and old plugin versions. We definitely have to monitor the traffic on archives.jenkins.io, because when we put mirrorbits in place we used archives as a fallback for get.jenkins.io, and what we saw is that when archives.jenkins.io was used as a fallback, it was not able to handle the load and just crashed. That is not something we want today. But I think we should be able to handle the load, and if not, then we can still deploy additional mirrors elsewhere. So is there a risk that this experiment you're doing now will show us that, for instance, a brand new plugin is released, requests fall back to archives.jenkins.io, and we get a large demand on archives.jenkins.io? No, no. The thing is, archives is a fallback for older versions of the plugins only; new plugins come only from the regular mirrors.
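The fallback behavior described here can be sketched as priority-based mirror selection: regular mirrors only keep roughly a year of files (the ~100 GB OSUOSL budget), while archives.jenkins.io keeps everything but sits at the lowest priority so it is only hit when nothing else has the file. The mirror names, priorities and retention numbers below are illustrative assumptions, not the real mirrorbits configuration:

```python
# Illustrative mirror list: lower priority number = preferred.
# max_age_days=None means the mirror never purges files.
MIRRORS = [
    {"name": "ftp-nyc.osuosl.org", "priority": 1, "max_age_days": 365},
    {"name": "archives.jenkins.io", "priority": 99, "max_age_days": None},
]

def pick_mirror(file_age_days):
    """Prefer regular mirrors; skip any that have already purged the file."""
    candidates = [
        m for m in MIRRORS
        if m["max_age_days"] is None or file_age_days <= m["max_age_days"]
    ]
    return min(candidates, key=lambda m: m["priority"])["name"] if candidates else None

print(pick_mirror(30))    # recent weekly: served by a regular mirror
print(pick_mirror(4000))  # Hudson-era file: only archives still has it
```

This is why a brand-new plugin never lands on archives: it is always within every regular mirror's retention window.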
I'm not sure you can visualize that configuration; normally it should be displayed here. It's not displayed, but the two mirror fallbacks are these two: ftp-nyc.osuosl.org and the second one. This is the current default, but those two mirrors do not have all the older files, and that's where archives comes in. So it's truly intended only for older files, not for new files. Yes. And what I mean by older files is files older than one year, or even more. The idea is really: if a file is not available from any of the mirrors, ask archives; and if we realize that we are putting too much pressure on archives, we will deploy an additional machine. That's the goal. So, we are running out of time and running out of topics. Do you have any topic you want to briefly raise before we finish this meeting? Okay then. Thanks everybody for your time, and see you on IRC or the other chats.