Here we go. Everyone, welcome to the Jenkins Infrastructure Weekly Team Meeting. Today is the third of May, 2022. So today we have Mark Waite — is he there? We have Stéphane Merle. We have Basil — is your name Crow? Am I pronouncing it correctly? Yeah, that's right, Basil Crow. OK, I'm still having trouble getting the correct intonation for Basil. No worries. I'm really sorry if I mangle it. And do we put one or two Ds in, Bruno? Two Ds, please. OK, so we have Bruno Verachten, Damien Duportal, Stéphane Merle, Basil. Hervé is off and Mark won't be able to join.

First of all, announcements. Today is the day of the weekly release. The weekly release failed to happen because another service credential expired a few days ago — four or five days ago. It failed because that credential is required to retrieve the GPG key used to sign the packages. I don't know the exact details; I'm not at ease with that part of the release process. There might be an issue if the release has already been performed by Maven and the artifacts have been pushed to JFrog Artifactory: as I remember, once that is done, the packaging downloads the artifacts, signs them, and uploads them again. But my memory might betray me, and I only spent a little time on it just before this meeting, so I don't know; it has to be diagnosed. An issue has been opened on the helpdesk, and Tim and Mark have been pinged, so we'll delegate that part to them. If they are not able to fix it by tomorrow, then we will take over and trigger a new release tomorrow morning. The risk is that the new release might have an incremented version number — it could be 2.347 instead. Ideally we should avoid skipping a version, but if they cannot fix it and are not available, we trigger a new one. That's not that much of an issue, but it's always better to avoid it. So: the release failed because of an expired credential.

LTS baseline selection — for that one, I'll let you type. I don't know who is typing; is it you, Stefan? It's me. I'll try to help you. OK, nice. So let's get started on other announcements, unless you have some. OK, then let's proceed.

A quick note on what you see on the screen: the tasks we were able to finish successfully during the past milestone. We were able to fix the missing ARM and CPUZ images. We didn't fix the SSO for crowdin.jenkins.io: as discussed last week, the collegial decision was that we don't want to connect this new service to the actual LDAP. It might use GitHub SSO instead, and that's a different area, because the Jenkins infrastructure team are not admins of the jenkinsci organization on GitHub. So it has been closed as won't-fix.

We were able to finish the upgrade campaign for Kubernetes 1.21. That happened last Wednesday on the AKS cluster — those were the two remaining clusters. There was a two-minute outage for the LDAP, because it's not highly available: the time for the pod to restart on the new machine. We were able to upgrade all the Adoptium JDK 17 and 11 installations. We have an issue tracking the JDK 17 one, while the JDK 11 one was done automatically by updatecli, so we didn't create an issue just for that; but the JDK 17 upgrade was a security fix, which is why we created an issue. These versions are now available on all the ci.jenkins.io agents and also in the tools of the ci.jenkins.io system. If there is a problem, please open a helpdesk issue; we can totally roll back the virtual machine and container templates. We also had an issue with the web UI of JFrog Artifactory.
The web UI wasn't available while Artifactory itself was still okay. So we contacted JFrog, who fixed the issue. However, along the way we found that we are missing a monitoring element: we need to add a new web UI probe. A new issue has been created to track that, because we were only monitoring the backend that Maven uses for downloading or uploading artifacts. But the web UI also provides end-user features — some end users want to search the artifacts, so they need the web UI and they need to be able to log in to that service. We need to monitor it so we can avoid having the service down for three days again. Thanks to all the users who were patient enough to let us know. That's clearly an improvement area for us, because we were really bad there.

Finally, we were able to finish and close the issue regarding Docker Hub credentials. Now all the controllers have their own set of credentials, and we have split pull and push to increase the security of all the Docker Hub accounts. We have another ongoing issue about being onboarded on the Docker open source program, which will increase the rate limit for our containers. They accepted us and now they are doing the technical work; they should come back to us, but they are still quite busy. Thanks, Stefan, for doing the heavy lifting on the pipeline library there. And we documented all the accounts, which was information that had been sorely missing. That documentation is not public. It probably should be, but we weren't focused on public versus private; we were focused on writing it down somewhere. We can decide whether to make it public in a second step.

Now, on the work-in-progress items, I propose to start with the ci.jenkins.io outages. Initially we had container agents in a degraded state — an issue opened 12 days ago, if I'm not mistaken. We have a postmortem to run on that, and I've proposed a date and time: Wednesday, so tomorrow. I don't know if it fits your schedules, folks. Yeah, that's fine. Yes, cool. It might be late for me, because it's the end of the day and I have to take care of the kiddos. Mark answered, and I don't know about Hervé, but Hervé is back tomorrow. It will be recorded; we will take public notes and put them back on the public issue.

Also, there was an issue on ci.jenkins.io caught by at least Mark and a few other people during the weekend. It was nighttime for the Europeans, so most of the US people were affected while the Europeans were sleeping. And last night as well — so yesterday for our American friends — there was another issue. Thanks, Basil, for jumping on that with Mark; it appears you were able to capture some flame graphs and start a diagnosis. That makes focusing on ci.jenkins.io the top priority for the upcoming weeks. And one of the first tasks is that we need to grant you access, Basil, as we said last week — I'm the culprit, I was late this week, so sorry for that.

Yeah, sure. To summarize what Mark and I looked at yesterday: the CPU was saturated on all cores of this controller. When we looked at the flame graph, we saw something unexpected, which was that the JVM was spending a lot of time compiling bytecode into native code — about 30% of the CPU time during a two-minute sample that we took. During the rest of that time, it was executing Java code, mostly either git cloning of pipeline shared libraries or compilation of those pipeline shared libraries.
Now — and by compilation I mean the Groovy code that compiles the pipeline shared libraries, parses the abstract syntax tree, et cetera — whether or not that Groovy compilation was related to the HotSpot compilation, I'm not completely sure, but it's possible they're related. In any case, the git cloning was about one-third of the CPU time. As far as I could tell, the memory usage looked pretty good. So the only problem I could see was that the CPU was saturated, and it was almost exclusively doing something with pipeline shared libraries. But that's all we were able to determine in the hour or so we spent looking at this. We did capture a flame graph, a thread dump, and a heap dump, which I think should be part of the runbook for any incident. I can't remember if we captured the Jenkins logs, but I hope we did — and if not, we should have. Yep, good point.

I had a question in that area that might or might not be related. Yesterday we made a change to the shared library configuration: we added an exclusion to the caching rule, because the system is expected to cache the shared library. So that's why the git clone part... I'm not sure I understand correctly: do you remember if it was a git clone of the shared library itself, or of the repositories being built by the projects? Oh, it was a clone of the shared library. Then I wonder whether the change we made could be related — let me find the issue if I can remember it. So, no, it was on jenkins-infra. What we did is this: Stefan and I were trying to test a pull request on the pipeline library in real life before merging it. After running the unit tests and some end-to-end tests on our side, we wanted to run it on ci.jenkins.io for real. To do that, we put the @Library annotation at the top of the pipeline, pointing to the git reference of the pull request: pull/<number of the pull request>/head. We do that routinely, except that since the library is cached on ci.jenkins.io, we added a temporary exception that we then persisted in the configuration as code: any branch named pull/<something> should not be cached, because it covers only a few jobs and a few edge cases. That's a change we made yesterday, and since it's in the same area, I wonder whether it could be related. I have no proof either way, but it's a change that happened on the system, so I'm mentioning it for the audit log.

Yeah, I think this caching feature is relatively new, so there may be some edge cases or bugs that have not been fully resolved yet in the caching implementation. There was a discussion on the closed pull request, so let me add the link — I'm adding it to the issue. Might or might not be related. It appears that Jesse Glick and Tim Jacomb told us, after it was released, that we shouldn't have had the issue that required us to exclude pull/<something>: the feature is supposed to always fetch the latest reference, and it's only the git clone step that is expected to be cached or not. So, since we pushed new commits to the pipeline library pull request, the system should have detected the new commit and said: OK, it's a new one, I should clone it. That definitely did not happen; we kept getting the initial state of the pull request, even though we were pushing changes.
So that's the reason we initially added the exception. I'm not sure whether, as a safety measure, we should roll that one back. How do you feel about it, folks? I don't have any facts showing that's the problem, so I can't say "let's roll back" — it's just a gut feeling because it's the same area. I would keep it, just to make sure whether the problem is coming from there. Yeah, I mean, I think it's good to do some analysis, like you said, and I'm planning to look into the flame graph a little more closely and see if I can come to any conclusions. But the comment that was made about how the caching should have behaved — that it should not have forced you to make this manual change in the first place — I think that's a legitimate comment. It does sound like a bug in the caching feature if it didn't work in this use case out of the box. So I would encourage you to raise that bug with the maintainers of the pipeline shared libraries plugin, and I think the best way to start is to come up with a simple reproducible test case and file a JIRA ticket. That list of steps to reproduce could then be turned into an integration test, which would be the first part of the process toward fixing the bug in the plugin: if the problem can be shown in a set of manual steps, and later in an integration test, then fixing it is the logical next step. And that's something we could take to the Pipeline maintainers, or even to the person who added the caching feature, because they might not be aware of this particular deficiency; if we can show them a ticket and a set of steps to reproduce, that might be a good way to motivate a fix.

Yeah, I agree on that part. It's a matter of finding a reproduction case. My personal rule of thumb with Jenkins is that it takes me two hours to find a reproduction case, even when it's only one line of pipeline. That's the amount of time it takes me. So I'm totally willing to do it, and we agree that it helps. But I always have mixed feelings about asking users for a reproduction case, because it's not always easy; it's a time investment. We do it because we are the Jenkins community, no problem, but it might be hard for other kinds of users. Yeah, and it's always better to collect all of the state for post-mortem analysis at the time of failure; I don't think we do a great job of that overall in the Jenkins project. That's something we could improve in a wide variety of areas. In this case, reproducing it sounds like: you would need an already-cached shared library, then make an update in a pull request and demonstrate that the cached version is used rather than... It's more than that. You have to have a pull request of a pipeline library, and that pull request is already in the cache. Then you change the pull request: you push, the pull request has changed, but the update is not reflected in the cache. So it's not just a pipeline library not being updated in the cache; it's a pull request, used temporarily as a pipeline library via the @Library annotation, that goes stale in the cache. It's really an edge case. Understood — so there are two levels to reproducing this. Yeah, and that complexity is probably the reason this bug was not caught in the first place.
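To make those two levels concrete, a reproduction in the spirit of the integration test suggested above might look roughly like the following sketch. It assumes the test harness used by the Pipeline: Groovy Libraries plugin (JenkinsRule plus the git plugin's GitSampleRepoRule), and it approximates the real refs/pull/1/head reference with a branch literally named pull/1/head; the class names and the LibraryCachingConfiguration constructor reflect the plugin as of 2022 and should be double-checked against the actual sources before filing.

```java
// Sketch only: verify class names and constructors against the plugin sources.
import java.util.Collections;

import org.junit.Rule;
import org.junit.Test;
import org.jenkinsci.plugins.workflow.cps.CpsFlowDefinition;
import org.jenkinsci.plugins.workflow.job.WorkflowJob;
import org.jenkinsci.plugins.workflow.libs.GlobalLibraries;
import org.jenkinsci.plugins.workflow.libs.LibraryCachingConfiguration;
import org.jenkinsci.plugins.workflow.libs.LibraryConfiguration;
import org.jenkinsci.plugins.workflow.libs.SCMSourceRetriever;
import org.jvnet.hudson.test.JenkinsRule;
import jenkins.plugins.git.GitSCMSource;
import jenkins.plugins.git.GitSampleRepoRule;

public class LibraryCachingPullRequestTest {

    @Rule public JenkinsRule j = new JenkinsRule();
    @Rule public GitSampleRepoRule sampleRepo = new GitSampleRepoRule();

    @Test
    public void newCommitsOnCachedPullRequestRefArePickedUp() throws Exception {
        // Level one: a shared library whose requested version mimics a PR head
        // ref. A branch literally named pull/1/head stands in for refs/pull/1/head.
        sampleRepo.init();
        sampleRepo.write("vars/greet.groovy", "def call() { echo 'version one' }");
        sampleRepo.git("add", "vars");
        sampleRepo.git("commit", "--message=init");
        sampleRepo.git("checkout", "-b", "pull/1/head");

        LibraryConfiguration lib = new LibraryConfiguration("mylib",
                new SCMSourceRetriever(new GitSCMSource(sampleRepo.toString())));
        lib.setDefaultVersion("master");
        lib.setAllowVersionOverride(true);
        // Caching enabled, nothing excluded: the configuration under test.
        lib.setCachingConfiguration(new LibraryCachingConfiguration(30, ""));
        GlobalLibraries.get().setLibraries(Collections.singletonList(lib));

        WorkflowJob p = j.createProject(WorkflowJob.class, "p");
        p.setDefinition(new CpsFlowDefinition(
                "@Library('mylib@pull/1/head') _\ngreet()", true));
        j.assertLogContains("version one", j.buildAndAssertSuccess(p));

        // Level two: push a new commit to the same ref...
        sampleRepo.write("vars/greet.groovy", "def call() { echo 'version two' }");
        sampleRepo.git("commit", "--all", "--message=update");

        // ...and expect the next build to use it. Per the behavior seen on
        // ci.jenkins.io, this is where the stale cached copy would show up instead.
        j.assertLogContains("version two", j.buildAndAssertSuccess(p));
    }
}
```

If the bug behaves as observed on ci.jenkins.io, the last assertion is the one that fails: the cache keeps serving the first commit of the pull request.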
But if that can be documented and tested, then I think a solution could be developed without too much difficulty. It sounds like just yet another if statement, or an edge case that could be handled. Yeah, that's a good tip. Stefan, I think that would be worth doing as a pair exercise: since you asked to learn more about spawning Jenkins instances, this could be a great exercise to produce a partial reproduction on a local instance. You would get at ease with how to spin up Jenkins and build ephemeral setups like this one to debug, and that would make the two hours a worthwhile investment — and it will produce something. It's two hours for you; it's two days for me. So we have to pair up to bring the two days down. No problem. The goal is to make that time a valuable investment, and knowledge sharing always is. Many thanks for the tips and for jumping in. Do we have other points?

In general, with newly released features like this, being one of the early adopters increases the likelihood of encountering these types of bugs. That's something to consider as far as planning goes. If you're going to adopt a new feature, I think that's great — and certainly, if it's released on the Update Center, then Jenkins users are going to adopt it, so it should work. But there is just an increased likelihood. So, for example, if you're having a busy sprint with a lot of other things to do, it might not be a good time to adopt a new feature. I'm pointing that out in case you didn't realize this pipeline shared library cache was a recent addition. Well, not very recent — I think it was added about six months ago or so. But it's good that we are the ones who pinpoint these problems and deal with them. So that's fine.

Can I just add something? I would love to know exactly which process you used to produce the flame graph and do all the debugging. Even if it's not really my job, I would love to know how you do that. Sure, sure — and I would encourage you to do that kind of analysis any time there's an incident. I could write something down; I have written these kinds of runbooks in the past, so I'm happy to write one describing what I did in this case. I saw that Mark was referencing some kind of runbook — I don't know which one, but if I can find it, I'll be happy to add a description of what I did. Basically, at a high level, I downloaded async-profiler, which is an open source Java profiling tool, and ran it on the host — Mark and I ran it together because he had access to the box. What it does is find the Java process inside the container and then collect stack traces every few milliseconds — I think every 10 milliseconds by default, or something like that. It collects these stack traces at every tick for a period of time — say 30 seconds, or two minutes — then sorts them and creates a visualization of the stack traces that were hottest during that period. That's what we were looking at. The tool covers both Java stack traces and native stack traces, so we were able to see the C++ code in the JVM that was hot, which was the compilation I mentioned.
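For reference, a minimal sketch of that capture step. The install path and output file below are illustrative assumptions, not what was used on the box; -e, -d and -f are the standard async-profiler options (event, duration, output file), and an .html output produces a flame graph directly with async-profiler 2.x.

```java
import java.io.IOException;

// Hypothetical wrapper around the capture described above; in practice this is
// a single shell command run on the host, not a program.
public class CaptureFlameGraph {
    public static void main(String[] args) throws IOException, InterruptedException {
        String pid = args[0]; // PID of the Jenkins JVM, as seen from the host
        Process p = new ProcessBuilder(
                "/opt/async-profiler/profiler.sh", // assumed install path
                "-e", "cpu",                       // sample where CPU time goes
                "-d", "120",                       // two-minute window, as in the incident
                "-f", "/tmp/jenkins-flame.html",   // write the flame graph here
                pid)
                .inheritIO() // show the profiler's own output in the console
                .start();
        System.exit(p.waitFor());
    }
}
```

The kernel settings mentioned next are typically kernel.perf_event_paranoid and kernel.kptr_restrict, which the async-profiler documentation asks you to relax via sysctl before profiling.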
And this was not relevant yesterday, but the tool also shows the kernel side of the stack trace. For example, if Java code calls open to read a file, and open calls ext4_lookup to read the file from the ext4 file system, it'll show you that as well. So it's a very useful tool for this kind of analysis, and it's not very difficult to set up and use. It does have requirements — you need things like JVM debug symbols — but our Docker image for Jenkins already has those, so fortunately we had most of what we needed. There are also a few settings to change: we had to run sysctl on the box to temporarily enable some debugging flags in the kernel. But it's really not too difficult to set up, and it's usually the first tool I reach for when dealing with CPU issues, because it's a good way of visualizing where the CPU time is going. Perfect, thank you. I need to search — I used to have a Docker image that you run as privileged on such a machine, when you have the Docker engine, which automated all these settings immediately, for native profiling. For the JVM part, I had never used flame graphs with the JVM, so I didn't know. For the runbooks, is there a public runbook I could update, or maybe a private one I could add to? That's a private one. We have private content; most of it should be public, but we haven't had the time — the risk is that personal names or sensitive information are hidden somewhere, so there is a workaround of extracting content from private to public. Okay, well, I can follow up if there's something to update. I need to create an issue for the upcoming milestone; that's part of what I call runbook access — basically adding you to the private repository so you have access. Okay.

Any other points in that area? Should we wait for the post-mortem to have more information? I'll try to keep track of what we said, so we have an outcome of what we plan to do, as of today. Okay.

Quickly jumping to the next topic: we were finally able to migrate rating.jenkins.io from the AWS virtual machine into the Kubernetes cluster. The migration finished this morning by switching the DNS. It's almost done, but still work in progress, because there are cleanup actions listed on the ticket: we need to put the DNS TTL back to something big now, and we need to clean up the virtual machines and the former PostgreSQL database. Thanks for that work, Stefan. We had a minor issue there. Stefan is also working on replacing Blue Ocean in the default display URL. We asked the developers — we are waiting for feedback from the community on whether we do it or not. Reminder: it's not removing Blue Ocean, it's only changing the default link — when you click on a GitHub check or a generated link — to stay on the classic UI. The more people use the classic UI, the more feedback we can give to the developers and the people revamping that UI in the upcoming LTS and weeklies. That's the trade-off. The work is done by Stefan as preparation; if the community says yes, we can just merge it and that will update the default. Otherwise we close the pull request and move forward.

Hervé started to work on our ability to build our own Docker Windows images on the infrastructure's private controllers; we were only building Linux images. That involves a lot of changes in the pipeline library, because it needs to be able to handle PowerShell or bat commands. And there is a tooling aspect.
We need to verify that each tool we use on Linux today for the usual docker build and docker push workflow works the same on Windows machines. So we are working on that. We didn't have time to work on sunsetting MirrorBrain; I need to write a blog post about it, and I had issues with get.jenkins.io on the latest Docker format problems — solved now, so we can go back to that next. The application to the Docker open source program will move out of the milestone now, because we are waiting for them to apply the change so we benefit from the higher rate limit. Side note for you, Basil: the rate limiting is one of the root causes that triggered the first outage 12 days ago. It might not be the core of the problem, because such an event should not break things or have consequences like these — that's part of the post-mortem. But I wanted to mention it aloud: with the program we will be sure to have a much higher API request limit. If we don't get it, or if it doesn't work, we are still at risk, so we have to keep an eye on that area.

Finally, one last bug: after migrating infra-reports to trusted.ci, the associated change on the pipeline library had a minor impact on the Repository Permissions Updater. I've reopened the issue until the problem is fixed. Basically, we need to update the virtual machine template we're using for agents so that they have the Azure command line installed. That's almost done. So that's all for the work in progress — I don't know if you have other work-in-progress tasks from the past seven days. One, two, three. Okay.

Now, the new or important tasks. We covered the ci.jenkins.io outages, which are the top priority. We have two new tasks to do this week, related to Datadog. First, Datadog announced two or three weeks ago the deprecation of some of the on-call syntax that was linking Datadog to PagerDuty, and we are using these handles, so we have to move to the new PagerDuty integration. I haven't checked the migration path in detail, but the old way will just be deprecated, so we need a solution there. Second, on Datadog we want to add the new monitoring I mentioned earlier for the Artifactory web UI. So those two new tasks are coming onto the stack.

There are also two other, let's say, long-term elements that sit just behind the ci.jenkins.io priority but are still top priorities for the infra. First, the realignment of the repo.jenkins-ci.org mission — a topic started by Daniel Beck a few years ago. We have an issue with JFrog because we are costing them quite some money and bandwidth, and a lot of the usage of repo.jenkins-ci.org is not really legitimate. It appears that many external organizations are mirroring that repository when they should not. It's not meant to be a proxy; tools and services such as Maven Central — or perhaps a public proxy system run by us — should be used instead. The initial agreement between the community and JFrog is that they sponsor us so we can use it for ci.jenkins.io and for the plugin developers. But clearly the top hitters are Artifactory instances mirroring from outside. So now we are working with JFrog — we are waiting for them to extract a list of the top hitters' public IPs so we can start looking up DNS names and identifying who they are. But yeah, we need help, especially from Daniel, about the legacy things.
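Once JFrog hands over that top-hitter list, a first pass at mapping the IPs back to DNS names can be automated with a plain reverse lookup — a minimal sketch, assuming a text file with one IP address per line (the input format is an assumption):

```java
import java.net.InetAddress;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReverseLookup {
    public static void main(String[] args) throws Exception {
        // args[0]: path to a file with one IP address per line (assumed format)
        for (String ip : Files.readAllLines(Path.of(args[0]))) {
            if (ip.isBlank()) continue;
            // getCanonicalHostName() does a PTR lookup and falls back to the
            // literal IP when no reverse record exists.
            String host = InetAddress.getByName(ip.trim()).getCanonicalHostName();
            System.out.println(ip + " -> " + host);
        }
    }
}
```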
There was a discussion one or two years ago, if I remember correctly, about forcing authentication to retrieve artifacts from this service. That's quite a nuclear option, because it would require a lot of work and it could have an impact on contributors: they would no longer be able to simply mvn install a plugin — they would need to configure their local Maven installation first. That might create additional steps for, let's say, new contributors. That was the core of the discussion; I'm just trying to explain why it wasn't done that way. But it would ensure we don't have so many issues, because we have had a lot of performance problems and outages on that service, which is outside our area — and JFrog is hosting us for free, so we need to find a solution. It's impressive: around 20% of the requests made to the repository get an HTTP 404. 20% — that's huge. Of course it creates performance issues when those peaks arrive, because it bypasses their caches and hammers the underlying file system, creating a lot of trouble for them. I don't know what kind of implementation they have — they might have a technical solution on their side — but they asked whether we could provide some data or search for the culprits, because we are not expected to generate that many 404s. Different solutions have been floated. I'm not sure we will be able to work on it next week, because we need action items we don't have yet, but we asked Daniel for help — Daniel, if you could spend some time in the coming days pointing us in a direction. And it's totally worth starting that discussion on the developers mailing list.

And one last important topic I've started working on: migrating the update center to another cloud than AWS, because it's costing us $3K per month in bandwidth. We cannot easily move to Fastly. Technically it's easy: the script that generates the JSON every five minutes can purge the Fastly cache immediately — it takes one or two seconds — so technically no problem. However, Fastly, like JFrog, hosts us for free, and we would clearly explode the bandwidth they expect from us. So the idea is to work with the CDF — the organization paying for the Jenkins organization — to see if we can have a hybrid account: an account where they receive the bill, they waive part of it as part of the partnership, and we still pay for the additional bandwidth. As of last month it wasn't possible for them to set up such an account — an administrative issue, not a technical one. Alternatively — and this can be complementary, an idea Mark and I also raised — we could move to Oracle, because Oracle Cloud has really cheap bandwidth: instead of $3K per month it should be between $100 and $200 for the volume we have, which would be totally fine. Additionally, we could get better performance, because it's a simple web server serving files, and they provide ARM servers, which clearly have a better cost/performance ratio than Intel for this specific use case. That's what we saw with other services we moved from AWS to Oracle without tuning — with tuning I'm sure we can do even better, but even without spending much time we got very nice performance, and it's really cheap. So these are the two main long-term topics, keeping in mind that the ci.jenkins.io outages are our top priority for now. One last thing: Hervé made a proposal that I'm mentioning here.
It's still an idea that needs to be tracked: splitting the Terraform Azure project into two separate projects — one handling the network and the DNS, and the rest of the Azure infrastructure managed in the current repository. The goal is the following. The automatic management of Azure with Terraform was stopped by Olivier two or three years ago, because Terraform provider updates deleted one important DNS zone and deleted one private network. We were lucky it was not the public network, but it could have had an impact. Olivier was alone, he freaked out, and he stopped the automatic management. So two years later we have some archived Terraform that is not up to date, and we are trying to get going again in that area. The problem is that we don't want to take the same kind of risk. Hervé's proposal would help us feel safer adding services and databases on Azure, because we could have two different accounts, and the default account used for most of the infrastructure would not be able to delete the virtual networks: it could only read and reference them, so we can place subnets or services inside these networks. By separating the two, we avoid that kind of issue. It's a nice proposal; we will ask Hervé, when he is back, to formalize it in a helpdesk issue, so that for the next weekly meeting we have something written to share rather than something only mentioned orally during a meeting like now.

I think that's the tour of what we did. The next step, as usual, is moving all the work-in-progress items from the current milestone to the next one, because we still have to work on them — except the Docker one, which is done. I'm going to set the priority on the ci.jenkins.io-related tasks — the post-mortem tomorrow and giving access to Basil — and then we can close the current milestone and start working again. Any questions? I'm going to stop the recording. Stop sharing. Stop recording.