Recording has started. Hi everybody, welcome to this new Jenkins infra meeting. We have a few topics that we'll cover during this infrastructure meeting.

The first one is how Damien solved the LDAP issues we'd had for a while. To give some context: we are using multiple Jenkins instances in the Jenkins project, and on a few of them, each time we tried to log in we got timeout issues. After 20 seconds, our authentication was rejected. The thing is, the configuration was apparently almost the same as on ci.jenkins.io, but the problems appeared about two weeks ago. And Damien spent quite a lot of time investigating to try to understand what was wrong there. So maybe Damien can explain a little more.

Yes. So, based on some feedback from the team first: in fact, the issue had been happening from time to time for months, but it started happening more and more recently. So we tried different things, which are all written up in the associated issue. The conclusion is that we had to use exactly the same Jenkins Configuration as Code LDAP configuration as we have on ci.jenkins.io, which currently works. In particular, we had to specify search bases: requests to the LDAP server were no longer made from the root DN but from the search base, so instead of searching the whole directory, the search was scoped to a subtree. Since we have a lot of entries in LDAP, the issue was that when Jenkins sent a request, the time to process it on the LDAP side, plus the size of the response, meant it could not be handled in under one minute before the client timed out. The result was a stack trace and an HTTP 500 error from Jenkins because of the timeout. So by specifying the correct search base for both users and groups, and also adding a case-insensitive filter to improve the usage of the LDAP indexes, we were able to get the requests working correctly.
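The effect of the search base change described above can be sketched with `ldapsearch`. This is a minimal illustration, not the actual infra configuration: the host, base DNs, and filter below are assumptions.

```shell
# Hypothetical sketch: scoping an LDAP search to a subtree instead of
# the root DN. Host and DNs are illustrative, not the real infra values.

# Unscoped search from the root DN: the server walks the whole directory.
ldapsearch -x -H ldap://ldap.example.org \
  -b "dc=example,dc=org" "(cn=some-user)"

# Scoped search: restrict the base to the users subtree, so the server
# only has to walk that part of the tree.
ldapsearch -x -H ldap://ldap.example.org \
  -b "ou=people,dc=example,dc=org" "(cn=some-user)"
```

On a large directory, the scoped variant is what keeps the response small enough to come back before the client-side timeout.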
So it takes almost 10 to 12 seconds the first time after a restart, the time to populate the cache. Then Jenkins has its cache populated and it works very well. It's a bit slower on JDK 11 than on JDK 8, though, and it looks like there are still some open issues on the LDAP plugin with JDK 11, where some classes on the class loader are not there. So it writes some warnings in the log, but it's only about 10 seconds more for the initial request on JDK 11, and then no difference.

One of the things we noticed was that the member attribute of the groups was not indexed, so we also saw many errors. This is something we could easily add to the LDAP container; it's a one-line configuration. The only thing is that the first time we index that new member attribute, it will take around 20 minutes, so LDAP would be down for about 20 minutes. Damien, do you think we still have to do that? No, it should be okay. Even though it could be good for the efficiency of the requests to have the index on this attribute, because we have a lot of requests coming from release.ci and infra.ci.

Another point was the RBAC model we are using on infra.ci: assuming it's behind a VPN, administration is granted to everyone in the group "admin", and read-only access to view the jobs is given to people who are part of a group named "all". By switching that group "all" to "authenticated", which is the authenticated virtual group in Jenkins, meaning anyone able to authenticate through the realm, I was also able to decrease the initial request time from 12 seconds to 6 seconds. This was a tip from Daniel Beck. It also looked like a good idea in the sense that even if the instance is behind a VPN, unauthenticated access should be prohibited, because it has access to all the infrastructure.
The threat for this case would be a process running somewhere in the cluster, or one of the agents of infra.ci or release.ci executing a malicious process, that already has access to the internal network of Kubernetes. So it doesn't change anything for usage; we only have to be sure that someone has an account in LDAP if they want to access infra.ci and read it. You must authenticate with a valid account to be able to read the list of jobs.

But right now you removed the group. The idea in the past was to allow every member of the group "all" to have read-only access to those Jenkins instances. So in this case, in order to authenticate, you need to be in the admin group. We lost you, Damien. Sorry. No, that's not what the RBAC rule says. The RBAC rule says: if the currently authenticated user is a member of the group "admin", they are granted admin rights; otherwise, if they are only authenticated, they are granted the read rights. So it's the inverse of the logic you described.

With this, it looks like infra.ci is able to work again, so I encourage everyone who has access to test it. I will wait 24 hours, and if everything goes right on infra.ci with this new setup, I will propagate the setup to the release.ci instance as well. Specifically for release.ci: it was not affected, in the sense that it is still working. Normally a weekly release would have been published by now, so it would be nice to double-check that, because if something happened there, we may have to apply that configuration to the release environment sooner. I'm not able to access the release instance today: I get either a 503 or a 504. It looks like it might be another issue. But in any case, I will wait for tomorrow; we need at least one full day of testing on infra.ci to be sure before we apply the same setup.
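As an aside on the indexing point mentioned earlier, adding an equality index on the group member attribute in OpenLDAP is indeed a short change. This is a hedged sketch assuming an OpenLDAP server with the `cn=config` backend; the database DN (`{1}mdb`) and paths are illustrative and would need to match the actual deployment.

```shell
# Hypothetical sketch: add an equality index on the "member" attribute
# via the cn=config backend. The database DN is illustrative.
ldapmodify -H ldapi:/// -Y EXTERNAL <<'EOF'
dn: olcDatabase={1}mdb,cn=config
changetype: modify
add: olcDbIndex
olcDbIndex: member eq
EOF

# Existing entries are not reindexed automatically; rebuilding the index
# offline is what can take on the order of 20 minutes on a large tree:
#   slapindex -F /etc/ldap/slapd.d member
```

The offline `slapindex` step is the reason the directory would be unavailable for a while the first time.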
Basically what I'm wondering... okay, right. So the Debian package was pushed today, several hours ago, and Datadog hasn't complained, so I guess all the packages are available. I think everything went fine, so it should be okay. I've been having issues downloading the Docker images for the new weekly version; it's not available. Yeah, it's not available. But the weekly is not built on infra.ci, right? It's still built on trusted.ci, so it's a different issue. And the Docker image usually takes more time to build. Yeah, part of the weekly release checklist is to invoke the Docker container build interactively. So you're saying that the weekly has been delivered, we've seen that, we just don't have a container for it yet. Is that what you said, Gareth? I think that's the case. Yeah.

Okay. So I guess we could simplify the process by moving that job to release.ci. I think it would be easier to do everything in one location instead of having to connect to trusted.ci. Yeah. Does that then block people, Alex for instance, from seeing status that he could see before? If it moved to release.ci, can he not see it any longer? For Alex it should be fine, because I guess he's still in the right group. No, I wouldn't worry, because anyway very few people have access to trusted.ci, and for those people we could easily grant access to release.ci. Got it. Maybe they won't be able to trigger a job, but at least they should be able to watch the results. Right. Is that all on the LDAP topic? Yep. Thanks.

Let's move to another topic. The next one I want to bring up is Captain Hook. This is a small project that Gareth has been working on. We haven't deployed it yet, so we haven't tested it at the moment.
Maybe Gareth, you want to explain a little? Yeah, sure. So it came out of the Contributor Summit, on the Cloud Native track: one of the illustrations of running Jenkins in a cloud-native Docker environment was that when it restarted, you tended to lose webhooks. And if you're trying to continuously deliver Jenkins, it's restarted quite a bit, especially if you're using the process of building the plugins into the Docker image and doing it that way. So the idea was a simple, very lightweight webhook handler that could store and forward webhook events and get them to Jenkins. I can paste the link to the actual repo in here.

It seems to be working quite nicely. I have it on a test cluster, and hooks are being stored and forwarded. I can take Jenkins down for a period of time, and when it comes back up, they all nicely recover. It adds the hooks in as a CRD, so you can debug what's going on by doing "kubectl get hooks": you'll see a list of all the hooks that are there and their status. You could add something to force a retrigger of a particular one, or at least you'll see the error message about whether your hook is being rejected or not. And I think that's about it. It's probably ready to go into infra, at least for a bit of testing, because we can test on a subset of repositories.

I suppose the questions are: should we use a different ingress host, and then do we need firewall rules to allow that host through, or something else? Maybe some DNS settings if there is a new DNS name for that particular endpoint. The reason the build is failing at the moment, actually, is that it's unable to generate the diff comment, and I don't know why. But now that I can get back into infra, I can at least investigate it. Regarding having a different ingress: I think we definitely need a different ingress. Having a new DNS name is pretty easy, so maybe we can just take 30 minutes and do that together.
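The CRD-based debugging Gareth describes above can be sketched like this. The resource name "hooks" is taken from the transcript; the specific resource fields and columns shown are assumptions about how the CRD is defined, not confirmed details.

```shell
# Hypothetical sketch: inspecting Captain Hook's stored webhook events,
# assuming it registers them as a "hooks" custom resource.

# List all stored/forwarded hooks and their status columns:
kubectl get hooks

# Drill into one event to see delivery attempts and any rejection error:
kubectl describe hook <hook-name>
```

Because the events live in the cluster as resources, they survive a Jenkins restart and can be replayed once the controller sees Jenkins is back up.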
We just have to decide on the name. So does that mean that when you add the new DNS name, GitHub will be told what that location is and will send webhooks there in addition? Is that how it works? You could configure it either way. I would probably recommend configuring it on a repo-by-repo basis initially, so that we start off with some repositories that aren't particularly important, just to see that events are going across, because if it starts to fail, we don't want to break too much. So update those manually. Apparently there is a mechanism, in the GitHub Branch Source plugin or one of the related plugins, where you can specify a different webhook endpoint and tell it to update everything. Although, from my investigation, the way we have infra and release configured, it needs a different style of credential to be able to handle that. I'm not quite sure.

So the way we are doing it on most of the jobs is using a GitHub organization folder. In the case of the GitHub organization folder (I don't know for multi-branch), you're using either an OAuth token or a GitHub App. In both cases, the GitHub organization folder has the authentication and the rights to create the webhooks. By default, it tries to create them at the organization level, which means you don't see a webhook per repository: it is an organization-level webhook that says, each time a repository in that org has a new pull request, or whatever events you select, it will send them. So the configuration is centralized. I know for a fact that multi-branch projects should create webhooks automatically if you define a GitHub source, but in reality that's really a pain, and I'm not sure it's working as of today. It wasn't last year; it was not working on the LTS for the whole year. So this is something to be checked. That's a good point you're making, Mark, because it's really a pain to manually check the webhooks.
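For reference, creating the kind of organization-level webhook described above can also be done directly through the GitHub REST API. This is a sketch: the organization name, endpoint URL, and event list are illustrative, and in practice the plugin or a GitHub App would manage this rather than a hand-run command.

```shell
# Hypothetical sketch: registering an org-level webhook that points at a
# custom endpoint (e.g. a store-and-forward handler). Org, token, and URL
# are placeholders. Org webhooks use the fixed name "web".
curl -s -X POST \
  -H "Authorization: token $GITHUB_TOKEN" \
  -H "Accept: application/vnd.github+json" \
  https://api.github.com/orgs/example-org/hooks \
  -d '{
        "name": "web",
        "active": true,
        "events": ["push", "pull_request"],
        "config": {
          "url": "https://hooks.example.org/github-webhook/",
          "content_type": "json"
        }
      }'
```

A single hook like this covers every repository in the organization, which is why the configuration is centralized and no per-repo webhook appears.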
I expect, when I create a job, the webhooks to be created, or to be told that they weren't, with a message, which the GitHub organization scanning does correctly. But yeah, we have to check. And you are correct: I'm sure the GitHub organization folder is able to specify another endpoint. We would have to add the webhooks manually to most of the Docker repos.

But how does that work? So basically, you receive two different webhooks: the classic one that arrives right now, and then a second one that is just held while Jenkins is down, like a cache. So you would receive two kinds of webhooks at the same time? That's if we have the organization-level hooks installed. I mean, I don't know; we have no access at the organization level to see or debug whether or not that is even happening. My thinking is that it's not happening, because none of the repos are getting webhook events anyway, unless you actually go in and define them manually on a per-repo basis, which we should not have to do. I think it could be because there's a difference in the credential type: the bit that updates the webhook at the organization level can only handle a particular credential type, and we are using a different one. That's why you can't invoke it manually.

To step back: I think we had to configure a GitHub server on the main Jenkins configuration page at one point. And then, when you create a multi-branch GitHub source, internally it tells the Jenkins administration layer to declare the webhooks through this kind of internal proxy. So it was not directly the multi-branch job itself, as far as I remember.
And it's this kind of internal setting that sometimes pops up an administrative monitor message saying, hey, I was not able to create such-and-such webhook event because of this or that. And sometimes you get a bunch of them after a restart. But yeah, we have to check all the cases, because there are a lot of cases depending on the kind of job. This one is really tricky. And to answer your question, Olivier: if webhooks come from both Captain Hook and another source, it will trigger two scans on each changed repository, and Jenkins will determine that there is no build to run again, because if the first scan triggers a build on a given commit, the second scan will say there is nothing to do. So it's not an issue at all; we can have multiple concurrent, different webhooks. Okay, thanks.

Right, so I guess we are good on the Captain Hook topic. The next one is brief: Mark published a blog post about the update center certificate rotation. The initial proposed date was next week, on the 22nd, but because of the time it took to finish the blog post, we'll do it on the 29th. Generating a new certificate is pretty simple; we just have to modify trusted.ci with the new certificate, so that the next time the update center regenerates its data, it uses the correct certificate. Again, it won't affect users who have updated their Jenkins instance within the past two years, so that should be fine. But it remains a pretty big change anyway.

Next topic, which is about on-call experience improvements. So Mark, you had a session with Gareth and Damien two days ago, I think, about that. Do you want to share that here? Yeah. What we did was go through a series of exercises trying to understand what we could do to reduce, or refine, the frequency of alerts coming from Datadog. I had received an alert on a weird response time.
And so, with Damien and with Gareth, we talked it through and thought: okay, what we probably ought to consider, rather than monitoring on HTTP response time, is to shift our monitoring long-term to something higher level, something closer to the user, because a short-term HTTP response time spike wasn't actually an indication of a user problem. It was just a hint. And what Gareth noted is that Datadog now has the notion of SLOs, and we can use those to give ourselves more accurate, more reliable alerting. Gareth, did I state that reasonably? Did I miss something? Yeah, that's right. In the short term, I think we may just adjust the thresholds, and Damien now knows how to adjust thresholds through Terraform changes. Sorry, and that's all I had.

Just a word on the thresholds. We had quite a lot of notifications about services getting slow over the past few weeks, and most of the time those issues appear and go away pretty quickly; it's just a small peak in usage. The challenge we have here is that we are using Datadog to monitor the HTTP endpoints, and historically we were just monitoring endpoints from one machine, the Puppet master. That machine was configured to ping the main website, the plugins site, and so on. Several weeks ago, we also configured the Datadog agent running on our Kubernetes cluster to ping those endpoints, which means that instead of checking whether the services were up from one location, we are now doing it from that location and also from the Kubernetes cluster. At the same time, we are also using Datadog Synthetics; in that case it's Datadog who provides the monitoring agent, and they check from, I think, Germany. But the challenge is that we are checking whether an endpoint is available from a given location. Typically, websites located in the U.S. are pretty responsive; they usually answer the request in under one second.
But at the same time, services that are closer to China, like the mirrors in China, usually take three seconds from the U.S. to answer the request. So, yeah, that threshold can be tricky to set, depending on how close the monitoring agent is to the service. Don't forget, it's not the thresholds we are after; that is what the SLO defines, so we should not change them. The threshold is just the vertical line on the graph; that part is easy. What we want with the SLO is to say: if this peaks too often, across different probes, then it means it impacts users, and in that case it should raise an alert and wake someone up. So it's a second level that aggregates the results of different probes and applies more than just an average, because the average function is not enough there. I was just explaining where the data comes from, and that we have to keep that in mind. Okay, thanks. I just wanted to be sure there were no misunderstandings.

So, yeah, the last topic is the infra budget update from Oleg, which we wanted to bring up here. Is there anything new? Just a quick update, but good news. The CDF board has reviewed the 2021 budget, and they confirmed that we can proceed with the same budget as 2020. What it means is that we still have 10K for the Azure sponsorship, plus some additional expenses if we need them. There is a disclaimer that they might reach out to discuss changes if more projects join the Continuous Delivery Foundation, but for now we should be fine. And from what I've seen in the Azure console, we are well below the limit. So, yeah, good news for everyone. That's great news, and basically what you're telling us is that we should pay attention not to accept new projects in the CDF, right? Right. Yeah, that's perfect. That's really good news, because it means we won't have to pay too much attention to costs.
I mean, last year we spent quite a lot of time reducing the costs, and it's really nice that we can work on other things. Yeah, of course, the CDF would appreciate it if we keep reducing costs. Yeah, obviously. But there is no strong pressure to hit a lower budget target at the moment. That's great. Anyway, our objective is always to rely less on sponsoring: the less we have to pay for, the better, because if we don't have strong pressure to reduce the costs, it means we can plan more in advance. But that's definitely great news. Yeah, and we can add more services if needed. For example, the pending proposal about security scans, which might require some more tooling on our infrastructure, or the plugin release delivery pipeline, which is definitely expected to happen on our infrastructure. Since we now have some budget we can use, we can fund these efforts. What I've noticed is that right now we are usually between 8K and 9K; sometimes we go above 9K, but that's nice to know. Thanks. Thanks, everybody, for your time.

We have two minutes left before the end of the meeting, so if you want to bring up a last topic, now is a good moment. I need help on the budget part, to evaluate how much we should ask from Scaleway. Taking into account their last mail, I'm not sure they will give things for free; it looks like they are going to give us a discount instead. So I might need help evaluating how much, and guiding me: now that I'm sure we have 10K per month, I have a rough measure of what the biggest provider gives us, so targeting 5% of that, for instance, could be a reasonable way to frame it. If you need it, we have a Google spreadsheet where we list all the costs and sponsoring and so on.
It's not a document that I've shared publicly, but I would be more than happy to grant you access to it. If Scaleway is just offering a discount, it's unlikely that we would proceed there, when there is the opportunity to get actual credits from other providers. I'm not sure I follow. So Scaleway is basically a French hosting provider, right? Yes, with locations in the Netherlands as well, and Bangkok or Singapore. If I understand correctly, there is no sponsorship offer, only a discount offer. It's not settled yet, so we have to ask; until they say no, I consider that they can still sponsor us. But I feel like we're heading toward the discount route, and in that case it will be more complicated. That's roughly akin to what we've seen with Oracle: Oracle has offered us a discount, same thing. So it's worth considering. They've also donated 1,500 to the project outright, but 1,500 in terms of our infrastructure costs is a tiny amount, right? We have 10,000 a month that we're spending at Azure, and I would guess a comparable amount at AWS right now. So we've got significant expenses, but it's worth at least evaluating, and continuing the conversation with Scaleway.

So right now, the Jenkins infra project is spending close to 20K per month, and Azure represents half of that cost. The thing is, it takes time to put a billing process in place. So if we just have a discount, that's annoying, because it means we have to work with the Linux Foundation to handle invoices and things like that. So if we have to go with the discount, which is possible, we have to be sure it's worthwhile. Let's say the benefit is just a discount for one year: it's probably not worth the time we'd spend. But otherwise, yeah, a discount may be useful as well.
But I think we'll have a better vision once you're done with the Amazon cluster. Once that is working and we can use it for ci.jenkins.io, it will already be easier to identify how much we need. For information, on the issue, if you're interested, I've already made some measurements and predictions on the cost and the kinds of machines. One of the areas I want to explore is using machines that don't use EBS, the block storage mounted for persistence between machines, given that the goal of these clusters is only to bring up ephemeral agents. Switching to machines that have local NVMe drives instead, which AWS provides, and Scaleway and GCP as well, means you don't have to pay for the EBS: the added cost of the NVMe machines is roughly offset by the EBS savings per month. The cost should be at least half of what we spend on the AKS cluster for ci.jenkins.io today, and the performance is a grade better. Also, we would not be tied to a cloud-provider-specific block storage, so we have good reasons to avoid it; better to have a fast local drive for that. Thanks.

I have one last topic I just want to bring up. I don't think I mentioned it, but I worked on Keycloak over the past week to update it, and I would need some help from someone with Java experience. One of the blockers we had to replace the account app with Keycloak was the rules we put in place for the username during the registration process, and Keycloak allows us to override the class that defines the registration process. So we could inject some pieces of code from the account app into Keycloak, and then we would be able to finalize the migration from the account app to Keycloak. So I put the documentation here; I won't spend too much time on it.
But yeah, if someone is willing to give me a hand here, just to identify how much effort is needed to do that, I would be more than happy. So basically, the two last features missing before we can officially use Keycloak are: first, for Daniel, the account app automatically injects users into JIRA, which is not done by default because of the way the LDAP connector works on JIRA; and second, the account app ensures that we don't allow reserved names like "admin" and such in the username, and this is also something that needs to be added to Keycloak. But yeah, that's it for me. I propose we finish the meeting; we are a little bit over time. So thanks, everybody, for staying until now, and see you. Bye-bye.