Hello everyone, welcome to the Jenkins Infrastructure Weekly Meeting. Today is the 23rd of January 2024. Around the table we have myself, Damien Duportal, Hervé Le Meur (not there yet), Mark Waite, Stéphane Merle, Kevin Martin, Bruno Verachten, and [unclear]. Let's get started with the announcements.

First of all, as a reminder, we didn't have a weekly release today: it has been postponed to tomorrow and will be delivered during the security release. That's the second announcement: a core security release was announced one or two days ago, scoped to a new LTS line, the former LTS lines, and the weekly line. — I'm surprised there would be something on a former LTS line. — Oh, maybe I'm saying something wrong; I thought there was at least the previous LTS. If they do it, that's great, I just wasn't aware of it. — OK, I might be wrong, so don't take that information as confirmed. 2.426.3 is the one I was expecting; I wasn't assuming they would release any earlier version. On a .1 they have sometimes done security releases, but I don't remember them ever doing that on a .3. — I'm not sure, so let's assume we have at least one LTS and one weekly: that's already a busy day.

I will need someone to drive the meeting next week. I will take care of publishing last week's and this week's recordings tomorrow, of course, but I will be off next week Monday, Tuesday, and Wednesday, and I won't be reachable at all; that's why I need someone to drive the weekly meeting. Hopefully it will be a quick one since, as you can see, everyone will be going to FOSDEM the week after. We still have a packed milestone for the upcoming week, but in two milestones it will be really slow. So I had an open question in a separate discussion: shouldn't we just cancel the Tuesday infra meeting the week after FOSDEM, given that, for instance, I'll be exhausted, spending all day Monday on a plane getting from Belgium back home, and others have similarly long travel days on Monday? Would it be OK, would it be safe, to cancel the meeting after FOSDEM? Let's vote: thumbs up or thumbs down, everyone. Yeah, I see enough thumbs up to say yes, good. So I'll remove it from the calendar. Thanks, Mark. OK, I will adapt the milestones when preparing for next week, and I will prepare them in advance. So we'll all see each other on the 13th of February, then. Is there any question, point, or additional announcement? Nope, OK.

So, next weekly tomorrow. I don't remember the numbers, let's have a look at infra.ci: I think it's 2.442 for the weekly, and 2.426.3 for the LTS tomorrow; I don't remember the numbers by heart. Yes. So of course we have the Jenkins advisory tomorrow, which means we must not break the infrastructure, except for the NAT gateway and, let's say, the most urgent tasks. Next major event: FOSDEM next week. And I think it's the 13th and 14th of March for SCaLE. — Correct. — Is that OK for everyone?

I might have an announcement related to operational elements: there might be an unavailability of ci.jenkins.io, either Thursday or Friday, depending on how things go tomorrow. That needs to be announced, and it might be only a 24-hour advance announcement; I would have wanted more, but it depends on the security release. Unavailability for migration, either Thursday or Friday: the goal is to migrate the controller to the new subscription so we can avoid paying for it on the old one, even if we still have one week left in January.
There will be a preceding step on Wednesday: an upgrade from 2.426.2 to 2.426.3 on that controller. — But that's not this; this is really a change of which host we're using to run it. — Exactly, thanks. And the upgrade timing depends on the security release. Great, thanks. I will give more detail as we get closer, but that's a major announcement.

Let's get started. A user lost their permissions on the GitLab Branch Source plugin repository. That was a consequence of the security release process, which locks some elements when we get close to a release, to avoid people accidentally merging to the main branches and forcing the security team to transplant changes and release again the bits they staged a few days before the release. So that has been fixed and closed; it was a matter of communication, and in the end there is nothing to improve as I understand it, because the plugin maintainer did not see the email: they tried to get access, opened the issue, and then realized they had an email saying "hey, you won't have access to your repository for a few days". So nothing actionable on the infra side. I don't know if you have questions. — I'd encourage installing the application that shows a banner on the repository when you are a maintainer. — I'm not aware of that application, so I'm learning something in the process too. What kind of application is it? — It's a browser-side application which adds a banner on the repo when there is a security release concerning the repository. Sorry, I'll have to check it myself. — OK, worth a check. We will see Wadeck, as far as I understand, during the first day of FOSDEM; is that correct, Mark? — Yes, that's correct. — Then that will be the perfect place to ask him.

Thanks, Mark, for handling the Jira license: it's been renewed, thanks to the Linux Foundation for doing the work. Thanks, [unclear] and Mark. Thanks, Alex, for taking care of the Crowdin request. Crowdin is a system to help translate plugins, if I'm correct: a maintainer requested access to that system in order to help translate the labels of their plugin. So nothing actionable for us.

We have seen issues with the Windows agents on ci.jenkins.io. It looks like it was a combination of different problems, and the main problem, the SNAT exhaustion, is not completely gone; I will give more detail later. We had issues with the setup of our public network, the network used by these agents: that's the first root cause. It was causing outbound requests — either requests reaching the agents on ci.jenkins.io from the artifact caching proxy (ACP), or requests from the machines on that network trying to reach the outside — to pile up, and the piled-up connections were hitting the maximum number of outbound connections, effectively slowing down the agents. The specific case of Windows agents inside containers is that our virtual machines have their own virtual network with its own setup, but here the network was partially shared with the faulty network, because the Windows container agents use a service on Azure whose acronym I always forget — Azure Container Service or something — which under the hood is Kubernetes managed by Microsoft themselves. So some of those agents were spawned on the same network as us and were slowed down. But there were also Azure incidents, mostly network-related. — It's always DNS, right? That's what we say. — I don't know what the detailed incident was, but yes, Azure was having a lot of issues.
I'm amazed, because they did not publish anything on their status site or their blog, but the Azure console was mentioning the incident; something really weird there. Now we have contained the SNAT exhaustion problem — without fixing it yet — and those problems are gone: we saw normal build times on the Windows agents again. That's why the issue is closed. However, please note the network problems on our side are not fixed yet. Mark, can you confirm that you haven't seen, let's say, excessive build times on this kind of agent in the past days? — I have not. I've not been actively looking for it, but I've seen no issues. — Cool.

The next issue is partially infrastructure, partially the Jenkins project itself. Once again — it happens about every six months — there has been a rebuild of existing tags on the official Docker images, which is really more than annoying. In this case it was only on the weekly; the LTS release was only partially impacted. So people with good practices did not see any impact; people using "latest"-style tags, such as lts-jdk-something, were impacted. Somehow, I hope it's a valid lesson to say "hey, latest can change at any moment" — I'm only partially joking, because in this case it was a regression to an older version, which is not a good thing. The reason? We don't know: no one was able to understand why trusted.ci sometimes decides to rebuild an existing tag marked as already built. It makes no sense; there is no log anywhere, no message, it's a mystery. So I've proposed a radical method: remove these old tags. — I was thinking you would say "nuke trusted.ci". — We could, but then you'd have to run and hide, because trust me, people like Daniel are really, really good at finding you in that case. Removing all the old tags should get rid of the issue entirely; the problem should be solved, because there is then no reason to rebuild the other tags.

Hervé, can you give us a summary of the split you were able to achieve between the weekly.ci and infra.ci images? — So, since last week, building both variants from the same repository was already active, but I introduced a different lifecycle between them: when the plugin list of only one variant is updated, only that variant is built. Then, since we consume these images in kubernetes-management, I had to adapt the updatecli manifests to retrieve either the infra.ci variant or the weekly.ci variant, depending on which tag suffix was available. As a next step, I want to add a step to the Docker image build so that it reports what happened to the GitHub release notes, not just a list of tags; that way we could check directly when only one target has been built, and which image is available per tag. I'd like to extend that to other Docker repositories later, but we'll see. So the issue is closed: each variant has its own lifecycle now and is ready to be consumed. — OK, does that mean Stéphane can proceed with the plugin updates on both today? That would test the whole process before tomorrow, when we will have to deliver the weekly bits quickly. Is that OK for everyone? — Yep. — Cool. — To add quickly: I tested today building only the weekly.ci variant, and the created pull request was only for weekly.ci, so all good. — Nice job. OK, Stéphane, the road is yours, then. — Cool.

We also had two issues closed as "not planned", because the requesters never answered. Thanks, Mark, for taking care of giving them pointers — but they never answered. Now, work in progress.
In terms of priority, just one point: I did not have time, due to my days on and off, to summarize the budget consumption. We are doing good, but not good enough — that's the summary — particularly on Azure. Given how the CDF pays the bill, to keep the subscription constrained and everything, our top priority for the two immediate weeks will be to decrease the Azure billing on the normal subscription paid by the CDF. For that we have to work on two areas — and I want to stop this video system, it's driving me crazy, I don't know how to disable it. So we have two main problems, two main tasks.

The first one should be quick; we haven't worked on it yet. Let me get it — OK. The first one is around get.jenkins.io. I've opened an issue underlining that, while searching through and watching the Azure billing, it looks like we spend $1.8K, up to $2K, per month only on the shared storage used by get.jenkins.io for mirroring the downloads of the plugins. The thing is, that should be cheap storage: it was initially used instead of a disk not only to allow sharing it — mounting it concurrently on different pods and machines — but also because it should be cheaper than an SSD. The truth is, it's cheaper for the storage part itself, but we are billed per operation and transaction, and the pattern we have, with the update center running every 3 minutes — meaning it writes every 3 minutes — means we are paying a lot for these operations, which also creates other subsequent problems. I've made two proposals based on the Azure documentation. The first one is to go back to SSDs, which involves a lot of engineering but means better performance. The other proposal is to switch to a premium storage account, which means creating a new storage account, migrating the data, and migrating the application and the update flow. In both cases we should be able to decrease from $1.8K to around $300 — that's really impressive — based on the projections, unless we missed something, of course; but theoretically we should be able to drastically reduce our bill on that one.

The initial proposal we discussed — Stéphane, Hervé, and I; I've also pinged others such as Mark, Tim, and more — is that the migration to a premium storage account will be the next step, because that's the option requiring the least operations while keeping most of the safety; it should be a good middle ground. The idea is that premium storage is a different implementation, so Microsoft cannot convert the existing storage account in place: we need to create an empty one and copy the 500 GB over. However, premium storage is not billed per transaction — that's why it's "premium" — but it costs way more per GB. The order of magnitude: it's the same cost as one premium SSD disk, but it keeps the property of being mountable on different machines at the same time, which is why it's the closest to the paradigm we have today; so I would prefer to stick to that model. Is there any question on that topic? That one should be one of our top priorities.

Stéphane, do you think you can get started on creating the premium storage in Terraform? Because that's the second trick: the storage account we have here was never managed through Terraform, it was managed manually, so we will need to start from scratch and copy the data. That will be an opportunity to clean up, but it could hide some surprises. So the first step here is to create the empty storage; then we will find a machine that will take care of copying the data, and once that's done we can start preparing the migration.
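As a rough illustration of those two steps with the az CLI — the account, share, and resource group names below are hypothetical placeholders, and the real change is meant to live in Terraform rather than in ad-hoc commands:

```bash
# Hypothetical names, for illustration only.
RG=get-jenkins-io          # resource group
SRC=getjenkinsio           # existing standard storage account
DST=getjenkinsiopremium    # new premium account

# Premium file storage is billed per provisioned GB, not per transaction.
az storage account create \
  --name "$DST" --resource-group "$RG" \
  --kind FileStorage --sku Premium_LRS

az storage share-rm create \
  --storage-account "$DST" --name mirror --quota 1024

# Server-side copy of the ~500 GB with azcopy (SAS tokens elided;
# see the stored access policy discussion later in the meeting).
azcopy copy \
  "https://${SRC}.file.core.windows.net/mirror?<SAS>" \
  "https://${DST}.file.core.windows.net/mirror?<SAS>" \
  --recursive
```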
So, Stéphane, are you OK to start the work of creating the empty premium storage? — I will try: first the update of the plugins, and then I will try to start. That means an empty Terraform file for get.jenkins.io? — Exactly, and you can start from Hervé's work on the updates.jenkins.io.tf file, except that you have to set the new storage account to premium — and not the VM part, just the storage. — OK.

Related question: could the LDAP file share benefit from a premium storage too? Low in size, but many operations, I imagine. — I'm not sure; isn't it a disk? — OK, I ask because we had another issue in the azure repo, which I linked to mine, where we were trying to recreate the LDAP file share as code. — That's a good point: worth checking how much we spend on that shared volume for LDAP as well. Let's see. I think if it's not a disk, it should be one; I don't see a reason for using a file share there — maybe it's for the backup, I don't remember — but yes, worth checking whether it could benefit from the same upgrade. Is there any other question on that topic?

OK, next topic: migrating ci.jenkins.io to the new subscription. That's the second way to ensure we decrease the bill. We have created a new VM in the new subscription. The next steps, described in the issue, are: set up the machine with Puppet; start copying the data of the current ci.jenkins.io secondary drive; once the data has been copied once, test whether a controller can start properly on it; then we plan the migration, copy the data one more time once the security release is done, and we can start the real migration and run everything on the new VM with the fresh data. We already did this when we changed the virtual machine; it took one or two days, and now we have everything set up, so the only issue that could appear would be permissions for spawning or accessing agents, and that will be easy to test during the migration. The expected gain is around $500 monthly, so that's still something visible. (It looks like the dollar sign has a special meaning in Markdown, by the way.) Any question on the ci.jenkins.io migration? I'm taking care of this one, and I'm going to try to associate either Hervé or Stéphane, depending on where we are — until now I was working a bit in submarine mode, so now it's time to share with the rest of the team. Is that OK for everyone? — This one, the transition, we hope to do after the security release: this week even, or next week? — This week; I really hope for this week. — OK, great, thank you. And with that change, Damien, does it look like we're within range, within budget, in Cost Explorer for January, or do I need to forewarn the CDF that we'll be a little high in January? — We will be a little high; I don't know how we could avoid it, I don't have other solutions. — No problem, I am happy to alert them that we're taking steps and that we'll be a little high. — That's great, thank you. Any other question on that topic? OK.

Next one: the SNAT port exhaustion. We completed the first intermediate steps. SNAT ports are the mechanism used when you have an outbound request: for instance, you are on an agent, or somewhere inside our system, and you have a request going to the internet, outside your Azure cluster. In order to correctly answer back, the remote interlocutor must have a couple of IP and port; and since you have just a few public IPs, usually one, that means each outbound request needs to allocate a port on the public IP, so that the answer can come back through that port. Otherwise, if two concurrent systems reach the same address at the same time, say google.com, how would google.com know which of the two processes sent the request? That's why you need distinct ports on the public IP.
The problem we had here is that we were reaching the threshold of the maximum number of ports that could be allocated, meaning a new request wasn't allowed: it was piled up and queued, waiting for a port to become available in order to reach the internet. We had already decreased the TCP idle timeout of these requests: by default you had to wait 30 minutes until a connection was confirmed closed; now it's only four minutes. That's the setting that made the Windows container agents much more efficient. Then we tried adding more ports statically instead of dynamically — by default, Azure adapts the number of ports by dividing the pool by the number of hosts needing to reach the outside. Problem: it did not have any effect, because as soon as we had a deployment, a scale-up and then scale-down of our systems, adding one more host, Azure was moving and killing all of our routes. As soon as we touch the node pools on the cluster, we're dead; we add a new machine, we're dead. It made operations worse. So, to mitigate temporarily, at least for a few days, we added more public IPs. That's a solution that helps; however, we still have a statically defined split, so some public IPs are full and some are empty. Going back to dynamic allocation could help, but it's still a problem, because we have to pay for these public IPs — paid, and rare, resources.

So we are working right now on using a NAT gateway, thanks to a lot of work from Microsoft and from people outside, like the MVP we found describing this. Even though, if you tell the cluster to switch its outbound type from the load balancer with public IPs, as it is today, to a custom managed NAT gateway, the cluster would be destroyed and recreated — which we don't want — if you instead associate the NAT gateway with the subnet used by the cluster, the cluster is not aware that it needs to use a NAT gateway, but the NAT gateway takes precedence at the routing level. And the good thing is: if the NAT gateway infrastructure at Microsoft suffers any outage, then the cluster uses the load balancer again during the outage. So we have a highly available outbound system, in theory, and we are testing this.

Current status: the private cluster, privatek8s, is now using the NAT gateway. We tried the private cluster only, first, just to be sure we didn't break everything — a good thing, because, as Stéphane can testify, we broke infra.ci for 15-20 minutes. Everything worked as expected, but since we have API protection on Kubernetes — which restricts the IPs that can access the control plane to spin up pods, get logs, or do whatever with Kubernetes — using the NAT gateway effectively changed the outbound IP of our resources, meaning infra.ci wasn't able to manage Kubernetes anymore: neither our requests from the pods, nor the cluster itself. No more autoscaler, no checks, nothing — we killed ourselves, by the way. Lesson learned: we retried from scratch, disassociated the NAT gateway, validated through the metrics in Azure that we were back to the previous state, allow-listed the NAT gateway, and then tested enabling it again — and it worked flawlessly. So now, still to do: the public cluster, publick8s.
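For reference, a hedged sketch of that approach with the az CLI — resource names are placeholders, and our actual change goes through Terraform — showing both the subnet association and the allow-listing step that we initially missed:

```bash
# Placeholder names; the real configuration is managed in Terraform.
RG=prod-network

az network public-ip create -g "$RG" -n natgw-ip --sku Standard
az network nat gateway create -g "$RG" -n outbound-natgw \
  --public-ip-addresses natgw-ip --idle-timeout 4

# Associating the gateway at the subnet level: outbound traffic now goes
# through it, without the AKS cluster being recreated.
az network vnet subnet update -g "$RG" \
  --vnet-name privatek8s-vnet -n privatek8s-subnet \
  --nat-gateway outbound-natgw

# Lesson learned: the gateway IP becomes the apparent source IP of our own
# requests, so it must be allowed on the API server or we lock ourselves out.
NATIP=$(az network public-ip show -g "$RG" -n natgw-ip --query ipAddress -o tsv)
az aks update -g "$RG" -n privatek8s \
  --api-server-authorized-ip-ranges "${NATIP}/32,<existing-admin-ranges>"
```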
— Could you — sorry, for my education — one more time: what was the magical difference between it working and not working? — Allowing the public IP of the NAT gateway, which is the apparent IP of the requests made to the Kubernetes API: we need to add this IP, or this collection of IPs, to the control plane allow-list of AKS. — OK, so you had to teach the AKS control plane that requests coming from the NAT gateway are valid requests to the control plane. I see, OK, thank you.

So now publick8s remains, and the Azure Terraform job is currently disabled — please don't enable it; at the end of the meeting we will finish fixing this element. Is there any question on that topic? — So, no risk from this for the security release tomorrow? — This has no expected impact. — No expected impact, great. — It's because release.ci is able to reach the internet and is already allow-listed, and trusted.ci doesn't use these two particular NAT gateways at all. — Great, thank you.

Next one: agents are not spawning on infra.ci. Stéphane, can you give us a status summary on this one? — Yes. I had to open a ticket with Azure, because when I saw this, I understood that the ARM node pool was at zero: no new pods could be spawned. As soon as I manually spawned one node, the autoscaler triggered, spawned multiple nodes, up to 10, and was able to process all the pod creations. That means that when the pool is at zero, it's not able to spawn the first node, but beyond that it works fine. For me, that means the configuration is good, but there's a problem with scaling from zero. The first answer I got from the Azure people was: "the best way is to start from one; it's recommended to start the scaling at one to ensure proper functioning of the autoscaling mechanism." Of course, that's not something we want, because it means paying for at least one node all the time. If we really used at least one node all the time, that would be OK, but I'm not quite sure we do. So I'm still asking questions on the Azure ticket; their team answered an hour ago that they had the same kind of problem with spot instances starting the autoscaling at zero. So maybe we need to dig around that, although I don't think we use spot instances for the nodes in the ARM pool — I don't remember that, but I'm not quite sure. Worth checking. — A solution could be to create a new cluster, only for that, on the new subscription, where spot is not used; at least we'd be sure it's not spot. — Yes. For now I'm waiting for an answer from Azure. — OK, any other comment on this one? So that means it's mitigated, and we are waiting for the root cause; but we might need to find a solution to avoid spending too much. — It's not fixed: I made the change to one in the UI, not as code, so as soon as Terraform ran again, it went back to zero. So maybe it's back to zero now; maybe I need to put one as code to make sure. — Given the billing constraint... — Yes, which means we have to monitor it: we might see slowness on infra.ci, and in that case we have to temporarily scale up the node count. So please, folks, tell us — open issues. — OK, thanks for the reminder, Stéphane, because that one is annoying.
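For reference, the stopgap applied in the UI would look like this with the az CLI — cluster and pool names are hypothetical, and the durable version would be a min_count change in Terraform so that the next apply doesn't revert it:

```bash
# Placeholder names; without the matching Terraform change,
# the next terraform apply reverts the minimum back to zero.
az aks nodepool update \
  --resource-group infra-ci --cluster-name privatek8s \
  --name arm64agents \
  --update-cluster-autoscaler --min-count 1 --max-count 10
```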
One alternative solution here — I'm thinking out loud — could be the migration of infra.ci to ARM64 I talked about, but it's not a good solution on its own, because we decided to separate the node pools where we run the controller and the agents; it's not the same set of virtual machines, so the problem would still stand. So if we don't have an easy solution, Stéphane, we may have to create a new cluster only for infra.ci in the new subscription. — Or would the solution be to move everything to the new subscription, agents and infra? — Yes and no: we want to keep the sensitive controller separated from the agents, because Kubernetes is the worst for security and network isolation. — But we can create two node pools in the new subscription, one for infra.ci and the other one for the agents, like we want to do here. — Yes, but in any case you have to start by creating a cluster. So if you create a cluster with a node pool that can scale to zero, then move the agent workloads, then we can see what the next problem is. From the billing point of view, we pay for the infra.ci machines today and we will next month; but for the agents, if we can move them, we won't have to pay for the one minimum node, and we can solve the problem that way. — Yes. — Great. OK, so that one is the next priority, because it can have an impact on billing. — There are a lot of priorities... — Yes, but if we don't have money to pay the bill next month... I mean, we need to act, and swiftly. So I propose we do a temporary checkup and a tweak. Stéphane and Hervé, I might need your help to focus on this one, if needed, while I'm gone. — OK.

Next one: unexpected delays building a small plugin on Linux agents. Hervé, I believe you started to work on and diagnose this one? — Yes. I confirmed that there is an issue with the DigitalOcean ACP provider: I reproduced the long build times, and the short times with the ACP disabled, and there is no issue with the Azure and AWS providers. I proposed that we start the Kubernetes 1.27 upgrade on DigitalOcean, since it will recycle every node. We can also disable this provider in the ci.jenkins.io configuration meanwhile, so that part of the plugin builds would not use the ACP but query JFrog directly. I don't know, it's a possibility — what do you think, Mark? Because I remember you added a message about wanting the bandwidth numbers to look good for JFrog in January. — Yeah. I think I'm OK with us just living with the DigitalOcean ACP being slow, but it is interesting that it would only be on DigitalOcean that we would disable the ACP, right? Because we're not seeing the same behavior on Azure and AWS. — Yeah. — I haven't said much on this issue — more in writing. While I find it distracting right now, for me it is not our top priority: we've got to worry more about Azure cost management, and about other things, so if this one just needs to wait, I think that's OK. We certainly have a workaround, and Hervé noted it: I can disable the ACP in any Jenkins job with a single argument in the Jenkinsfile, so I can disable the ACP at any time if it gets in my way.

— The reason, I believe, is the way managed Kubernetes clusters on DigitalOcean are handled when approaching one month before deprecation. I have a gut feeling it's related to their storage system: we already had this twice in the past 18 months, and when we get close to the deprecation of the Kubernetes version we are running, it looks like their storage system deprecates some APIs silently, step by step, leading to something that still works, but slower. That's also why the Kubernetes 1.27 upgrade could help determine whether I'm correct — and we have to do it before the end of February for DigitalOcean in any case.
The CSI persistent volume drivers are in play there. There could be a solution right now that consists of tainting and draining the existing nodes to force the autoscaler to create two new machines, even on the current version; that would have the same effect. But if they have deprecated some CSI storage APIs on their storage system, that won't change anything. That's why the Kubernetes 1.27 upgrade would be a single operation, after which we'll either see a different behavior or not. Alternatively — something Hervé proposed when we started the ACP project — we could get rid of the Kubernetes cluster we use on DigitalOcean for this. The ACP is specific enough in how it works and uses system resources that it should not run on Kubernetes; it's not made for this. So maybe switching to one or two virtual machines could be an alternative solution. That's something we already explored in terms of billing, because we saw that we cannot scale to zero on Kubernetes for DigitalOcean, and we had mentioned and drafted something around using virtual machine agents instead of containers for the Jenkins controller there; this would be the same idea. I mean, Kubernetes is something, but it's not a golden bullet, and here — between the performance issues we saw when we reached a certain threshold of parallel requests, the way we manage the ACP, and this kind of recurring problem — we could also draft a conclusion that Kubernetes is not made for running that kind of workload. Something to think about in the upcoming months.
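If we try the node-recycling experiment on the current version, it would look roughly like this — a sketch with a hypothetical node name, where cordoning plays the role of the taint mentioned above:

```bash
# Hypothetical node name on the DigitalOcean cluster.
NODE=pool-acp-node-1

kubectl cordon "$NODE"    # mark the node unschedulable
kubectl drain  "$NODE" --ignore-daemonsets --delete-emptydir-data

# Pods get rescheduled; once the old node is removed, the autoscaler
# provisions fresh machines, on freshly provisioned underlying storage.
```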
— Those all sound like potential longer-term solutions; that feels reasonable to me. In terms of guidance for the short term: if we disable the artifact caching proxy on DigitalOcean, any guess on what fraction of our total jobs go to the DigitalOcean Kubernetes cluster? — I am looking at the data to see the proportion of bandwidth consumed by each provider. — OK. I'm not sure it's worth the investigation; I just know that that simple plugin, that tiny plugin I was working with, did choose DigitalOcean, and I assume that's because it was configured to use container agents, like we prefer. — I didn't see the previous issue, but I think the DigitalOcean provider was also implicated in an earlier comment from Basil: on one or two issues he commented that the ACP seemed related. — I would vote for Hervé's initial proposal: let's disable the DigitalOcean agents for the upcoming two weeks. It's an easy and quick thing — a one-line change, one "enabled: true" to set to false. Once deployed to ci.jenkins.io, it will only spin up containers on AWS (and Azure) instead of DigitalOcean and AWS. That will increase the AWS bill a bit for the time being, but it ensures good performance, and it lets us focus on other tasks, because, as Mark said, even chasing the cause is... "wasting time" is not a good word — it's useful, but it's not the right moment to do it. — Right, and that, I think, is the more crucial thing: we have a short-term need to be sure we control Azure expenses, and spending a little extra on AWS right now, by not spending on DigitalOcean for the next few weeks, is a good compromise. That feels like a very good compromise. My workaround was just setting the pipeline argument to use JFrog instead of the ACP; it doesn't disable the use of DigitalOcean agents, right? — Right, but the problem then is that it increases the bandwidth at JFrog, and I worry — I don't want them to come back in February and say "oh, your bandwidth was high in January". — Exactly. — I just don't want that disruption in February if we can avoid it. — Yeah, sure, OK. So it will be disabling the agents altogether, then. Great. — Yes. — OK, sorry, I wasn't sure we were on the same page. — No problem, you did good to ask. — May I ask you to send that pull request, Hervé? It's a quick and easy one, and then we can move this subject to the backlog after FOSDEM and put all of this on hold. Is there any other question, folks, on that topic? OK, then, Stéphane, your turn.

We have "infra.ci migration to ARM64", for the Azure billing, but slowly — that's a long-term task. What's the status on this one? — Yes. I moved the two container images — not template images, content images — that are not ARM64-compatible: I changed them to use the agent label of the all-in-one image instead, which is ARM64-compatible. So those consumers are now using it: all the Terraform projects that we run on infra.ci, and updatecli, which is used a lot — they are now using the all-in-one image, on ARM64. And now I've started working on the builder image, which is used to build jenkins.io and the reports, if I remember correctly; this one is also used on ci.jenkins.io. So, first step: migrate its usage on infra.ci to the all-in-one image, then to the all-in-one on ARM64 — maybe I will do it directly on ARM64 — and then I will have to work on ci.jenkins.io to provide the new agent and change the pipelines using the builder image there. I'm sorry, in private I speak way too fast. — No problem, I was buffering; you're good enough for me to take notes, really. Nice job. The hidden gain for the team is that, with those first two images now archived, that's far fewer pull requests to review and updates to deploy. And that was also how we migrated to Terraform: 1.6 is now used instead of 1.1, so it was a good opportunity to do that migration, which was clearly needed. Fewer PRs to review. Stéphane, is that a correct summary? — Yes, I think the issue sums it up.

— So, be careful to properly synchronize with the documentation team and the usual contributors to the websites when you work on the Docker builder image. It's low priority, so there should be no action in the coming days, but do mention to Mark — we know — Alex, Kevin, yes, the usual suspects, that the pull requests on jenkins.io, on the website, might have issues during the preview site generation; or if you see slowness (except tomorrow, of course, because that will be the security release) in deploying a new version of jenkins.io to production, that could be an unwanted side effect of this task. — Yeah, because if we use the ARM64 one, that could also hit the problem of the node pool not triggering... — Oh yeah, good point, that too; that can happen because we switched to ARM64.

Hervé, two topics that you work on: "blobxfer command line replaced by azcopy" and "update jenkins.io to another cloud". Can you give us — are we running over time? we might not, no need to delay — could you give us a summary? — Yes. I'm currently adding a stored access policy to the file share, which we can pass as a parameter to the az CLI command that generates the SAS token. Doing so gives us a sure way of revoking SAS tokens, by expiring the related stored access policy; I have a PR open for that. Then I have to add a service principal application in the private Terraform configuration, to be able to create a service principal credential so that ci.jenkins.io can manipulate the contributors.jenkins.io file share: logging in with az using the service principal credential, then using that az authentication with azcopy to manipulate the file share. I'm using contributors.jenkins.io as the first target, as it is less critical than the other services like plugins, javadoc, or the rest. When this is ready, I will be able to apply it to the other file shares and also — more importantly, I think — to the updates.jenkins.io file share. With the stored access policy plus a short-lived SAS token, which we can set to expire in one hour or less depending on the job, we should be OK security-wise — since we also confirmed that we can't revoke existing SAS tokens if they are not tied to a stored access policy.
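A minimal sketch of that revocation mechanism with the az CLI — the account, share, and policy names are hypothetical, and the real change lives in Terraform and in the jobs themselves:

```bash
# Hypothetical names, for illustration only.
ACCOUNT=contributorsjenkinsio
SHARE=contributors

# 1. A stored access policy on the file share: expiring or deleting it
#    later invalidates every SAS token issued against it.
az storage share policy create \
  --account-name "$ACCOUNT" --share-name "$SHARE" \
  --name ci-writer --permissions rwdl \
  --expiry "$(date -u -d '+1 hour' '+%Y-%m-%dT%H:%MZ')"

# 2. A short-lived SAS token bound to that policy.
SAS=$(az storage share generate-sas \
  --account-name "$ACCOUNT" --name "$SHARE" \
  --policy-name ci-writer -o tsv)

# 3. azcopy then uses the token to manipulate the share.
azcopy copy ./site-output \
  "https://${ACCOUNT}.file.core.windows.net/${SHARE}?${SAS}" --recursive
```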
Thanks. As a reminder, that one will also have an impact on the plugin updates from the update center, which run regularly on the packaging machine and write to one of these storage accounts — same for the core releases. That's why, by starting with the smaller services, we make sure we gain the experience and see how it behaves before going to the more critical workloads. Thanks, Hervé. So, for updates.jenkins.io, does that mean the new proof of concept, the one not yet used in production, should be one of the next in line to test this mechanism as well — the stored access policy plus the short-lived azcopy token? — Yes, and of course the storage that Stéphane is going to create might also be a candidate, but that's decorrelated for now. Stéphane, can you please take care of adding the policy — maybe not in the first iteration, that remains to be discussed, but creating the policy should be good. I'll let you sync with Hervé when you create the module, to see whether the policy works there, or whether you delay the creation of the policy until later, as one of the last steps of the blobxfer migration. — Yes, I will look at that; my open question to Azure is whether I can add a policy to an existing file share without any perturbation. — And don't forget: you have some Puppet work to prepare, installing azcopy on the pkg machine. — Yes. Not for contributors or javadoc — that will be later — but it's a requirement for updates, for get.jenkins.io, and for the core releases. — Any other question on these two topics? As a reminder, the migration of the update center is delayed to February: time for the security team to be OK, and for us to have decreased our Azure bill.

We have an outgoing issue, "incorrect email while registering" — I don't remember the status of this one. OK: they never answered, so I propose we close it as not planned, since we never got an answer. OK, let me update the board. Then, what do we have...

Uplink: we had issues on the Uplink system — not blocking; it's a tiny-step-by-tiny-step issue. Some records in the database for Uplink are corrupted. I was able to find one and delete it, so now the download goes further, but it looks like there is at least one more corrupted record, and each time finding it requires a dichotomy. So yes, work in progress: one corrupted record deleted from the table, searching for the next one. I'm trying to record in the issue where we are at each step. I tried automating the search on a copy with far less data: the dozen or so scripts I tried are all flawed and unable to handle the amount of data we have there; they either hit timeouts or, worse, generate invalid SQL requests most of the time.
That's because of how we have configured the database for that application. I tried — I listened to Hervé, and clearly his suggestion would have helped, but I'm sorry, I wasn't able to make it work. So: one corrupted record at a time, iterating. — This one is not causing serious production problems? It's more of a... — Right. OK, thanks. — That's why it's tiny step by tiny step; no need to spend too much time on this. It's just that the dichotomy runs take six hours for the first steps, and at the end it's 10 minutes. So, tiny steps.

I've opened an issue because, thanks to Basil, we got some cleanup of the way we spin up agents, to remove the deprecated JNLP arguments. He already did most of the work; it's a matter of us updating, and he gave details in a sub-issue showing the changes we would want. The goal is to update our Puppet configuration to make sure our controllers spin up agents with the new form of the arguments, to avoid the warning and deprecation messages, and to avoid breaking agent spin-up the day a version of the remoting JAR ships without support for these old deprecated arguments. No emergency. I propose we wait at least one or two milestones on this one, because it looks easy, but once deployed, it means we have to check, for each of the three controllers, that they can properly spin up all kinds of agents — and we have to test all kinds exhaustively, otherwise we don't push the change. So I propose to delay this one for later.
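For context, the direction of that cleanup — hedged, since the exact flags depend on the remoting version we end up targeting, and the host names below are placeholders — is moving from the deprecated -jnlpUrl launch form to the explicit -url/-name form:

```bash
# Old, deprecated launch form (placeholder URLs; the real commands live in Puppet):
java -jar agent.jar \
  -jnlpUrl https://ci.example.org/computer/agent-1/slave-agent.jnlp \
  -secret @/path/to/secret-file

# New form with explicit arguments, which newer remoting versions expect:
java -jar agent.jar \
  -url https://ci.example.org/ \
  -name agent-1 \
  -secret @/path/to/secret-file \
  -workDir /home/jenkins/agent
```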
Same thing for the question from Alex about revoking OpenVPN certificates: I'm not sure about the process here. I will take the opportunity to discuss it with Olivier Vernin, since he wrote the whole OpenVPN stack, but I'm not going to spend time on it now. Alex's request is legit — it would be good for us to be able to revoke existing certificates — but it needs some investigation: it's technically possible, but I don't know if it's easy to do. So I propose to delay it for later, unless someone has an objection. One, two, three — no. OK.

Olivier, can you give us a summary on docs.jenkins.io? — No progress since last week; I prioritized other issues. I still have to start, beginning with writing the runbooks for docs.jenkins.io. — Given that, should we consider dropping this one from the current milestone? Just admitting that right now Vandit is unavailable, busy with university exams, and waiting another week or two to do this is, I think, a safe compromise. Vandit won't work on it during exams, and Kris Stern, I think, is fine if we delay as well. Kevin is actually able to do his testing without this, so his validation is not blocked by it, as far as I know. So I don't see any compelling reason to say we must do this now; it's great that everybody started it, but I think we could safely say: let's delay it. — It's fine for me. — Is that OK for you as well? — Fine too. — OK. Because in terms of impact on the project, that one is actually lower impact than the next one on the list, and I'm not even sure we should spend time on the next one between now and FOSDEM. — OK, I can already write "delay for later", right? — Yes. Just because, if we find a solution for it, great, but the Azure cost controls are certainly more important than this.

Then we have two items. The first one: migration leftovers from publick8s to ARM64 — we still have systems, such as the ACP on the public cluster, still on AMD64. And just for the person rewriting the Markdown: it's four spaces for a sublist; that part is Markdown, the rest of those elements are not Markdown-compliant. — I was just following the same indentation that you had put in... — Yeah, it comes from the template in the .md file; I need to fix the template, not just the issue. But don't spend too much time on it; the next item has a similar problem.

So, we were able to work on LDAP. We had to update the LDAP system, because it hadn't been updated in production for at least 12 months, so the image was an old one. This was done thanks to the work that both Stéphane and Hervé did on the pipeline library used for building images — thanks to the Docker Bake support — because this repository produces two different images. The good thing is that these two images must share the same lifecycle: they must be released and deployed at the same time, with the same version. So it's not the same problem Hervé faced with weekly.ci and infra.ci, which have different lifecycles, but we still required the support for having those two images. So the first step was that migration. Now we are blocked on migrating LDAP to ARM64, because ARM64 requires running in a different availability zone, and the problem is that the current storage — which is a partial answer to the question Hervé raised earlier today — the live storage used by LDAP is a disk (otherwise there wouldn't be any availability zone problem), and it sits in a different zone than the ARM64 nodes. So we would need to do the same operation Stéphane did for weekly.ci: create a snapshot — a zone-redundant (ZRS) snapshot — of the current data, and migrate LDAP. Or use the LDAP backup mechanism: I believe there is a storage account where, each time LDAP stops, a backup is made of the mount point — most probably stored on a file share, not a disk. So maybe we don't need to migrate the live data; instead we could simply stop LDAP properly, migrate it to ARM64, and restore the disk from the backup, and that should be fine. So I propose that we delay this one until after FOSDEM.
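If we go the snapshot route, a rough az CLI sketch — disk and resource group names are hypothetical — of how a ZRS snapshot lets the disk be re-created in the availability zone where the ARM64 nodes run:

```bash
# Hypothetical names, for illustration only.
RG=prod-ldap

# Zone-redundant snapshot of the live LDAP data disk.
az snapshot create -g "$RG" -n ldap-data-snap \
  --source ldap-data-disk --sku Standard_ZRS

# Hydrate a new disk from the snapshot, pinned to the target zone.
az disk create -g "$RG" -n ldap-data-disk-arm64 \
  --source ldap-data-snap --zone 3
```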
And the last item on the list, Stéphane and Hervé: exporting the download mirror list to a textual representation, meaning having a kind of lightweight public API providing data about the infrastructure publicly. What's the status? — No progress, except that we now have a clear goal for this issue: a representation of only this mirror list, as simple as possible for now, keeping in mind that it will be expanded with other information later. — I believe we can remove it from the upcoming milestone; that's a bonus. — I would keep it as... I mean, I know what's in there. — Yes, but if you are able to spend some time on this, with a low priority... — No, no, I propose we delay it, because if you are able to spend some time, it's better spent on the other Azure-related elements; that's what I'm saying. And I tend to agree with Damien's conclusion that it's fine to leave it aside, even though it's interesting and fun as it is. I mean, there is already the blobxfer-to-azcopy work, which is important, and most important of all there is supporting Stéphane on the creation of the new get.jenkins.io storage, et cetera. We have to decrease the number of tasks.

— Do we have another problem that was added and not yet triaged? — No, we have no new problems. — So it's open time for you, folks, if you want to discuss topics that could be prioritized, or to mention them, even if it means delegating or delaying them until after FOSDEM — but let's mention them. — Yes: we mentioned a bit earlier that we could create several GitHub Apps to avoid the API rate limits — one for each main job on infra.ci, for example one for the Kubernetes deployments, one for the reports, and so on. — Yes, and to go faster: each installation of a GitHub App has its own rate limit. An App installed on two different organizations corresponds to two rate limits, and two Apps installed in the same organization correspond to two rate limits as well. This matters because we get blocked roughly once a week on infra.ci: each time there is a configuration reload or something like that, all the jobs are blocked until the rate-limit window resets. This one would be quick — I'm not sure we should take time on it immediately, but it means we can be blocked during important production operations. It's not a danger for tomorrow, of course; but the day after tomorrow, once we have released and are deploying the new image, the new version, on infra.ci, we could hit the GitHub App rate limit again, as happens every time we deploy a new version. So it means we shouldn't plan a critical operation until infra.ci is back to normal: an update can take 10 minutes when everything goes well, and up to one hour otherwise. It's not a danger for the update center migration, because if we hit the rate limits, I — or Stéphane, or Hervé — can always run the Terraform locally and then apply the code; so it's not a danger for February. For Thursday and Friday I don't know, and for next week, folks, I'll let you decide: if you think it could block you, it's worth writing it down in the issue. We'll see whether we hit the limit or not.
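As a side note on how those budgets behave — a hedged sketch using the public GitHub REST API, where the token variable is a placeholder for an App installation token — each installation token draws from its own quota, which is why splitting the jobs across several Apps raises the effective ceiling:

```bash
# Placeholder: an installation access token minted for one GitHub App installation.
TOKEN=ghs_xxx

# Each installation reports its own independent budget.
curl -s -H "Authorization: Bearer ${TOKEN}" \
  https://api.github.com/rate_limit | jq '.resources.core'
```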
— Speaking of rate limits: I don't remember, did we talk about the rate-limit issue we had with Docker, for the Docker pulls? — Oh, that's true. The last batch of agent releases was incomplete, with 429 errors on a lot of the Docker requests. It was down to a configuration on their side: the access they offer us, excluded from the consumer rate limits, had been temporarily removed from the Docker application for our organization — as I understood it, something went a bit sideways around the addition of a new feature, free for us, from Docker — but they reverted the changes, and everything is now back behind the Docker application for the sponsorship. — Yes, and I got good news on this: Docker confirmed we still have an open-source subscription with them, which is really good news for us — we need Docker to continue sponsoring us — and there is a new feature as well. — Yes, and that new feature looks like it could be interesting for us. So yes, there are still one or two things in flight with them. — Another security scanner? — No, not that, not that. What I want to avoid is the republication of an image when there was no update of what's inside: the scanning allows analyzing finely each package and tool installed in the image, so we can see, for example, whether the OS tools were updated or not — because we periodically rebuild our images to get the latest of, for example, the apt packages where we don't pin a version. So having something that can say "no, there is absolutely no difference between this and the previous rebuild, so there is no need to republish a new periodic rebuild" — that would be nice. — Good, thanks for that. — And, relatedly, later, when I have time, I will also look at the suggestions about publishing the Docker images on the proposed registry in addition to Docker Hub. We'll see.

Well, I think we have reached the end. Are there any other topics you folks want to raise? OK, then I will stop sharing and stop the recording. So, for the people watching us: see you in two weeks — no, next week, sorry — well, for me, in three weeks. Bye everyone!