Hi everybody, welcome to this new Jenkins infrastructure team meeting. This one is a little bit special because of the major outage we had last week. I'm going to give you a quick overview of the different components that were involved in the outage and what they do, and then we'll discuss what we can do to prevent such a thing from happening again.

The first thing to understand is that the way the Jenkins project distributes packages involves four different services. The first one is the update center, updates.jenkins.io, which distributes plugin versions. When you update plugins or Jenkins core from your Jenkins instance, this is the service you are reaching. Its content is regenerated every three minutes: it contains the list of every plugin you can install for your Jenkins version. When you actually install a specific plugin, you are redirected to the mirroring service, which I will explain in a bit.

The next major service is pkg.jenkins.io. That service hosts the distribution packages, so when you want to install, for instance, the Debian or Red Hat package, you just run apt-get install jenkins, and the repository instructions are located on pkg.jenkins.io, together with the metadata you need to install Jenkins on your server. Same as with the update center: when you need to download the actual package, it is not downloaded from pkg.jenkins.io, it is downloaded from the mirroring service.

Then we arrive at the two mirroring services that you know. The first one is the deprecated one, the old one, mirrors.jenkins.io. That service was deprecated over the last summer, because it relies on an application called MirrorBrain, which has not been maintained in a while and still uses Python 2.7.
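To make the update center flow above concrete: the update-center.json payload that updates.jenkins.io serves is JSON wrapped in an updateCenter.post(...) callback, so a consumer first strips that wrapper before parsing. Here is a minimal sketch of that step; the sample payload is made up for illustration, and only its overall shape (a core entry plus a plugins map) reflects the real service.

```python
import json

def strip_jsonp(payload: str) -> dict:
    """Remove the updateCenter.post(...) wrapper and parse the JSON body."""
    start = payload.index("{")        # first brace of the JSON object
    end = payload.rindex("}") + 1     # last brace of the JSON object
    return json.loads(payload[start:end])

# Made-up sample in the general shape of an update center payload.
sample = ('updateCenter.post({"core": {"version": "2.263"}, '
          '"plugins": {"git": {"version": "4.4.5"}}});')

data = strip_jsonp(sample)
print(data["core"]["version"])   # prints 2.263
print(sorted(data["plugins"]))   # prints ['git']
```

A real client would fetch the payload over HTTPS and validate its signature; this sketch only shows the unwrapping.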
We had issues scaling MirrorBrain, so we decided to replace that service with the one we had issues with last week, which is get.jenkins.io. get.jenkins.io relies on two things: a release database (Redis) that stores the file indexes, that is, a list of hashes for every file available for download; and obviously the files that we are distributing themselves, which are stored on an Azure file storage.

So basically what happened last week is that the volume backing the release database crashed. We had issues mounting it on the release server, and because the release database was not available, it forced get.jenkins.io to restart. We fixed the release database by deploying it on a managed service, because for some reason we were not able to access the Azure disk used by the database. But after that we were not able to use the Azure file storage used by get.jenkins.io, so Mirrorbits had no way to know which packages could be distributed to users. And obviously, because the mirroring was broken, the update center and pkg.jenkins.io could not work correctly either. That's basically what happened to us last week.
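As a rough illustration of the index just described (a list of hashes for every downloadable file), here is a sketch of walking a file tree and mapping each relative path to its SHA-256. The function name and layout are invented for this example; Mirrorbits' actual schema in Redis is richer than this.

```python
import hashlib
import os
import tempfile

def build_hash_index(root: str) -> dict:
    """Walk a directory tree and map each relative file path to its SHA-256."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            with open(full, "rb") as fh:
                digest = hashlib.sha256(fh.read()).hexdigest()
            index[os.path.relpath(full, root)] = digest
    return index

# Demo on a throwaway directory with one fake package file.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "war"))
    with open(os.path.join(root, "war", "jenkins.war"), "wb") as fh:
        fh.write(b"not a real war file")
    index = build_hash_index(root)
    print(index)
```

The point of such an index is exactly what broke last week: if the files on the storage cannot be read, no hashes can be computed or served, and the mirror redirector has nothing to offer.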
We opened an Azure support ticket to ask why it stopped working. The get.jenkins.io service has been running since March and we had never experienced such issues, so this is really new to us, and we still don't understand why the volume is behaving like that. It took us a while to understand that the volume was broken, because some files could be retrieved and others not: typically, distribution package files like the Debian and Red Hat ones were available, but for some reason, when we tried to read the plugins information, we got timeout errors. We were not able to read the data and we received many different errors.

So what we decided to do on Saturday morning, when we realized that we would not find a way to restore correct access to that Azure file storage, was to reuse the mirrors.jenkins.io machine, the same machine that already has every file that we distribute in our infrastructure. We just redeployed Mirrorbits on that machine, and that's where we are now: the service is back and working again. The problem is that we usually use an automated workflow to update the application, and in this case we just manually deployed Mirrorbits on that server. The good thing is that the machine running Mirrorbits at the moment is big enough to handle the load. But we are still wondering, if and when we are able to use Azure file storage again, whether it's the right service to rely on. We still have an open Azure ticket at the moment and we are still discussing with Azure support to understand what's happening. Any questions before I continue?

To clarify: the problem is still ongoing and we still haven't heard from Azure support in any useful way? Yes, exactly, that's the point. We had a call with Azure today, so I spent the afternoon with them trying to identify and replicate the issue. As I said, it's really weird: Azure file storage is a CIFS volume, so I can
mount it on my machine, I can list some files, I can modify and create files, but the /plugins directory does not work: if I try to access it, I receive an error. Yet if I go to the portal, the Azure web interface, I can create files in the /plugins directory there. So it's a really weird issue, and that's something we're still trying to understand. Did I answer your question, Daniel? Yes, thank you.

So right now the service is still working, which is a good thing. The question is: could we have prevented this from happening? I don't think so, because we had issues with network storage both for the Redis volume and for the get.jenkins.io file storage volume. We don't manage those services; they are provided by our Azure account. But what I think we could have done better is, first, to communicate about the outage. I was in communication with a lot of people over the last week, and over the weekend as well, about the current state, why we were having these issues, and so on, and it turned out that Twitter was the main support channel for that. Maybe not the right one, but I think we would really like a central place for people to understand the current situation.

So one of the things I've been wondering about is having a status page for the Jenkins project. I already shared this previously, but there is a project that could allow us to provide information to a status page by writing Markdown. We could say: OK, we have a planned maintenance coming, we have current issues, and if you have any question, feel free to look at those locations. If you want, I can do a quick demo of what I was able to provision this morning. Any suggestions on this topic?

It's not clear to me why we would do that. I mean, it appears more professional, but Twitter has a very low barrier to
actually post something. As long as we have anyone available with access to the Twitter account (for example during the EU night, when you're offline and Tim is offline), as long as we have one of the social media folks around, we can say: hey, we have infra problems, could you post a tweet? Nobody needs to know how to use anything else. And a self-hosted status page sounds a lot like: if we have infra trouble, the status page will be offline as well and we'll just post a tweet anyway. So yes, it's not as fancy as, you know, githubstatus.com, but I think as long as we make sure that, as soon as we confirm there's a problem, we post a tweet, that's fine; and that's probably where people look anyway. The problem with a status page is that we need to advertise it and make people aware of it; if nobody knows it exists, nobody benefits from it.

I totally agree with you that people look at Twitter first and that it's easier to just post a Twitter message. I also agree that if there is an infra outage, we'll probably be focusing on the outage instead of writing a Markdown notification, and that is something that takes time. But I had the feeling this time that I was receiving notifications everywhere. I'm using Twitter, I'm using LinkedIn, I'm using Reddit, and I received notifications by email, by IRC, by Twitter. I was receiving way too much information and I could just not answer all of it. So I was wondering if a status page could let us say: OK, you know what, I publish the information in one place, just look there.

The status page project that Olivier posted earlier: you don't run it on your own infrastructure. It has a one-click deploy on Netlify, a free-deployment sort of setup, so it doesn't touch any of your other infrastructure. So, Tim and Daniel, that's a good point: we obviously need to be sure that the status page is not running in
our infrastructure, because obviously, if we are having issues with our infrastructure, it could be problematic for the status page too. The way you had phrased it sounded a lot like it would be part of our infra, and that sounds kind of silly. OK, but if that's not a problem... Still, I saw some of the tweets, and there was an announcement posted, and twenty minutes later someone posted "hey, any updates?", and I'm like, dude, what do you expect? I really doubt that this would just magically go away by us having a status page. Yeah, but then you can just link the status page rather than having to repeat yourself.

I mean, we don't have to work on the status page; I think it may be a nice-to-have, and if we set one up, it definitely doesn't need to be in our infrastructure. I just want to give you a quick overview of what I did this morning. This is mainly an example from the cState project. You just write a Markdown file, like we would do for a changelog on the Jenkins website, and you can specify a few pieces of information in the file, like tags. In this case we could say: OK, we had get.jenkins.io issues, these affected those websites, and we can provide a description of the outage with links to Twitter, the Google Groups discussions, and so on. We would also be able to list all the issues that affected a specific service in the past.

It's not a monitoring tool; it doesn't detect whether your service is down. But we could, for example, say: OK, we have a maintenance coming in the coming weeks. Let's say we know that at the beginning of December we want to do a maintenance; we could plan this in advance and announce it. So I think we can see it more as a way to communicate about the major things; obviously the idea is not to slow down our processes. But yeah, that's the suggestion I was making.

I like the idea of us being able to use it for things that
would not make it to the Twitter account; that would be a real benefit. Like, if we're doing a migration of sorts, or the work that Tim and I were doing on update center 2: that would not qualify for the Twitter account, I think, because we have no idea whether anyone cares or even notices. But ultimately some people did notice, and if we had just had a status page saying "we're tweaking our infra, please report if you notice problems", I think that would really be nice. So I can see the benefit there. I think it would be easier to get updates out, because we do have a limited number of Twitter people and time zones don't always line up, whereas here you could have a fairly liberal merge policy on an incident, because you don't really care who posts it as long as someone gets it done.

Another thing that I like with this specific status page is that it allows us to share Datadog dashboards, because right now, when I need a specific dashboard, I create it, but there is no easy-to-remember URL for it. So let me share my screen again. Sorry, are you still there? Yes, you are. You see my screen? Still black... now it shows? OK, perfect.

Let me first show you what the configuration looks like. There is one main configuration file where you define the different tags. For example, I say: OK, I have the tag archives.jenkins.io, which is running in Rackspace; I have ldap.jenkins.io; you can specify a bunch of tags and then filter for specific events like this. But you can also provide links, so we could, for example, put a link to our monitoring solution here, and you're automatically redirected to Datadog. Obviously, in this case it's not useful because it's not publicly available, but we could have, for each service, a description of what the service does, so ldap.jenkins.io is our LDAP
service. But we could also provide links to dashboards that could tell people whether things are working, any information that tells us a service is healthy. That's also something that happened to me over the weekend: because people knew that we had issues with jenkins.io, they sent a lot of requests asking for help, saying "I have a network issue". The thing is, the service was back, but because I knew that we had had an issue, I spent quite a lot of time each time investigating whether the problem was on our side or just a random network misconfiguration. So I think we could use this kind of service to provide information to the end users.

How are you thinking of deploying it? Just using Netlify, or something else? That's another point that I was planning to bring up. Netlify has an open source plan that we could leverage, and we can already use the free tier. As Tim mentioned, it takes just a minute: we define the configuration that we need and we push to Netlify. I have been using Netlify for my own projects and it works great. The free tier is just the basic features; if we want more, like analytics and stuff like that, then we have to use the open source plan. But as a first iteration I would just go with the free tier. Yeah, makes sense to me; I don't think it's much work.

So if you don't have any questions, I'll create a Jira ticket to keep track of this work. But again, this won't be a priority, just a nice-to-have, so if someone is interested in helping with this one, I will create a Git repository and we will iterate on it. Yeah, create the repo and let us know.

The next thing I've been wondering is how we could have detected this outage. This is something that I've been wondering about for a while: how we
can monitor that the latest packages are available. I actually started working on this some time ago and I just finished it. The idea is to have a Datadog custom check that looks up the latest stable and weekly releases and then pings the corresponding endpoints on get.jenkins.io. If we just published a weekly release, we assume that get.jenkins.io should be able to return the appropriate packages. The custom check is done, I think I finished it today, so I'm going to enable it. The idea is to detect an issue like the one we had with get.jenkins.io sooner, because in this case I think it took two hours before I was notified about the outage, and that's quite a lot. I think it came up on IRC first, I'm not sure. On this specific topic we really rely on user monitoring: someone complains and then we investigate, and that's not acceptable for this service.

What was weird in this case is that we had a lot of side effects from the outage. One of them was: when get.jenkins.io could not use any mirrors, it fell back to itself. The thing is, Mirrorbits does not serve the files itself, it always redirects you to mirrors, so we had deployed an Apache service next to Mirrorbits that just lets you browse and download the different files. Because all the mirrors were considered down, the full traffic was redirected to our own fallback service, so we were just literally down.

On that note: do we need to change the fallback to archives? Because I thought the OSUOSL fallback we're using right now doesn't have all the artifacts on it. That's a good point. I think it has every plugin; it doesn't have the old Jenkins versions, but it should have every plugin. I increased the limits that archives can accept, so we could use it
instead of the current mirror. To be honest, Mirrorbits is running and I just don't want to stop it right now. As long as it's running, I would first like to understand, and this brings me to the next point, whether we stick with Azure file storage, whether we consider that service reliable enough. It's been a few days and I still don't understand why it's broken. The problem is that the way we release versions relies on the Azure file storage, because we use specific tools to publish the files. So if we decide that we don't want to use Azure file storage anymore, it means that we also have to update the release scripts to push somewhere else. And the thing is, right now we are pushing a lot of things to one machine, which is pkg.jenkins.io, and ideally I would like to split the responsibilities of that machine, because right now it does way too many things, and each time we modify one component on that machine, it affects other services, like the update center, the update center back end, and some other specific scripts. So right now we are using that machine temporarily, but ideally I would like a different solution; if it's not Azure file storage, we have to think about a different target to push to.

Any questions on this topic while we are discussing the outage? Maybe you have other ideas; I'm really open to suggestions about the different ways we could have managed this outage. To me, I think it will depend on what the response from Azure is and whether we get anything useful from them.
If they basically shrug and say "we don't know what's going on", that would obviously not give me the confidence I would like in order to continue using it. Whereas if they say "yeah, we've identified the problem, it's extremely unlikely to happen again", I would be more inclined to continue. So basically I would make it dependent on what they say; and they have not yet told us anything, which, by this point, is of course a problem in itself. Yeah, on that: I was counting on support to prioritize this, but we didn't have a critical case open and those sorts of things. If you have a critical case open, you get 24/7 handling, and we didn't do that.

So is this a case where we ought to admit we're going to spend more money, purchase the Microsoft support, and stay with Azure as the lowest-cost option for now? Olivier, I think you had to purchase some support privileges, some sort of thing? So yeah, when Microsoft stopped sponsoring us, we lost access to the support; the plan we had before only gave us access to the documentation, so we could not open a support ticket. For some reason I was able to open a support ticket last Thursday regarding the Azure disk volume issue that happened with Redis, but then, when I wanted to open an Azure file storage issue, it said that we didn't have support. So I paid for the support plan, it's $100 per month, I paid it last week. Anyway, I think we will not be able to move away from Azure anytime soon, because we are still using specific Azure services and we would have to update our scripts. Just as an example, for the Redis environments we are using Azure Key Vault.
It's not a big deal if we have to switch to something else, but it means that we would have to work on that instead of working on something else. It seems like that says we ought to accept the increased cost and include that $100 a month in our budget. I think you mentioned $100 versus $1,000 a month; given the pain this caused the community, is it worth $1,000 a month? No, because right now we are spending around $8,000 per month on our Azure account. $100 is not a big amount, but I would not pay $1,000 on top of the $8,000 we already pay. So I think for now it's fine to pay the $100 per month. We are also not yet paying for our Redis database; that should be around $300 per month. I will scale down the instance, because I was in a rush and I picked a big instance just to be safe, but we don't use it fully, so I have to scale down the Redis database we are using right now. So overall I think we should increase spending by $200 or $300 per month, which is not a big deal. On that side I think it's good.

OK, so it seems like one action out of this is that we accept the increased cost and willingly take on the ongoing support payment to Microsoft. But since we are now spending more money on the Azure account, we will have to spend some time reevaluating where we can save money, because we still have a hard limit of $10,000 per month. So we have to revisit expenses to make sure we stay in budget.

Any other questions regarding this outage? Again, the service is not fully back to normal, so I'll send a communication once we know a little bit more about what happened with the Azure file storage, whether we could have prevented it, and what decisions we take about that architecture.
There were a couple of things I realized: I thought I knew how to access the cluster and could not, and I was one of the people in my time zone who should have been able to. So I'll take the action to get my education improved again and make sure we know how to do that in the US West Coast time zone, so we've got a little more coverage instead of relying on just the few of you in Europe. Yeah, something really important in this case is that, because things were not behaving normally, I had to deal with a lot of weird side effects, like timeout issues that were not supposed to be there, or containers that would not start because of mounting issues and stuff like that. So yeah, it was a tough one to diagnose. Very weird that we got two issues at once. Well, one of those issues had been going on for quite a while; it just reared its head when Mirrorbits died. But then we also had the file storage issues.

So, we are already over time; what I propose is that we cover the remaining topics next time, unless you prefer otherwise. I think I would skip the topic I had proposed, Olivier, it can wait till next week; the topic we've been covering was much more important. So then I propose to stop the meeting here and we will cover the other topics later.

Just one thing before we leave. Obviously the fact that our infrastructure was down also affected the weekly release happening today, because the release process pushes components directly to the Azure file storage, so that they are available for get.jenkins.io. And we faced a really weird issue again with the same Azure file storage: it reported a permission issue on a file, which should not happen, because it's a CIFS volume. So basically I had to modify the pod used by the release environment this afternoon to not mount the Azure file storage. If we decide to stick with the Azure file storage, that's fine, we'll have to revert my change; if not, we'll have to slightly modify the release environment.
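For context on that last change: removing the Azure file storage from the release pod essentially means dropping an azureFile volume and its mount from the pod spec. A hypothetical sketch, with invented names and paths rather than our actual manifests, showing the pieces that had to go:

```yaml
# Hypothetical release pod fragment; names and paths are illustrative only.
apiVersion: v1
kind: Pod
metadata:
  name: release
spec:
  containers:
    - name: release
      image: example.invalid/release:latest
      volumeMounts:
        - name: binary-releases       # this mount was removed during the outage
          mountPath: /srv/releases
  volumes:
    - name: binary-releases
      azureFile:                      # Azure file storage (CIFS) backed volume
        secretName: azure-storage-secret
        shareName: releases
```

With the volume gone, the release job can no longer publish to the share, which is why the fallback push to the pkg.jenkins.io machine matters.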
The really good thing is that we still have the process to automatically push new components to pkg.jenkins.io the way we were doing previously, which means we have a fallback situation: as long as get.jenkins.io running on Kubernetes is not back, as long as we don't have access to the Azure file storage, we are still able to rely on the machine running on Amazon.

So unless you have one last question, I propose to stop the meeting here. One quick question: how urgent is it to resolve the situation, or could we in theory continue as-is indefinitely, besides the fact that we now have a single point of failure for distribution? Sorry, can you repeat the question? We're currently in a fallback situation; is that something we can live with, if need be, for several weeks, or do we need to get back to a proper setup, the Azure stuff, as soon as possible? No, the current situation can work for weeks, even months, because the machine is the same machine that was used for mirrors.jenkins.io, so we know that the machine can handle the load. That's a good thing.
The service is running, it's easy to configure, that's a really good thing. The problem is that that machine is out of sync with our Puppet master, so if we decide to keep the current situation, we will have to work on the Puppet code to automatically configure that machine, because it's a virtual machine, not a Kubernetes environment, and the way we configure the service there is slightly different. And basically we need to regain access to the file storage, because Mirrorbits needs the files to know what the hashes for those files are, and based on those hashes it decides that you can be redirected to a specific URL. That's why we need a location containing every component. So, to answer the question: we could stay like this for weeks. What really scares me about manual procedures on a server is that, if the person who did the procedure leaves or is not available, you end up having to reverse-engineer what was done. So it works, but I'm not comfortable keeping this situation. OK, thank you.

So I propose to stop the meeting here. Thanks for your time, see you on IRC, and yeah, have a good day. Bye-bye.