Okay. Hello everyone, welcome to the weekly Jenkins Infrastructure team meeting. Today we are five people: we have a guest, Etienne Studer, and we have Stéphane, Hervé, Mark and I. Hello. So let's get started with the first announcement. The weekly release 2.325 has been released, and the checklists have to be completed. The main element, if I remember correctly, oh, it's written, perfect: the Plugin Manager UI, big changes to improve the user experience. I don't know if there are other major changes, Mark? A number of internal improvements that had been queued up and worked through, so lots of refactoring, lots of warning fixes, that kind of thing. Okay, thanks. So that should be available on our staging and test environments before end of day. Are there any other announcements? No? Okay, so let's start.

So the first point: welcome, Etienne, I will give you the mic. We will discuss the subject of accelerating ci.jenkins.io builds and tests, mainly by using Gradle Enterprise. Mark is driving the subject, and our guest is here to show us things.

Great, well, thank you very much. So I'll share my screen. I have a few slides that I would like to share to give you a bit of context, but I also want to give you a bit of a demo as well, and I'll try to keep an eye on the watch so that we don't run out of time. My name is Etienne, I'm from Gradle, the SVP of engineering of the Gradle Enterprise team. I have a French name, but I'm from Switzerland and I usually speak Swiss German, so please don't ask me questions in French at the end of this presentation, I might get in a bit of trouble.

So first of all, what problem are we trying to solve? Just to give you a little bit of context before we dive into the specifics of Gradle Enterprise. What we have seen for years now, and still see, is three different things, among many others. First, things take too long, right?
We make a small change, we expect quick feedback, but we wait way too long and we lose the flow. We switch context and we also accumulate more changes, more potential for failures, because things take too long and we start doing something else. Second, things take too long to fix. This goes for local developers and engineers, but also for infrastructure teams or build teams: if we don't know what is going on, what went wrong, it's really hard to fix it. First you need to know what is happening, without guesswork and without having to reproduce what hopefully will happen again in order to debug. You want to jump right to: oh, this is what happened, okay, I know how to fix this. And the third category is things that could have been prevented. Instead, you run some builds, it fails, the developer comes to you and says, yeah, of course, this happens from time to time, just rerun the build and you'll be fine. It could be a local build, but it could also be a CI build. I'm sure everybody has experienced that: we're just clicking the run button again because we know sooner or later it will pass. All of this creates inefficiency and makes us less productive. And if you add that up over a whole year for multiple developers, and it doesn't matter if it's a commercial company or an open source project, it really reduces your productivity significantly and it bumps up your wasted costs.

So with Gradle Enterprise we can do different things, and I'll show it in practice in a second. First, we can really surface what happened in your build, because we collect what happened in your build in great detail, whether it's a Bazel build, a Maven build or a Gradle build; it does not matter, it works exactly the same way. We can capture that data while the build is running, upload it to the Gradle Enterprise server, and then that data becomes available to be visualized, analyzed and so on.
And we can do this for local builds as well as for CI builds. That's an important point, because if you just have the server, then you can look at what happened in the CI builds, typically based on logs, but we can do the same for local builds, and it's not just based on logs, it's a much richer model. What that allows us to do is a fast root-cause analysis of what was going on, like I mentioned, and then we can start addressing it. You can also collaborate efficiently. If something goes wrong and I need help, for example from the infra team, and I built locally, I don't have to explain what I did exactly and what machine I was using and so on. I can just share the visualized data, and the infra team, for example, will very quickly understand what was going on, and they can focus on the solution and not on what I was trying to do, right? The same if I want to collaborate with another developer, and I'll show you in a second. And in the case of investigations, even if it's on CI, I can do some on my own. If my build is failing, and it's passing locally but not on CI, well, I can already take a look at what happened on CI in the visualization that I mentioned, and it might be: oh, a new snapshot dependency was pulled in, and that was causing my problem, right? I don't need to ping the infra team to ask for assistance in such a case, so that also saves some resources. And especially in the open source world, everybody's time is limited. Except for those doing it full time, people contribute in their spare time, and if they tackle something, they want to do it efficiently. They're not going to say, oh, I'm going to investigate this nasty issue, but I already know it's going to take so much time; they might not even start. So let me just show you that aspect. This is what we call the build scan, and you can see it here. It captures a lot of data and visualizes it.
You can also export it via an API, and it provides all kinds of details. I just picked this from an open source instance, the XWiki project, and what you can see is that I can dive all the way into, for example, dependencies, and I can quickly see what was going on. Oh, this is what I'm pulling in, it's exactly that snapshot version, and that might be the cause of my problem, right? Or I might not know, so I can just take that link and share it with others. I can take that link and say, hey, I don't know what the problem is, but my colleague might know, or maybe somebody from the infrastructure team; I just share that link, you can go there, you see exactly what I'm looking at, and you will see much more. You will see what hardware I ran it with, what switches I had on, what the actual goals were that I was running, and so on. All these things are already answered, and you can quickly dive into the details. What was the build doing? That includes testing, dependencies, which plugins were applied. I can even add my own data. And of course the log itself is also available, but that is the least interesting, because it's basically just plain text. But even here I could link, I could share, ask for help, and really collaborate efficiently with others. So that's the insights part.

Then we move on to the fast feedback part. I think everyone who does development knows that situation: you make a small change, you want to get some quick feedback, but usually you have to wait quite a long time to find out whether that change was sound or not. But if we can make the feedback cycles fast, because the build is doing less work, because it only does the work it needs to and not all the work all the time, we can run the build more often. We can make smaller changes and get quick feedback: was that a good change or not? And because it's so quick, we'll also do this more often.
And if we do things more often, this also puts less strain on CI. And not only does local development get more fluid; also when you push those changes to CI and you want them to be deployed to some testing server or staging server, if the build is faster, the change will go through the pipelines faster and, in the end, be deployed faster to a staging environment or even a production environment. And one thing that I find extremely important, especially in the open source world: if I'm a contributor and I want to make maybe just a very small change, maybe even fix a typo that I saw in the Javadoc or in a method in the API, I want to make that change as quickly as possible. So what do I do? I check out the project, and first I build it, and after I build it, I'll make the change. But if the checkout is quick and then it takes me half an hour or an hour to build the project, and I haven't even touched it yet, I'm sure many people will just stop there and say, ah, I'm not going to contribute. It's not worth fixing a few typos, or even making some small refactorings to the code, or a feature, if it takes me so long to even build the project the first time. And that's also something we can address, and I will show you a concrete example right now. So, faster feedback cycles: one way to approach it is by saying, let's not do work we've done before, and that's at the level of a goal. If I've executed the goal before with exactly the same inputs, so the same sources, for example, the same JDK version and so on, depending on the goal, of course, I can reuse the output that was produced instead of rerunning that goal. And we see this; I took the Spring Boot build as an example. I ran this locally a few days ago; I had checked it out fresh, right? And many things did not have to be run again, because they had already been built on CI.
They had already been put in the cache by CI; you can see it here, right? And what that meant is that the whole build, executing all the tasks, took two minutes. The full build is about 40 or 50 minutes. So if I look at how much I saved when I ran this locally, I saved almost three hours. The build would not actually have taken three hours, because they also run the build in parallel with multiple threads, but still, I'm going down to two minutes from something like 40 or 50 minutes. And that is a really low bar for me to say, yes, I want to contribute to that project, if I just want to make a small change and I don't want to make the big investment of running a build for an hour; maybe it even fails after an hour, just to make a small change. So I think that's an important thing to keep in mind, especially in the open source world.

Then, lower CI resource consumption; that is, in my opinion, almost a side effect, but I think it's a very important one, and I see it happening more and more that just scaling horizontally at some point becomes really expensive. So what happens is that if we skip goals, because we have executed them already with exactly the same inputs, then we save ourselves the resources of recompiling, rerunning the tests, regenerating Javadoc, or regenerating source code if we generate source code dynamically. That saves resources, right? If we can skip tests that we don't need to run, because we know these tests are not going to be affected by the change I made, we again save resources, because we just skip work. Also, if the build feedback cycles get faster, as I mentioned before, and we start running builds more often locally, instead of always just pushing to CI and saying, ah, CI is going to take care of it, we're also putting less weight on CI.
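The goal-level reuse described here can be pictured as a content-addressed lookup: hash every declared input of a goal (sources, JDK version, the goal name, and so on) and reuse the stored output on an exact match. The following is only a minimal sketch of the idea in Python, not how Gradle Enterprise is actually implemented:

```python
import hashlib
import json

def cache_key(goal: str, inputs: dict) -> str:
    # Everything that can change the goal's output goes into the key:
    # source hashes, tool versions, the goal name, relevant flags.
    payload = json.dumps({"goal": goal, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def run_goal(goal, inputs, execute, cache):
    # Skip execution entirely when an identical invocation was seen before.
    key = cache_key(goal, inputs)
    if key not in cache:
        cache[key] = execute()  # cache miss: do the real work and store it
    return cache[key]

calls = {"n": 0}
def compile_sources():
    calls["n"] += 1
    return "classes-v1"

cache = {}
inputs = {"sources": "sha256:ab12", "jdk": "11.0.14"}
first = run_goal("compile", inputs, compile_sources, cache)
second = run_goal("compile", inputs, compile_sources, cache)  # served from cache
```

With unchanged inputs the second invocation never recompiles, which is exactly the "git pull with no local changes" case described above.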
And as a nice side effect of that as well: if we start building more locally, what we push to CI will be of higher quality, because we tested it locally, or at least verified it to some degree, right? That means we will have fewer red builds on CI, and red builds need to be rerun after they're fixed, right? So we can save ourselves these extra cycles of red builds to some degree if we start building more locally and pushing higher quality changes. Another effect is making our builds more stable. If we have, for example, flaky tests, every time a flaky test runs on CI and fails, well, we have to rerun the build, typically the entire build, right? So we can also reduce hardware consumption in that sense if we make our builds more stable, so that when a test should be green, it's always green, and it's not going to fail sometimes, right? And again, I want to show an example for that part. I'm using the Spring project again; they also use Gradle Enterprise as an open source project where we sponsor an instance. And you can see here, for the Spring Boot build, all the CI builds, filtered to those over the last 28 days. And you can see they also have some flaky tests. I think every project has flaky tests; I've not seen one that didn't, but that's okay. And usually we have more flaky tests than we have time to fix. So which one should we fix? We need data to make a good decision, and that's what we get here, because we collected the test data for all the builds, so we can also see which tests are the most flaky. And I can slice and dice it as I want, right? I went by CI, by project, but I could also use other dimensions. And I can see that this one is the most flaky. It usually runs for 12 minutes. Oh, that's pretty expensive every time it runs, and it might fail even though it shouldn't. So it ran for 12 minutes just to then fail when it shouldn't fail.
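To make the "which flaky test should we fix first" decision concrete, here is a naive ranking over collected test outcomes; this is my own illustration of the idea, not the analytics Gradle Enterprise computes:

```python
from collections import defaultdict

def rank_flaky(history):
    # history: (test_name, passed) pairs collected across many builds
    # of unchanged code. A flaky test has mixed outcomes; its rate is
    # failures / total runs, and we rank the worst offenders first.
    runs, fails = defaultdict(int), defaultdict(int)
    for name, passed in history:
        runs[name] += 1
        fails[name] += 0 if passed else 1
    flaky = [(name, fails[name] / runs[name])
             for name in runs if 0 < fails[name] < runs[name]]
    return sorted(flaky, key=lambda item: item[1], reverse=True)

history = [
    ("SlowIntegrationTest", False), ("SlowIntegrationTest", True),
    ("SlowIntegrationTest", False),
    ("FastTest", True), ("FastTest", True),        # never fails: stable
    ("BrokenTest", False), ("BrokenTest", False),  # always fails: broken, not flaky
]
ranking = rank_flaky(history)  # SlowIntegrationTest ranks first
```

Consistently failing tests are deliberately excluded: they are broken, not flaky, and need a different kind of fix.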
And in the end the whole build fails. So if I want to address this problem, I will probably address this test first, because it's the most flaky of all the tests they have. I have the data, I can make an informed decision, I can make the fix, and I can then verify it. And then hopefully I see an improvement. For example, you see it was more flaky in the past, then it went down; it's still flaky sometimes, but most of the time it's green. So I can even see the trend: is it getting more or less flaky?

All right. Then, going back to how you install this; I gather the infrastructure team is also interested in how you run it. Basically, you have a Gradle Enterprise server. It's a piece of software that you install, or we host it; we can discuss that. And it comes with all these components: the build scans, that's what I showed in the visualization; the build cache; we can also run tests distributed across many machines; and we can do analysis over many builds, which is what I showed you with the test analytics, for example. And on the client side, whether you are on CI, which in that sense is also a client of Gradle Enterprise, or a developer, the build just points to Gradle Enterprise, captures the data while it's running, pushes it to Gradle Enterprise at the end, and that's it. Caching works similarly: the developer runs the build, and Gradle Enterprise checks, can I use this from the cache or not? And if so, it pulls it from the cache. On CI you not only read from the cache, you also push to the cache. Local developers typically don't push to the cache; usually that's disabled, or the permissions are not granted. So only CI pushes to the cache, but everybody can reuse it: you can reuse it in your pipeline, you can reuse it in consecutive steps of the pipeline, and developers can also benefit.
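The split just described (everyone reads the remote cache anonymously, only CI is allowed to store) is configured, for the Maven extension, in a checked-in `gradle-enterprise.xml`. The fragment below is only an approximation of the shape of that file; the server URL is a placeholder, and element details may differ between extension versions:

```xml
<gradleEnterprise>
  <server>
    <!-- placeholder URL for the Gradle Enterprise server -->
    <url>https://ge.example.org</url>
  </server>
  <buildCache>
    <remote>
      <!-- everybody may read entries anonymously -->
      <enabled>true</enabled>
      <!-- pushing stays off for developers; CI enables it and
           injects write credentials via its environment -->
      <storeEnabled>false</storeEnabled>
    </remote>
  </buildCache>
</gradleEnterprise>
```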
So for example, I come into the office in the morning, I run the build like I showed you with Spring, and it takes everything from the cache. That will basically always happen when I do a git pull and have no local changes: everything will come from the cache. If I have some changes, only the things that are not affected by those changes will be taken from the cache. So yes, I showed the build scans; I just want to emphasize that caching in Gradle Enterprise happens at the level of a goal, not the whole build, not at the level of a git commit. It's at the level of a single goal, so it's very fine-grained, and that means with every build you will get some cache hits, and the fewer changes you made, typically the more you get from the cache. Test distribution, that is of course not a component where you save resources, but if you have to run tests because something has changed, you can distribute the work across multiple agents; the results will be collected, and the build continues, of course much faster. So if you have 10 agents and each one runs 100 tests out of 1000 total tests, well, you can expect that your overall test goal will be done roughly 10 times faster.
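The ten-agents example above is essentially a scheduling problem; a greedy longest-job-first sketch (again my own illustration, not the actual Gradle Enterprise scheduler) balances wall-clock time across agents:

```python
def partition(tests, agents):
    # Assign each test (name, expected seconds), longest first,
    # to the currently least-loaded agent.
    bins = [{"load": 0.0, "tests": []} for _ in range(agents)]
    for name, secs in sorted(tests, key=lambda t: -t[1]):
        target = min(bins, key=lambda b: b["load"])
        target["tests"].append(name)
        target["load"] += secs
    return bins

durations = [30, 30, 20, 20, 10, 10, 5, 5]   # 130s if run serially
tests = [(f"test_{i}", d) for i, d in enumerate(durations)]
bins = partition(tests, 4)
slowest = max(b["load"] for b in bins)        # wall-clock time with 4 agents
```

With four agents the slowest bin ends up close to the ideal 130/4 seconds, which is why distribution pays off most when test durations are known from previous runs.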
And then we have predictive test selection. That's something we're working on; we're releasing the first preview this week, actually. The idea is: if I make a change, I don't need to run all the tests, I only need to run those tests that are very likely to be broken by what I just changed, because if I run a test that is not even affected by that change, well, I could have saved myself from running that test. And that is not based on code traversal or anything like that; it's based on historic data. We have a very high hit rate, and it comes with considerable savings, right, if we don't have to run tests unnecessarily. Of course you can conditionally turn it on and off. So if you say, well, for a release build I always want to run all the tests, yes, you can do that, but you could also say, for pull requests I don't want to run all the tests, I just want to run those likely affected by the change.

How is it deployed? It's deployed into a Kubernetes cluster, it supports horizontal scaling, and it runs at some of the biggest companies and software engineering teams in the world, you know, like Netflix and LinkedIn and several others, also banks and so on, with thousands of users. So it scales really well; that's not going to be the issue. You can also use an external database instead of the embedded database, that's also possible, and you can run multiple cache nodes. So if you have people on different continents, you can say, I have multiple cache nodes, and they can even replicate. You could say, and I'm just making this up, I push this to the cache node in the US, and then it pushes all the entries automatically to the one in Europe and the one in Asia, so when people in Europe access a cache node to get some entries, they're already there, they don't have to get them from the US. So this is pretty advanced. What would you have to do on the project?
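As said, the product's selection is driven by historic data; to illustrate just the shape of the idea, here is a deliberately naive file-to-test correlation model. All file and test names are made up, and this is nothing like the real algorithm:

```python
from collections import defaultdict

def learn(history):
    # history: (changed_files, failed_tests) observed in past builds.
    # Remember which tests have ever failed alongside a change to each file.
    model = defaultdict(set)
    for changed_files, failed_tests in history:
        for path in changed_files:
            model[path] |= set(failed_tests)
    return model

def select_tests(model, changed_files, all_tests):
    # Run only tests historically correlated with the changed files;
    # a file we have never seen before forces the full suite, to stay safe.
    selected = set()
    for path in changed_files:
        if path not in model:
            return set(all_tests)
        selected |= model[path]
    return selected

history = [({"core/http.py"}, {"test_http"}),
           ({"core/auth.py"}, {"test_login", "test_token"})]
model = learn(history)
all_tests = {"test_http", "test_login", "test_token"}
chosen = select_tests(model, {"core/auth.py"}, all_tests)  # only auth-related tests
```

The fallback to the full suite on unknown files mirrors the conditional on/off switch mentioned above: selection is a bet on history, so safety-critical builds (like releases) can simply run everything.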
I talked with Mark about using the Jenkins core project to get started. All you would have to do, just to get up and running, once the server is running, is: in the repository you reference the Gradle Enterprise Maven extension in .mvn/extensions.xml, and you add a gradle-enterprise.xml pointing to that server. You can also configure a few details about what exactly you want to capture and so on, but that's it. You check that in, and from then on every build, on CI or locally, will capture the build scan and publish it, and you get all the analytics, you get the build caching, etc. And as I mentioned, on CI you would also want to push to the cache, so whenever you build on CI, you push the artifacts of compile, test, javadoc and so on. There you would also inject the credentials to write to the build cache, while for reading you would leave anonymous access, so everybody can read the entries but not write. And that's pretty much it. But we would not just say, here's the server, or here's the license, now connect your builds and we're done; we would really help you get the most out of it. And this maybe looks a bit, I don't know, scary, but it's not. There's an installation part; then there's configuring the build to connect to Gradle Enterprise; then we would measure where we are today, though this is really more for the commercial side, but still it would be interesting to see where you are today; then we would help you optimize the build to get the most out of the build cache; and then we can measure again, we can measure the delta, and that's it. All right, that is more for the commercial side, but this is really the idea: we use the tool itself to measure its value, and it's not based on qualitative metrics, it's really quantitative metrics. But even then you would not be done, or I mean, it is not considered done, because it's like cleaning up a room: you have to clean up from time to time, otherwise you're going to have a mess again. And it's the same here. So we also have some visualizations to see where my build time is going, whether it's going up or down, how I'm using the build cache, and what the most common failures are, for example. Say somebody tells you, oh, I have an error on CI, I have this every day, and you're like, are you sure? I don't think so. Well, then you have this guesswork. With Gradle Enterprise you can find all the builds with similar failures, and it will tell you, ah, it's only on that one agent, or it's only locally for that user, or whatever it is, right? So again, we have data to make informed decisions, and if you're not sure, you can always use the links and share with your colleagues. You can also compare two builds to see why something was not taken from the cache when it should have been, and it will give you a nice diff, and from there it's usually really easy to find out, because the key is always knowing what happened; once you know, you can react to it. We have a lot of customers, but that's not relevant in this case; we also have other open source Gradle Enterprise instances running, and this is the current list. You see Spring was one of the first; the Kotlin compiler is using it; JUnit, Spock, Testcontainers, a lot of projects in the testing space; we also have Micrometer, also Hibernate, and also the Micronaut Foundation, they're using Gradle Enterprise across their 80 projects. Some use it with Gradle, some use it with Maven; it works exactly the same way. We also did it for the Jenkins project. I asked Mark, well, what would we do first, and he said we would do the core project first. So we ran an experiment: we took the project, we checked it out, we ran the full build with tests, and we tried to use the same configuration that you have, what we think is the configuration that you run on CI, and it took an hour and 12 minutes, and I can show you the build scan. And then we ran it
again, with Gradle Enterprise enabled. We added some additional annotations to make it clear for Gradle Enterprise what the inputs and outputs are, and in the best case scenario, where you run the build twice and you don't make a change in between, it goes from one hour and 12 minutes to one minute and two seconds. So again, we can measure that. Let me show you; you can see here, to finish off my part: it was using the build cache and ran only a few goals, because many came from the cache. And if we look at what came from the cache, these are all the things that came from the cache. All right, so that's what I briefly wanted to show here as well: the savings you can expect are very significant. And that's pretty much what I wanted to show you today. I'd be very happy if you're interested in proceeding from here, to try out Gradle Enterprise with one of your projects. So thanks very much.

At the end, I'm especially delighted that you ran the experiment with the Jenkins core. The 72 minutes down to two minutes is really great, that's marvelous. Now the question is about the behavior as soon as the first change arrives; that's the place where I'm most interested. All right, somebody submits a pull request to change some piece; you noted that it would take some advantage of the cache, but of course it would have to rebuild some pieces, so it's certainly not going to be two minutes. The hope is it will be somewhere between 72 minutes and two minutes, and we get benefit from the cache?

Yep, yes, exactly. Gradle Enterprise will only build what needs to be built, so if you make a change, depending on what that change is and how rippling it is, it will have to do more or less work. That's correct; it is the best case scenario, but I would still keep in mind that this best case scenario also happens quite often. Like I said, if you check out a new version from git at a commit state where you have no local changes and you build it, the best case applies. Or you run the GitHub pull request build, and then you merge it and run it again on master; it's basically the same code, you run the same thing, it can just take everything from the cache. So there are still many scenarios where the best case scenario kicks in, and even if it doesn't kick in, it's somewhere between the two minutes, yes, and the hour and 12 minutes in the case that we measured. And Gradle Enterprise will tell you: we have a performance dashboard where you see, over the last 28 days for example, how much savings you got from the cache on CI.

So, to the rest of the team: I assume ge.jenkins.io is a thing that you host, and that you could host for this prototype. Damien and everyone, I assume that would mean we would have to do something in terms of DNS to allow such a thing. What other questions does the rest of the team have? I've run out of my questions. Others?

Okay, so thanks for that presentation, it's really appealing. I also have a few questions. The first one: we have a specificity in the case of ci.jenkins.io, which is the fact that since it's a public-facing Jenkins instance, and we don't have mechanisms such as the GitHub token on GitHub Actions, we consider all credentials on that instance potentially compromised every day. Everything that is built there is pure CI, no generation of artifacts that are published somewhere; all the build-and-test-only phases are provided to developers for feedback. That means developers cannot push to the cache; that would be too dangerous for us, given the number of contributors we have there. So, about the cache generation: is it possible, and I understand from your presentation that yes, but I want to be sure I understand correctly, that we could have a private instance that runs the source code, something really private and secure, which would be the only one to push to the cache, while another Jenkins instance, the big public-facing one, could
benefit from that generated cache. Is my understanding correct?

Okay, so, Damien, just to be sure I've understood: the idea then would be that ci.jenkins.io would not push to the cache; we would have something else that would push to the cache. It may not give us as hot a cache as if ci.jenkins.io were doing it, but it avoids the risk of someone doing an attack where they submit something and poison the cache. Exactly, that's the risk here; the attack vector would clearly be this one. I understand that we should see some benefits, but will we have the same benefits as the incremental scenario that you describe? Like, we open a pull request, or a few pull requests, a lot during a single day; let's say we update the cache hourly, or maybe every six or eight hours. Are we still able to benefit when we have frequent tiny changes?

Yep, so that's a good point. What I did not go into: there is both a local and a remote cache. There's a cache you can share between users, and then there's one local to the machine. So in the worst case, if you had no remote cache, you could still benefit from a local cache when you switch branches or make updates and build again. That's something you would still have. But even if you have the remote cache on and you do the scenario that you describe, yes, you lose a little bit in the case where you open a pull request and merge it to master, because likely between those two builds the private instance will not have populated the cache. And I'm not sure how you reuse the agents; for example, if you run multiple builds on an agent before destroying it, then you can... No, we don't reuse agents at all. Our agents are ephemeral and one-shot, initially for cost reasons, but also to help with security, even if that alone is not enough; I would say safety instead. All these machines, except for really specific exceptions, are either virtual machines on the cloud or containers on Kubernetes, and it's on two different cloud providers, so we don't have network locality. So in the deployment scenario, we would need two geographical locations, one on Azure and one on AWS, so that depending on the kind of agent, we would have a cache close to the agents. Yeah, yeah. So I would expect that you don't see the same high benefits, but I would expect you still see benefits, and again, if you try it, you will be able to measure it; that's the good thing. Right now it's a bit of a guess, but it won't stay one. And of course, a developer who contributes and checks out once a day, or maybe once every few days, will still get all the benefits when they pull from git and then run the build and it gets things from the cache.

I also have a question. Since you have a lot of metrics, do you have feedback about the cost versus benefit of running your own service and cache, compared to downloading everything? Because in our case, installing the service means hosting a database, having file system block storage that will grow with the cache size, plus the multiple installations; that will cost us some, let's say, CPU cycles on both clouds. And in the end, since our goal is to decrease the build time, do you have some metrics like: if you host on Azure, it's better than Amazon, because of the cost balance between the benefits and the time spent on one hour of an AWS T3 medium? I don't know if you have such metrics, because that one is tricky to measure, and we would need at least one month of running it in a real-life scenario, not a single data point, to be able to measure whether it's worth it.

Yes, so for that I don't have an answer, because to be honest, so far everybody that's using Gradle Enterprise and doesn't host it internally is using AWS. But there we also see some differences: some use bigger machines, some use smaller machines, and there is also the question, well, do you really need the bigger machine? At some point it's pretty hard to quantify, but my
experience is also the costs are not so big that it really makes a huge difference whether you're then using A or B so and what I also often see is that yes there's also an overhead in getting things from the cache because it needs to download it it needs to extract it but when you compare it to a goal that runs for five minutes or even an hour it becomes negligible that is kind of there so there's some cost but usually it's negligible Okay I was just asking in case you already had metrics if we can totally provide what we can measure on our experiment in the case most of the infrastructure is open and transparent so if you're interested on these feedbacks we could try to measure of course in that measurement it's only pure infrastructure and eventually this ups measurement time how much we cost per hour or something but it doesn't measure the the value that you presented for the developer not having to wait even if we are not the company with all the contributor we cannot measure the developer productivity here but still it's it's really nearly scoped what I'm speaking of of course because the developer experience and they don't have to wait one hour this one is hard to value in terms of money but it's still a good value to provide so it's just I'm asking and we can provide measure because we will try that will be still a data point interesting data point to that point and just give you a rough number that we host some instances right and what they call we'll host them all in the Kubernetes cluster it's the same cluster all those 15 instances minus two spring and JetBrains hosted themselves but all the other ones we host so that's 13 instances and they cost us around a hundred thirty dollars per month per instance so it's it's not it's not a lot of yeah that's okay of course as you have more data and maybe you have much more load than the number increases but this is an average number for our instances okay but that's a good order of magnitude another question 
about the installation. Say we have a Kubernetes cluster hosting most of the persistent services — the database, the proxy, Keycloak and all these elements — and another Kubernetes cluster on another cloud provider. In our case, Azure and Amazon: we tend to separate the Kubernetes clusters physically, and on Amazon we only run the ephemeral agents where the builds happen. That means we would need at least one or two local nodes there providing a cache, or whatever metrics and probes are used. Is such a deployment possible, where some components live outside the main Kubernetes cluster that has the database?

Yes, in general that is definitely possible. I don't want to go too far because I'm not the infrastructure expert, but in such situations you are also able to run a cache node not in Kubernetes but just as a standalone Docker image; that is also possible. What I've seen is that usually the constraints came from how restrictively the Kubernetes cluster was configured, not from what Gradle Enterprise supported or not, because in the end it's just HTTP requests across networks.

I think that answers my question. One last one: is there a way to remove a specific artifact from the cache? That's a scenario we hit quite often: we generate an artifact that, for good and bad reasons, ends up being a JAR file with a size of 0 bytes, and that file is uploaded to the cache. Currently, most of the time we need to create a new release and update the whole dependency chain, because we cannot cleanly remove it from our homemade cache. Is that a feature you offer — removing specifically that version of that artifact? Because the lack of it costs a lot of developer headaches.

Yes. The way it works: in the build scan — the visualization I showed — you would see, oh, this goal ran and created a JAR that doesn't look right, and you will see what the cache key of that entry is. You can go to the cache admin UI of Gradle Enterprise, enter that cache key, say delete, and it will delete it from all the nodes.

Cool, nice feature. Those were the questions I had. I don't know, Hervé or Mark, if you have others, but thanks for your answers.

Thank you for the questions.

Etienne, you had mentioned licensing to me, and if I remember right that needs somebody with signature authority, so I'll probably have to take it to the Jenkins governance board — the infra team certainly does not have authorization to sign something on behalf of a legal entity. Is there anything you need to share on licensing, or can you and I work it separately? I think we can do it without a lot of complication. I'm now a member of the governance board, so I'll certainly take it to them.

Yes, it's more of a technicality. We see this as a collaboration — we're offering this — so we just need something for the worst-case scenario where, I don't know, you invited Microsoft to use that instance for free or something. I'm just making that up; it's really just for the worst case that we need something, because we are a commercial entity. We've done this before, so we have standard terms we can use as a basis. I think it's just a formality to work out.

Excellent, thank you, Etienne. I think that concludes it. The next step, as far as I can tell, is to engage with the governance board — or do we need a separate conversation, Damien, Hervé and Stéphane, or is it OK if I start working the discussion with the governance board while we discuss internally?

Yeah, I think we can totally start the conversation with the governance board and the developers. The goal now is to be sure that our developers see
value. If they are OK with it — on the infrastructure side, I mean — we don't mind providing a new tool, because we see it could be valuable, but we are not the people who really develop, use, and wait for their builds every day. The value for us as infrastructure is if it can help the developers, and if it can even help us a bit, that's good; the cost for us is having to manage the solution. It sounds good on paper, given your answers and the presentation you gave: we can install it on Kubernetes and eventually distribute some cache nodes. On the hosted part — I understood there is a part that might be, or is, hosted on your side for the metrics and analytics — I assume you have something on that?

No: either it will all be with you, or it will all be with us. There's no share, there's no split.

OK. And Etienne, if it would be OK with you, I'd like to plan — and it may have to be after the new year — a developer online meetup where you present much of the same things you've done for us here, but developer-centered, so that they can come and ask questions: hey, what about this, what about that? The data you gave looked like a great piece; I would ask for one or two more pieces of data along the lines of "I evaluated this pull request, it took this much extra time", so they've got that answer. Would you be OK doing an online meetup with us, probably 60 to 90 minutes: the first 45 minutes or so in presentation mode with relatively few questions, then switching to Q&A mode and high interaction? Would you be willing to do that?

Sure.

OK, great. So I will propose that and go looking for a time that works for us. We usually need at least two weeks of notice to the community, and given the end-of-year holidays, it'll probably be early January before we can do it.

That's fine. Great, thank you for the opportunity, thank you very much. Exciting — and then we'll stay in touch, I guess. Excellent, thanks, Etienne, I'll let you go. All right, thanks very much.

So, Damien, I think we still have topics to cover. Are you OK if I go back to sharing my screen, or do you want to share? OK, share — here we go.

Yes, I propose, given that we are a bit over time, that we go straight to the important part, which is the accounts.jenkins.io discussion following Log4Shell. If it's OK for everyone — I don't see any other emergency or high-priority topic outside this one — I propose we delay all the normal operations items either to later today or to the next weekly meeting, unless you have a high-priority topic outside Log4Shell.

Log4Shell is the one thing that was on my mind, and accounts.jenkins.io is the one on mine.

OK, so let's go. First of all, public thanks to Hervé, Mark, and Stéphane for the work you did — I wasn't available during last week — to assist the security team, to whom we can also give a lot of credit, and thanks for the rapid response, the analysis, and the help on all sides. Really, a big thanks. I wasn't available at all, and I wasn't worried for a minute given the work everyone did, even if it's a worrying subject. So really, many thanks to all of you for the work there.

So, Hervé, can you confirm that my understanding is correct? accounts.jenkins.io — the old application that helps manage the accounts, working directly on the LDAP database — and also beta and admin, which are two endpoints pointing to the same Keycloak system, a public one and a back-end one: these applications are still running on our Kubernetes cluster, but they don't have any ingress, which means no one can reach them except the cluster administrators. So we keep the logs of these systems, or at least their current behavior, and we can still manage them as admins, but no one can reach them since everyone worked on that. Can you confirm that I understand correctly?
accounts.jenkins.io and beta no longer have any ingress, but admin is still up, running, and accessible.

Accessible how? It's only available through the private ingress, meaning through the VPN. So it's not admin-only: for that one, it's everyone with access to the Jenkins infra VPN. It's using what we call the private ingress. Important to note, because if there were an attack vector through our private machines, we could still have had issues.

So, the status right now. From the analysis, I understand that no Jenkins system is subject to the Log4j issue, because there is no Log4j dependency in Jenkins core or in most of the plugins we use. I don't know if there has been a full scan, or a scan directly on the virtual machines, to check there isn't a transitive dependency pulled in by a plugin. Mark, did we run that analysis?

Daniel ran that analysis on ci.jenkins.io on Friday and found no issue.

That was by analyzing the project and dependency files, not on the VM like I'm suggesting, correct?

You're right, he did not scan the file system, at least as far as I know. What he did was scan the dependencies.

Understood. Just one final task, then, to be sure in the case of ci.jenkins.io: given that we have a bunch of plugins installed, some of which might not be used — better safe than sorry, even though they did a complete analysis from the top. I think there are a bunch of scanners we can run ourselves on the machine, just to be sure that under /usr/local/jenkins we don't have anything. I guess it has to be run on the Jenkins VMs, not on the Docker images, since they already did the Docker image part.

I'm pretty sure they just looked at the source code of the plugins on Jenkins — well, no: I definitely ran the mitigation from the blog post on ci.jenkins.io, and that goes inside the Groovy script console, so it's definitely looking at the running Java virtual machine.

OK. Is there a way to persist that change if we have to restart ci.jenkins.io, since we applied the mitigation?

It's not a mitigation, it's a script to check whether we are vulnerable: whether any plugin is using an affected version of Log4j — Log4Shell, in short.

OK. I may have asked this before, then: is there a document, even a private one, that one of you can share with me? Because I wasn't able to find anything about this, and I'm sure someone wrote something down.

The script to check is in the Jenkins blog post.

Yes, but I mean a working document that lists what has been done everywhere, because I had only partial information from different chat channels and issues, and I'm not sure we have a document listing all of it. I understand it was hard — I'm not saying we should have done it differently — I'm asking whether we have something like that.

Yes, and I'm happy to share it with you, Damien. We definitely kept a log all day Friday of the actions we were taking and the observations.

Thanks, absolutely. So that means all the Jenkins instances are safe, based on all the work you did, and the only candidates left would be accounts and beta. Do I understand correctly that these are the only ones — that everything else either isn't Java or doesn't have Log4j?

Yes. Keycloak isn't affected, and account-app doesn't seem to be either: I've searched for Log4j in the source code and the dependency tree and didn't find anything.

Cool. Were we able to run a scan on the containers for these three services — since they run in containers on Kubernetes — to check that the dependency is not present somewhere in the WAR as a transitive dependency? We may not have the dependency in the pom.xml, but it can come in transitively.

The assessment done from the script console does check transitive dependencies, because transitive-dependency JARs are also
loaded into Jenkins at runtime. So that's covered for Jenkins — but for Keycloak?

For Keycloak, they issued a statement saying they aren't vulnerable, and we didn't go further than that. And for account-app, I saw an analysis from Daniel showing that we're using Log4j 1.x, a version from before the infamous code that triggered the JNDI behavior was even introduced.

So account-app is so old it isn't affected — it has other problems to solve before this one. Exactly.

So based on all this feedback, there should be no reason not to put the service back online.

Daniel Beck did have one question: he was asking whether accounts.jenkins.io is hosted inside something else — a Tomcat, a WebSphere, or a GlassFish — and I had not done that analysis.

It's running in a Docker container, which is described in the repository; let me add the link to the notes. I don't remember the FROM image offhand — let me add the link.

Oh right, I think we did look at that, didn't we? Sorry.

No problem, Mark. We had looked at the Docker image that defines accounts.jenkins.io, and the FROM was an outdated version of Jetty, if I remember right.

OK, so it's running inside Jetty? I don't know that for sure — have you checked, Damien?

Yes, just right now; here is the link. Which version, I don't know, we need to check the latest builds — that should be in the notes. Oh yes, here it is... no, wait: I did note the dependency check in the notes, and there it was only Log4j 1.2, not 2.x, but the jetty:jre8 analysis we did during the meeting I don't see noted here, sorry about that. That jetty:jre8 image is, if I remember right, about two years old, so I don't know whether it's vulnerable.

OK, let me run a command with kubectl exec to check the exact version we have in production right now. Because here we have a typical example of using a "latest"-style image tag on Docker: what exactly is inside the image depends on when we last built it. We cannot pin every element of a given image, because we want to benefit from patches, but in this case it means we have to check factually what we have in production. I'm currently running the command inside the app container, and I will copy-paste the command I ran to determine the current versions of Jetty and Java, so we can answer Daniel.

So, to continue on that one: we need to assess that there is absolutely no risk on accounts.jenkins.io before putting it back. Right now we don't know — do you share my analysis?

I do: I do not know that that Docker image is free of any JDK or any Log4j 2 instance. So we need to validate these versions.

Regarding accounts and beta — sorry, beta and admin — beta is public and admin is private and will be kept private. Can I ask one of you to share the official Keycloak statement so we can put it in the notes? Not right now, we can do it afterwards. That will be our stamp saying OK, we can put the service back, if it's OK for everyone. It should be fine to put these elements back once we have that confirmation, plus a full scan of the Docker image if needed, as a second security check.

Good point. I propose we run the docker scan analysis — the tool in Docker Desktop — on the exact same image checksum as the one we're running in production, as a secondary check. Then we'd have the Keycloak statement saying it's OK, and the Docker scan analysis; if both are positive, we should be able to put beta and admin back online.

Sounds good to you? Then let's take the scanning option. Someone with
Docker Desktop? I don't know how to say it less bluntly, but that's the phrase: "better safe than sorry" in English — exactly the same meaning as the French saying.

Perfect. Stéphane, are you interested in running the scan, since I know you have Docker Desktop? We can help you find the correct image reference — are you interested?

Oh yes.

Perfect, so Stéphane is the designated volunteer for that task, at least for the beta image. And I propose, Hervé or Mark, that one of you does the same with the account Docker image, just to be sure.

Yes, if I can get access. I definitely have Docker Desktop and I'm happy to run any tools that will help. How do I find and download the exact image of what's being run for accounts.jenkins.io?

You have the tag of the image defined in the chart configuration file — that's the first step — and then you should confirm that the full SHA checksum of the image you scanned is the same as the one running on the AKS cluster, with a describe pod or a get pod, by exporting the pod definition. If you're interested we can do this together, since we'll do the same with Stéphane — the three of us — and each of you will then be autonomous in getting the correct reference. The difference is that I don't know whether Stéphane has direct access, while you do, Mark, but we can start with the three of us.

Yeah — actually, I'm quite sure I'm not confident using direct access even if I have it, so yes, I would love a session.

OK, then I will show you — unless, Hervé, you're OK with driving it? I'm asking just to be sure, because it's a learning opportunity for everyone, even if I do it.

Oh — sorry, two minutes, heater issue.

Maybe we don't need to record that part, right? Right, we'll end the meeting and then we can do the exploration.

Absolutely. So, while Damien's offline: I had opened up the jetty:jre8 Docker image that I downloaded — I'm not 100% sure it's the exact same image — and I see no reference to log4j-core anywhere in it. There are references to Log4j, but not to the -core artifact, which seems to be where the vulnerability lives. Now, I have not expanded every JAR file on the system to do a recursive search, so it could still be hiding.

I've put the statement from Keycloak in the Zoom chat.

OK. So the unpacked Jetty under /usr/local/jetty contains many different JARs, but it does not contain any JAR with the name log4j in it.

I'm just locked on the Zoom screen, no way to see anything else. OK, so I think we are at a point where we could stop the recording and continue the meeting without being actively recorded. Any objections? No. OK, so I'm
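[Editor's note appended to the meeting notes] The deeper search Mark describes — looking inside every JAR rather than just matching JAR file names, since the vulnerable class can be bundled in a JAR not named "log4j" — could be sketched like this. The function names and the example path are illustrative, not something agreed in the meeting:

```shell
#!/bin/sh
# Sketch: report JARs whose entry listing contains the JndiLookup class,
# the class that makes log4j-core 2.0-2.14.1 exploitable (CVE-2021-44228).

# Check one "unzip -l"-style listing, read from stdin; succeeds if the
# vulnerable class path appears anywhere in the listing.
listing_has_jndilookup() {
  grep -q 'org/apache/logging/log4j/core/lookup/JndiLookup.class'
}

# Walk a directory tree (e.g. /usr/local/jetty) and print offending JARs.
scan_jars() {
  find "$1" -name '*.jar' 2>/dev/null | while read -r jar; do
    unzip -l "$jar" 2>/dev/null | listing_has_jndilookup && printf '%s\n' "$jar"
  done
}
```

Note this only goes one level deep: a JAR nested inside another JAR or WAR would still need to be extracted and scanned again, which is exactly the "recursive search" caveat raised above.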
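[Editor's note appended to the meeting notes] Confirming that the locally scanned image is byte-for-byte the one running on the AKS cluster, as discussed, comes down to comparing sha256 digests. A small sketch — the pod name, namespace, and helper name are placeholders, not from the meeting:

```shell
#!/bin/sh
# Sketch: extract the sha256 digest from a Kubernetes imageID string so it
# can be compared with the digest of a locally pulled image. imageID values
# typically look like "docker-pullable://org/app@sha256:<64 hex chars>".
image_digest() {
  sed -n 's/.*@\(sha256:[0-9a-f]\{64\}\)$/\1/p'
}

# Illustrative usage (placeholder pod/namespace; run where kubectl works):
#   kubectl get pod accountapp-0 -n accountapp \
#     -o jsonpath='{.status.containerStatuses[0].imageID}' | image_digest
#   docker pull jetty:jre8 >/dev/null
#   docker inspect --format '{{index .RepoDigests 0}}' jetty:jre8 | image_digest
# If the two digests match, the scan ran against the production image.
```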