everything. And let's go. Okay, so I'll start with the agenda directly. The first thing I wanted to discuss today is the interactive testing I did for the redundant fetch fix, and the modification I discussed with Mark. First, the interactive testing. I shared the plan with you all yesterday: I took the scenarios I thought could affect the results of removing the second fetch. The first was the advanced clone behaviors, since these relate directly to git fetch. I wanted to see whether enabling or disabling fetch-tags would produce any difference in the information we have for the repository. Apart from comparing results, I also looked at the code to see how the second fetch handles the same behaviors, because any difference in behavior would mean the results could change when the second fetch is removed. That is what happened with one of the issues Mark caught: some references were not handled by the first clone API but were handled by the second fetch we perform. With fetch-tags, the second fetch does the same thing as the first: if we enable tags, the first fetch brings all the tags; if we disable them, there is no difference either. The second scenario was the shallow clone, first with no depth provided. I wanted to see whether the second fetch applies some default depth that the first does not, which could produce a difference in the commit history. When I compared the code, I could see that the clone API shares the same implementation for doing a shallow clone.
If I don't provide any depth, they both default to a depth of one for shallow clones, so the behavior is the same: if I remove the second fetch, the first fetch takes care of performing a shallow clone with depth equal to one. The third test was a shallow clone with depth two; I think it's the same thing, it won't make any difference in the results. Then timeout. I was fairly sure timeout would not make a difference, because both operations again share the same implementation: the git clone API uses the same functions the second git fetch uses. So there is no difference for timeout either. What does this timeout specify exactly? Basically, if you're cloning a repository and it takes more than, say, five minutes, it abruptly cancels the build; that's what happens, right, Mark? Yes. By default it's ten minutes for an operation. The next scenario was a wipe-workspace enforced clone. I tried this because it cleans the repository and forces a re-clone, so I wanted to see if that would change anything. I enabled wipe workspace, compared the results with and without the fix, and could not see any difference in the repository information we had. Then there was checkout of a specific branch, and I realized it happens after the step that involves the double fetch: that step is in a function called retrieveChanges, and checkout is a stage that comes later in the code. Then there's an interesting behavior I found which is not user visible, in GitSCMSource.
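As an aside, the two clone scenarios above (fetch-tags and shallow depth) can be reproduced with plain CLI git. This is a minimal sketch against a throwaway local repository, not the plugin's clone API:

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
# Build a small throwaway upstream repository with two commits and a tag.
git init -q upstream
git -C upstream -c user.email=a@b -c user.name=a commit -q --allow-empty -m "first"
git -C upstream -c user.email=a@b -c user.name=a commit -q --allow-empty -m "second"
git -C upstream tag v1

# Shallow clone with depth 1: only the most recent commit is present.
git clone -q --depth 1 "file://$tmp/upstream" shallow
git -C shallow rev-list --count HEAD        # prints 1

# Clone with tags disabled: no tags are fetched.
git clone -q --no-tags "file://$tmp/upstream" notags
git -C notags tag | wc -l                   # prints 0
```

Note the file:// URL: a shallow clone requires the regular transport, since git ignores --depth for plain local-path clones.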
As far as I could understand, it's done to revert to the default behavior, that is, enabling honor-refspec; the second reason I don't remember exactly. I was not sure if this would make any difference. I tried it and it did not, but since it involved honoring the refspec, I wanted to check it anyway, although I could not understand how this behavior gets invoked; I did not go too deep into how it works. I just wanted to test it and then ask Mark, Fran, and everyone whether it would make any difference. The pre-build merge scenario I actually did not try. I would not expect it to make any difference, but it's an interesting question. Pre-build merge requires at least two branches inside the workspace: there has to be a source branch and a destination branch, and you've got to have both one way or the other, whether from a wide refspec initially or from honor-refspec, and you declare both branches, so I think it is unaffected by this. Okay, so I haven't tested it; should I still? Given what you've done so far, I would propose you can safely skip that; don't worry about it, let's trust that it is not going to affect this. Okay. So after these interactive tests, there was one interesting problem Mark pointed out, and it concerns refspecs. Refspecs are mappings for references between a remote repository and the local repository. The refspecs the first fetch handles are related to branch references, that is, refs/heads with a particular branch, or refs/heads/*, which brings all the branches.
Now, if we want to check out a particular pull request, which is a feature specific to GitHub, or some other kind of reference on GitLab, the first fetch will not handle that if the user has not chosen to honor the refspec on the initial clone. Those references will be missed by the fix I proposed, and that is a serious issue for moving forward with it, because it would cause direct failures and break many use cases. So I had a discussion with Mark on how we could safely retain the fix and modify the code so that we don't break any existing use case and still avoid the second fetch, although with the current modification there are cases where I would still have to make the second fetch call to avoid breaking a use case. So, would you like me to go through the modification and the interactive testing I did on top of it? For the cases Mark pointed out where the code was breaking use cases, I tried those cases and it now works with them. Should I explain and go through that code, or would you rather review it from the PR and not spend this time on it? I'm open to either; I will have to review the code either way. So let's ask Omkar, Justin, and Fran: what's your preference? Do you want to skip detailed code review in this session and go on to other topics? You've got several other topics we need to address, Rishabh, right? This is not the only topic for today's session. Agreed. So I just wanted to ask: is that already raised, the one you mentioned? Yes, it's a comment on the latest PR. I'll share that PR on the Gitter chat; I think it's 904, if I remember right.
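The refspec issue described above can be reproduced locally with plain CLI git. A hedged sketch, where refs/pull/1/head stands in for GitHub's pull-request refs:

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q upstream
git -C upstream -c user.email=a@b -c user.name=a commit -q --allow-empty -m "base"
# Simulate a GitHub-style pull request ref on the remote side.
git -C upstream update-ref refs/pull/1/head HEAD

# A plain clone uses the default refspec +refs/heads/*:refs/remotes/origin/*,
# so the pull request ref is not fetched.
git clone -q "file://$tmp/upstream" local
git -C local show-ref --verify -q refs/remotes/origin/pr/1 && echo present || echo absent   # absent

# Fetching with a wider refspec brings it in; this is the kind of reference
# the second fetch was silently covering.
git -C local fetch -q origin '+refs/pull/*/head:refs/remotes/origin/pr/*'
git -C local show-ref --verify -q refs/remotes/origin/pr/1 && echo present || echo absent   # present
```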
I'm becoming very aware of the exact PR numbers for your work, Rishabh, 845, 904, because of the multiple builds. Oh no, it's because it's very important; I'm very interested. Okay, so Justin, Fran, and Omkar, would you like to discuss it, or should we move forward? I think we can move forward. Thank you. Okay. The second agenda item is the performance benchmarks, and benchmarking in general. Benchmarking has been surprising and irritating at the same time for me. First of all, the important thing to discuss is that I was profiling the Jenkins instance, with and without the fix, using Java Flight Recorder. What I was experiencing with consecutive builds was some kind of issue I could not pin down: there were huge time differences in the git fetch calls between some repositories, which is what I showed in the platform SIG meeting, and which was wrong. It also took a lot of time to change the repositories and run the instance again, and I wanted to do this with a lot of repositories to actually see how we're doing with the redundant fetch and what performance overhead would be removed by avoiding it. So I shifted my strategy to a JMH benchmark: I would just have to write a benchmark, parameterize it with multiple repositories, and then simply wait for the results. Also, if you have to do a root-cause analysis, a benchmark is theoretically one of the best ways to understand the root cause, or so I thought. So I tried a benchmark.
I've written two benchmarks related to the redundant fetch, and I'd like to show you the results. I've raised a PR for the redundant-fetch benchmark, and meanwhile I've also written another benchmark; I'm going to show the first one first. The first benchmark performs the initial clone: it clones a repository, and we see how much time it takes to clone it for the first time. It acts as a baseline we can compare against when we add the second operation, the second fetch call, to see how much time we're losing because of it. The second benchmark is, again, the initial clone, and then a fetch operation on top. One thing I realized while writing the benchmarks was that I was not doing any validation to check that these operations actually do what they should. What motivated me to add it was that some of my benchmarks were giving results I could not understand at all; I didn't know whether I couldn't write a benchmark or just couldn't interpret one. So the initial validation I've put in is this: the first thing we do is clone the repository from an upstream source repository to a local location for that benchmark instance, and at that point I record the size of the repository. Then, when the benchmarks run, they clone from that local upstream repository.
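The two-benchmark design (baseline clone versus clone plus second fetch) can be sketched outside JMH as a rough shell timing harness. This is an illustration of the idea, not the actual JMH code from the PR; timestamps use GNU date:

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
# Setup: build a local upstream to clone from, mirroring the benchmark's setup step.
git init -q upstream
for i in 1 2 3; do
  git -C upstream -c user.email=a@b -c user.name=a commit -q --allow-empty -m "c$i"
done

# Benchmark 1 (baseline): initial clone only.
start=$(date +%s%N)
git clone -q "file://$tmp/upstream" clone1
t1=$(( $(date +%s%N) - start ))

# Benchmark 2: initial clone followed by the (redundant) second fetch.
start=$(date +%s%N)
git clone -q "file://$tmp/upstream" clone2
git -C clone2 fetch -q origin
t2=$(( $(date +%s%N) - start ))

echo "baseline ${t1}ns  clone+fetch ${t2}ns"
```

JMH does the warm-up, iteration, and statistics that a one-shot timing like this cannot.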
Once I do the fetch, I compare the sizes to verify the operation has actually taken place, so that the times I'm analyzing and the observations I'm inferring aren't meaningless because the operations aren't doing what they should. That is the first validation I've put in, and I'm thinking of adding more, to check that the operations I'm benchmarking are real and the times are real. So, the results. Let me explain what you're seeing. These are the two benchmarks: this is the second benchmark, with the double fetch calls, and this is the first, with the initial clone only. The color grading means this: we have the git and JGit implementations, and I'm testing both to see how they compare. The two bars here represent two repositories, the Jenkins repository and the Ruby repository. Why those two? Let me show you the reason. The Jenkins repository has about 30,000 commits and 31 branches; the Ruby repository has 61,000 commits, basically double. One more thing I'll discuss after this is that I'm not able to find constant-size repositories, or anything close, when I look at real repositories. The Jenkins one was the closest thing I could find: it is 366 MB and the Ruby one is 471 MB, so there's a good 100 MB difference. But I thought I might get something out of this because the commit count doubles. Let's not look at the branches first; let's see if there's actually any difference because of the commits.
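The size-based validation described above can be sketched like this; a rough stand-in for the benchmark's setup and check, using du on the .git directory as the size measure:

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
# Setup: build the local upstream and record its object-store size.
git init -q upstream
git -C upstream -c user.email=a@b -c user.name=a commit -q --allow-empty -m "c1"
upstream_kb=$(du -sk upstream/.git | cut -f1)

# Benchmarked operation: clone from the local upstream.
git clone -q "file://$tmp/upstream" work
work_kb=$(du -sk work/.git | cut -f1)

# Validation: the clone must actually have transferred data; if the size is
# near zero, any timing numbers collected for it are meaningless.
echo "upstream ${upstream_kb}K, clone ${work_kb}K"
[ "$work_kb" -gt 0 ]
```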
So the results are for those two repositories, the Jenkins repository and the Ruby repository. The first bar is the Jenkins repository with the git implementation; the second is the Ruby repository with git. The lighter blue bar and the gray bar are both repositories with JGit. That's the first benchmark, and then the same again for the second benchmark. If I can infer anything from these benchmarks, it is that there is no tangible real-life difference between a single fetch and adding a second fetch on the same repository. As you can see, the first benchmark takes 11 seconds per operation, and the second, with git, is again 11 seconds. There was some difference, a few milliseconds, but not much. I chose seconds as the time unit this time because I wanted to see the real-life difference from adding the second fetch. The only difference I could see was with the last case, JGit with Ruby, where there was a one-second difference between the two benchmarks, which is kind of... So you're confident that it was really using JGit as the implementation? Those results are so similar to each other, it seems like they could both be using CLI git, or both JGit. That's fascinating; I can't explain what you're seeing. That's really interesting. So I have another test for the redundant-fetch benchmark. And you looked at the logs and saw what they were showing in both cases? Or did you use some technique to confirm that, yes, I really am using JGit for this one and CLI git for the other? So, with the benchmarks, I print the implementation I'm using.
That is how I'm sure how it's being calculated. And since this looks a little absurd, I actually have a place where I ran these benchmarks; I'll show you. Okay, and that one was clearly using command-line git, because it has the command-line git markers in it. Which one, Mark? I just saw a screen go by that had the clear logging from command-line git; I don't know that it's related to what you're doing. Go ahead and show us the files you were going to. Yes, this is the run that happened; the visualization you saw is these results in this form. Here you can see that the first benchmark, which is just the initial clone with git, gives us 11 seconds, and git with the second benchmark, which has two fetches, gives 11.181 seconds. So the difference is very minute, which is actually not a good thing. Okay, that supports the observation I had earlier. Initially people told me this fetch is enormously expensive, and you have found at least one case where the redundant fetch is not enormously expensive. That doesn't mean it's always free, but at least you've found one case where it is free, at surprisingly low cost. Interesting, fascinating. I wonder: when you reference the repository on local disk, do you reference it by absolute path or by file:// URL? File:// URL. Okay, because you may want to read the CLI git documentation. I don't think they do the same optimizations in JGit, but CLI git may do some things where it says: I know this is local. And remember that the person who started writing this was Linus, and he therefore thought very seriously about file systems. If it knows the source is local, it can just use hard links, or symbolic links; there are all sorts of things it could do knowing, oh, this is local.
It can just take advantage of the fact that it's local. Okay, that might be it, because with these benchmarks we're fetching the repository from a local file-system repository, so this is not a real-life use case. Because with profiling I saw results, not huge ones, but at least a ten-second difference: the second fetch was costing around ten seconds, maybe eight, maybe twelve, at least that much. So this was a little surprising. Well, another argument here might be that the network transfer time for the incremental fetch is important, maybe even a dominant factor, and therefore we won't see it in this intentionally controlled environment. Good. Okay. Interesting. Yes. And one more observation, which we've discussed already, is that with a larger repository JGit performs far worse than CLI git does. Right. I have a question: why do we offer JGit as an option when we see that for any normal-size repository JGit is going to perform worse than CLI git? I actually don't know why we use JGit; I have never asked. So: the original dream, many years ago, before I became a plugin maintainer, was that JGit would eventually be as good as command-line git, and we would get better results by using a full native implementation. The reality, about a year into using that implementation, was that we learned very painfully that JGit was not a complete implementation of CLI git, and since that time the evidence has suggested it will probably never be one. The people who maintain JGit are very committed to it, and they do great work for the things they need from it.
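On Mark's point about local clones: per the git-clone documentation, cloning from a plain local path lets git bypass the normal transport and hardlink object files, while a file:// URL forces the regular fetch machinery, which is what the benchmark exercised. A small sketch of the two forms:

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q upstream
git -C upstream -c user.email=a@b -c user.name=a commit -q --allow-empty -m "c1"

# Plain path: git's local optimization may hardlink files under .git/objects
# instead of copying them.
git clone -q "$tmp/upstream" bypath

# file:// URL: uses the normal fetch machinery, closer to a network clone.
git clone -q "file://$tmp/upstream" byurl

git -C bypath rev-list --count HEAD   # prints 1
git -C byurl rev-list --count HEAD    # prints 1
```

Both clones end up with identical history; only the transfer mechanism differs, which matters when timing the transfer itself.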
But of course they work on the things they need. The one use case where JGit is very, very helpful is when you have a platform where you can get Java but there is no command-line git port; JGit will still work for you. Apart from that, for large repositories it looks like we have clear evidence it's never interesting. The other danger with large repositories is that JGit uses Java virtual machine memory to do the clone, so you have to worry about memory leaks or inadequate garbage collection inside the JGit implementation, whereas CLI git is always a sub-process and the operating system will garbage-collect it for you. So yes, your observation is very wise: why use JGit for anything larger than about 10 megabytes? Okay, and the next benchmark is the one I have actually raised a PR for. With this benchmark we have multiple repositories, small Jenkins plugins, and I incrementally increase the size, the number of commits, and the number of branches to see what we get. The conclusion I came to was that we may need to create repositories of our own with constant sizes but different numbers of commits; Omkar also suggested that. The issue I see with that is that we will never be able to simulate an actual repository. For example, right now I can compare two repositories with a 30,000-commit difference; it would be very difficult to set all of those parameters on repositories we create ourselves while benchmarking. But for a clean sensitivity analysis, where we want to find out how a single parameter, like the number of commits, affects the execution time of git fetch without holding the size of the repository constant, I'm
not sure how we'll be able to confidently say that if we increase the number of commits, this is how the execution time will change, because from what I've seen the size always increases when the number of commits increases; I think that's a fairly obvious fact. So that is one of the issues with both strategies. With this benchmark, sorry, I had four repositories, and it does almost the same thing. Actually, there is a difference: in the earlier benchmark I was cloning the repository for the first time within the benchmark test, so I was benchmarking the execution time of that operation as well. Here, that operation takes place in the setup, before the benchmark, so ideally it should not affect the time; what I should get from this benchmark is the execution time of the incremental fetch alone. I'll show the benchmark: it's an incremental fetch with the git client I'm using, referencing a git repository that should already have been fetched from the local git repository I have. So the results here: the colors are multiple repositories with git, and then with JGit; it's just one benchmark, so the result isn't that confusing. With git, as we increase the repository size, one positive result I can see is that the execution time, the cost of the incremental fetch, increases. The increase is in milliseconds, but it's an increase, and I'm sure that as I take it to much larger repositories we'll see a change. But what I have to do after this, one of
the most important things, is to map these theoretical observations to practical observations. For a practical observation, what I've seen is that I can use the JFR profiling tool to see, for those repositories, what performance overhead I am removing by avoiding the second fetch. With this we can see that, okay, there is a change, a difference, an increase, when we increase the size of the repository; but the number of commits also increases and the number of branches increases, so I can never say for sure what is contributing the most to the git fetch time. Since the size of the repository is increasing, there's no way I can say the commits are why this is happening. For that I would need, say, two 500 MB repositories with a clear difference in the number of commits, possibly 20,000 commits in one and 30,000 or 40,000 in the other, so that I can see, for constant-size repositories, whether the execution time increases or decreases as the number of commits increases, or whether it has no effect. So, but that is... yeah. I thought our intent here was to understand which things we should include in the sizing heuristic, and isn't your observation here saying we should include both repository size on disk and number of commits, because they both show that as they increase, the execution time increases? So do we already have enough information here to say that the number of commits and the size of the repository on disk are both relevant to performance, so we include them in the heuristic? Yes, Mark, you're right: the ultimate aim is to find how performance is affected and by which predictors. But what I'm saying is that we're not able to test them independently, as independent variables; they are dependent. I'm not sure what is contributing
more to the performance changes in git fetch: is it the file size, the size of the pack, the .pack object, or the number of commits? Of course, most of the performance would be affected by the pack size; that's my common-sense guess, though technically I haven't confirmed it. Yes, Mark? No, no, you go ahead and finish; excuse my interrupting. So, the hypothesis I wanted to test was this: if we have a repository with a large history but maybe not a considerable size, would that affect the second fetch more? What I assumed about the second fetch was that the first fetch would download, would clone, all the objects, the packed objects, and the second fetch does not have to do that. What I think it does, and I haven't looked into this to confirm it, is iterate through the commit history, or basically compute the incremental differences, any changes in the repository; to do that it would go through the history. So my hypothesis was that the redundant fetch would have a considerable performance overhead for repositories where the history and the branches are larger, and I would say there is a considerable number of such repositories. That is something I wanted to test, and I'm still not sure. With these benchmarks we are sure that with increasing size and number of commits, the performance overhead of the second fetch increases; we can see that in both benchmarks, not so much in the first one, but in the microsecond time unit we can see a clear difference. But I'm not sure how the independent variables each contribute to the performance. And also, again, we can see that JGit actually performs better for small
repositories. The observation we have is that for a small repository JGit is going to perform better than git, and we've seen that with this benchmark as well; it performs better, though the difference is not much in real time. With much larger repositories, JGit does not perform better. I'm not sure how much of this would be user-noticeable, but theoretically JGit performs better than git for small repositories. So, with the benchmarking strategy: if our aim is to make an estimator to estimate the size of the repository, what parameters do we need? The obvious one is the size of the objects; it's safe to assume the number of commits and the number of branches matter too, but how much they each affect performance independently is something I have not been able to figure out. That's my concern. So, what do you guys want to say? Yeah, I think you've answered the question: should we include size and number of commits in the assessment? Absolutely. And you've got data here that says JGit is marginally faster for small repositories, so there's another incentive. We should now probably look at code, or put you into code, and say: all right, how do we use this to implement the heuristic, the size estimator, and start getting it into the code, to give people the option to say, I want to use the fastest thing for my repository. Okay, that is what I thought as well: we have clear evidence of how some of the parameters affect things, so we can start working on the estimator. And I think the next agenda item I had was the analysis. Fine, so we have discussed the performance
predictors for git fetch; now, the repository size estimator. I thought I would write a class and show a prototype of the approach, but I could not do that; I was stuck with the benchmarks and the fix. But I did research the heuristics we were talking about a little. The first and easiest option was to use the GitHub API, or the APIs already exposed by these providers. With GitHub, though, the problem I have seen is that they give the size of the bare repository instead of the actual repository, so it might not be a clear indicator of the size. I experimented with microsoft/vscode: I cloned it, and I also tested the API provided by GitHub to check what size it returned. According to the GitHub API, the size was around 300 MB, but when I cloned it, it was around 900 MB; that's a huge difference. So I checked around and found that GitHub keeps bare repositories on their servers, and that is the size they return when we request that information. Yes, Mark? When you say that the repository in your local copy was 900 MB, does that mean that the .git directory was 900 MB, or the whole workspace including the checked-out copy? Rishabh, I think we may have lost you. Yeah, I think we did. Okay, well, while Rishabh gets ready to come back, and I assume he will eventually reattach: as mentors, we've got an upcoming demo day arriving soon. Oh, he's back, good. I'm sorry, I got disconnected. So what were you saying, Mark? I missed it. When you say that the vscode repository on your disk was 900 megabytes, is that just the contents of the .git directory, or the entire checked-out copy, not just the .git directory? So, to measure that,
how do I measure the size when I'm cloning the repository? How do I do that? When I'm cloning, it usually shows us downloading the objects, and it shows the size. I'll show that instead of explaining. So I'll show it here; it takes a little bit of time. It shows the amount of objects it's downloading, and the size; this is what I consider the size of the repository when I'm cloning, if you can see my screen. And I have no idea what that number represents. All right, I don't know what that represents either. I usually look at the du -s output for the .git directory, because that tells you the size on disk of what is fundamentally almost the bare repository as represented on the other side. I think I have to confirm that. I think I did check the objects, the .pack object downloaded by the clone, and I could see similar sizes from this output and from that object, but I'll check that first, Mark, to see if that is right. Now, for the estimator, the estimator class: one option Mark gave, which is a great option, is that if we have a cache repository for the project, we could use that if it exists. We do that currently in our code, right, Mark? If it exists already, it's the best way to estimate the size of the repository, so we could use that. I haven't explored for what lifetime that cache exists. Where would I find that cache? I assume it's in the workspace, right, the agent workspace? No, it's on the master, only on the master, actually. Okay, so Mark, if it's on the master and I have my workspace on the agent, what do I do, reference it? For a second I assumed master and agent share the same file system, and I'm actually not very sure how that's going to
work so so the the the the execution of most of the logic is happening on the master for you therefore you can ask questions of the cache on the master pretty directly so things things like the when the git scm object is created you can assume that's on the master and that scm object then can can look at the local cache and interrogate the local cache so I don't think you have to I think it'll be pretty straightforward actually if you just use that I don't even think you'll have to do a cache lock in all seriousness because I think all you're trying to do is look at the file system so you get the get the directory of the cache and then knowing the directory name you go use file system calls to ask for the size of the contents of that directory and that gives you a relatively quick approximation of the size of that repository repository yeah okay so I'll let you explore that I think I'll write a prototype as fast as I can for the estimated class uh so um I think we've we've extended the even officially unofficially extended this to one is that okay with everyone I'm sorry that oh yeah for me I trust that the mentors who can't be here will drop off so so you that that's we set it at 30 minutes and we'll we go as we can okay so I did want you to get to the the demo work the demo plan because you've got a demo coming up is it next week or the following week I think it's next next week okay great yeah so um so I have to do a bunch of things I saw the mail so I have to publish a blog post before that I think we should discuss what all should I put in the demo and so um from my side what I would like to show is first the whole benchmarking strategy how we're doing that the code and then the results what I would also like with benchmarking maybe one more operation so that it's maybe like gate ls remote I was interested to see how that would work maybe I have some interesting observation to show for gate ls remote so I I I want to expand the benchmarking strategy for those 
two operations. That's the first thing we could show. The second thing would be the redundant fetch work and how we did it.

I was thinking that for the demo I would have to show something visually, as a feature or something in the user interface, and most of what I have is code or raw results. So I'm actually not sure what your expectations are. Are you the people who will be on the panel for my evaluations, or is it the core committee of Jenkins? Yes, we're the evaluators; you've got all four of the evaluators online.

I think if you show graphs, you don't have to show a Jenkins UI. Graphs and highlights of "here's what we've learned as part of this exercise, look at this, here's an improvement here, here's an improvement there" will actually impress people more than showing them a Jenkins UI ever would, because we knew this was a performance project.

That's a great relief, because as I was looking at the other projects I was thinking: they have their own plugins, they have user interfaces, and I don't have any of that. Right. From my experience last year, we had some other projects that were similar to this as well, and it's not a big deal. If it's a plugin-based project where you're actually building a new plugin, you might get into demos of how it works and the user experience and so on, but like Mark said, I'd definitely focus on the meat of this project, and people will like it. In a perfect world they will see nothing different; it will just be faster. Yes, and if you show "I'm going to show you nothing, except it's faster," that alone should already delight people. It's like, wow, that's great, because usually it's "it's faster, and I had to break the following things in order to make it faster."

Okay, so what I'm thinking is this. The first thing is the benchmarking strategy: what I did with git fetch, how I improved the benchmarks, that they're running on Jenkins, and everything. One thing which is missing right now, which I haven't shown you, is integrating the JMH visualizer plugin on Jenkins. I still have to do that, and I think it's going to be a great improvement because we'll be able to see the results visually. I'm going to do it for git ls-remote as well, so the benchmarking strategy covers both operations.

Then, for the redundant fetch fix: would you be interested in seeing the testing scenarios and the cases we considered while fixing it, the use cases where we had to check whether we would break them and how to do this safely, the whole thing? Or is that something we don't have to discuss, and we just cover the fix and then the benchmark showing that the fix yields some improvement?

For me, I'd keep the testing in your back pocket in case somebody asks, "hey, how did you check this?" I'm gravely concerned about not breaking compatibility; that's a big deal for me. But the larger audience will probably just assume that of course no one's going to break compatibility, so they'll be more interested in your results, with numbers, the performance results and your observations of the characteristics we saw.

Okay. So, since the benchmarks, as we've seen, aren't showing much of a difference in the theoretical results, what I'd like to show for the redundant fetch is the profiling results, as
much as I can, so that I have a large sample and the result isn't something we don't expect.

Well, I think it's okay to show the surprises as well and say: welcome to the real world, sometimes we get surprised by how software behaves. I feel no shame in declaring that you were completely surprised to see this result compared to that other result, and that more investigation is needed. That's perfectly okay.

Okay, so: the benchmark results and the profiling results, both of them, for the redundant fetch issue. Then the third thing would be the estimator class, if I'm able to create it with the heuristics we've thought about. I need to first consolidate the approaches we can take and check whether it's even possible the way we want to do it, because right now it's not mature. With the APIs I was actually seeing something I discussed: a difference in the size because of the bare repository versus the downloaded objects. I have to confirm that, along with the cache approach. If the cache doesn't work, then what do we do? With the cache, I think estimating the size is simple, but if we don't have it, that's the real work, where we'd have to understand how we could estimate the size. I was hoping that the number of commits and branches would help a lot: if we could get the number of branches and the number of commits, then even without the size of the objects we might be able to make the decision. But with the experiments I've done, I'm not able to find out independently how those factors contribute to the performance. Maybe I'll try some more experiments to see if I'm able to isolate that. I think I already have a lot to do, yes.

And one other thing you could maybe try, if you wanted to rule out the disk effects, the Linus thing that Mark was talking about: you could set up a Bitbucket or GitLab server on your local network and put these repositories on there, because maybe Git's not going to optimize for being on the local file system. That's an optional possibility. Well, and Rishabh, I have an environment that we could use to simulate exactly what Justin described: a local Git server on my network that happens to be full of all sorts of interestingly sized repositories. So Justin's idea is good.

However, even before that, I think you've gained a crucial piece of knowledge that you haven't highlighted nearly enough in your summary. It is that there is a performance curve for command-line Git and a performance curve for JGit, and there is an intersection between those two curves that depends on repository size factors. Truthfully, before you did this project we did not know that. I had an assumption, but I had no data to support it. What you have is hard data which says: as these attributes of the repository increase, the characteristic performance of Git is like this, and JGit is like this. If that curve is your opening slide, even for me that would be great, because it says everybody should be aware of this characteristic of the JGit implementation. You've done concrete measurements, and the measurements keep showing, over and over again, this exact same story: with large repositories, JGit is a poor choice. People should be aware of that. You've already contributed to the body of knowledge just with that initial graph.

Okay, and I'll make sure of that. I did not highlight it this time because I thought we'd already had discussions about the benchmarks which show these results; that is why I did not highlight it
as much. So this is what I was saying I'll show in the evaluations. Apart from that plan, I also need to write a blog post, where I'll probably show the results, the thing you just described, and the benchmarks, and I'll also have to make a presentation. Would you guys like to be a part of how I make the presentation, or is it something I just make and show, without you being involved?

I propose that you show us an initial framework of the presentation on Friday, if you would be willing, so that we have a chance to give you feedback. For instance, the program asks for a blog post, and I thought the performance results you've seen would make a great blog post. Let's say, just for the information of Jenkins users, without any code change: you should be aware that if you choose JGit and your repository size is larger than some threshold, you are sacrificing performance, intentionally or not.

Okay, so I'm going to do that, and for the presentation I'll show you guys something on Friday, a sample presentation. Just to highlight: we use JGit as the implementation on ci.jenkins.io, and that's fine for small repositories, but remember that the documentation repository and the Jenkins core repository are both well beyond the threshold size you've identified. So I already have an improvement to make on ci.jenkins.io, to get some performance back.

Okay, so I think this is what I wanted to discuss. About the blog post, I wanted to ask: do I have to do it on jenkins.io, or can I do it elsewhere? Is it mandated to be on jenkins.io? I was setting up a GitHub Pages blog and thinking I could do it there. You are welcome to start it there, but ultimately I will expect it to be a jenkins.io blog post; having it on jenkins.io will give you some visibility. Okay, that'll be good; I'll add it to the jenkins.io page.

Okay, so the evaluation starts in June? Yes. Justin: yeah, plus one for demoing your demo; that was a good way for us to give feedback beforehand. And one thing that we did before too, which is up to you: I think we had done it in Google Slides. It doesn't matter what technology you use, but if you want to share it with us, we can do markup and comments if you want feedback on some things too. Up to you. Not that I want you to use Google Slides necessarily, but I like to use it because it makes the work collaborative, and we can collaborate in real time.

Okay, I think that's it. Thank you, guys, for spending much more time than was allocated for the meeting. Thank you, Rishabh. Okay, thanks, I'll post the recording. Thanks, everybody, have a great day. We'll talk to you Friday, Rishabh. Yes.
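The file-system approach Mark describes for estimating a cached repository's size, walking the cache directory on the master and summing file sizes much like `du -s`, could be sketched roughly as below. This is only an illustration; the class and method names are invented, not part of the git plugin.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

/** Sketch: approximate a cached repository's size on disk, like `du -s`. */
class RepoSizeEstimator {

    /** Sum the sizes of all regular files under {@code dir}, in bytes. */
    static long directorySizeBytes(Path dir) throws IOException {
        try (Stream<Path> files = Files.walk(dir)) {
            return files.filter(Files::isRegularFile)
                        .mapToLong(p -> {
                            try {
                                return Files.size(p);
                            } catch (IOException e) {
                                throw new UncheckedIOException(e);
                            }
                        })
                        .sum();
        }
    }
}
```

Pointed at the cache directory on the master, this gives the quick approximation Mark mentions, with no cache lock needed since it only reads the file system.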
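The performance-curve observation discussed above, that command-line Git and JGit cross over at some repository size, reduces at decision time to a simple threshold check. A minimal sketch, assuming a hypothetical crossover value: the real number would come from the benchmark data, and the class and constant names here are invented for illustration.

```java
/**
 * Sketch of the estimator's final decision: prefer command-line git above a
 * size threshold, JGit below it. THRESHOLD_MIB is a placeholder, not a
 * measured crossover point.
 */
class GitImplementationChooser {

    /** Hypothetical crossover size in MiB; the benchmarks would supply the real value. */
    static final long THRESHOLD_MIB = 100;

    /** Returns "git" for repositories estimated above the threshold, else "jgit". */
    static String choose(long estimatedSizeMib) {
        return estimatedSizeMib > THRESHOLD_MIB ? "git" : "jgit";
    }
}
```

If commit and branch counts turn out to predict performance independently of object size, they could become extra inputs to this decision; the experiments described above have not yet isolated their contribution.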