All right, recording is started. Welcome, everyone. It's the Git plugin office hours for the performance improvement project. Rishabh, take it away.

Yeah, so I'm sharing. You have to enable screen sharing, Mark. Once again, Mark is failing to do the security thing. You can now share your screen, Rishabh.

So for today's agenda, the first thing we have to discuss is adapting the benchmark code inside the Git client plugin. Mark, you were saying yesterday that it's an issue that the benchmarks take a long time to run, and if anyone is changing the code for anything, it's unnecessary for the benchmarks to run for every build we create, for every PR that anyone raises. You were saying we should change that. What change were you suggesting? I could not understand it at the time when you wrote it. Were you saying we should remove it from the Jenkins build, from the flow of the Jenkinsfile?

No, I think, well, this is a great chance for the mentors to help. One of the techniques we could use is to put a conditional in the Jenkinsfile which says: if the JENKINS_URL environment variable is ci.jenkins.io, and if the branch is master, then run the benchmarks. Or we could also say: choose a branch name that's symbolic, say gsoc-something, and if the branch name matches "gsoc-*", then we run them there. The typical PRs won't use that name and therefore won't run the benchmarks on those pull requests. So I think we absolutely want it on master, and we absolutely want it on your pull requests. That means we need some conditional logic in the Jenkinsfile that says: if we're on this branch, do it; if we're not on that branch, or we're not on ci.jenkins.io, don't do it.

OK, I understand the conditional logic in Jenkins. Now let's let the other mentors chime in. Does that seem reasonable to everyone else?
Or do you have other ideas or better suggestions to reduce the load?

So it simply means we are supporting a wildcard here, if I'm not wrong. I don't understand the question, Omkar. What I'm asking is: is the conditional logic simply based on a wildcard, like master or "gsoc-*"? And that's something that whoever does it, probably Rishabh, actually, when you write the entry in the Jenkinsfile, you'll have to figure out: is there a way to do a wildcard pattern match? I think Groovy has wildcard string matching, but you'll have to go figure that out. I don't remember; I always have to go look it up.

Yeah, you should be able to do something like that. You may not even need it, necessarily. If you decided that master and then some one GSoC branch name are the ones you wanted, that would work. Or maybe you're thinking there would be multiple GSoC branches. Yeah, my assumption is any branch that starts with "gsoc-", so make it a wildcard match so that he has the chance to run multiple branches in parallel. Yeah, Groovy is just Java with syntactic sugar, so you can use startsWith for that case. And for the Jenkins URL, there's an uppercase JENKINS_URL environment variable you can use to determine that you're on ci.jenkins.io.

Well, and Rishabh, just to be fair, it may be that the Jenkins URL check is unnecessary. The Git plugin is somewhat special in that one of the maintainers runs his own CI infrastructure and cares very much about it. But that's kind of unusual; most maintainers don't run their own dedicated CI setup with lots of agents, et cetera. So if you skipped the Jenkins URL check and just checked the branch name, that's probably already good enough. And if there's any question about how to express that in the pipeline script, you can ask in the Git plugin channel or in the Pipeline Authoring SIG channel. They're both more than happy to help.
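A sketch of the conditional the mentors describe, in scripted-pipeline Groovy. The stage layout and the "gsoc-" branch prefix are assumptions for illustration, not the plugin's actual Jenkinsfile:

```groovy
// Run the long JMH benchmarks only on master or on "gsoc-*" branches,
// and only on the ci.jenkins.io infrastructure (the URL check is the
// optional part the mentors said could be dropped).
def onPublicCI = (env.JENKINS_URL ?: '').contains('ci.jenkins.io')
def benchmarkBranch = env.BRANCH_NAME == 'master' || (env.BRANCH_NAME ?: '') ==~ /gsoc-.*/

if (onPublicCI && benchmarkBranch) {
    stage('Benchmarks') {
        // long-running JMH benchmark invocation goes here
    }
}
```

Ordinary pull requests get branch names that don't match either pattern, so they skip the stage entirely.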
OK, I'm going to research it, and if I have doubts, I'll ask. Great.

OK, so the next thing on the agenda is the current status of my tasks. I've been working on the Git redundant fetch issue; that's mostly where I've spent my time. The first thing I did was to solve the current issues we had, which were the test failures on that PR, and I fixed those. During that process, I learned there are a lot of scenarios I had missed. So I think I should first discuss what the issue is, then the solution, and then the scenarios we have to think about.

OK, so I've explained the issue in a PR I have. Basically, in the checkout step of our Git plugin, when we're cloning the repository for the first time, what happens is that we use this clone_() API, and this is essentially a git init plus a git fetch. It's not actually a git clone, but it gives the user all the functionality a clone can give. We do this first, and then we have another git fetch. My assumption is that we have the second git fetch because, for example, if I have a fork repository and I mention it as the repository URL, and it has five more commits, then I can get that difference in commits when I'm running the build. That's one possible use case I could see for having the second fetch. So that's the why. Mark, could you jump in here and explain? This is my assumption. Is it correct, or is there a larger reason why we have the second fetch?

I think the second fetch is most likely a terrible accident, and it's an accident that has existed for a very long time. So I wish I could give you a better answer, Rishabh.
It's embarrassing to say I can't give you a better answer, but I suspect that second fetch is accidental, or accidental is a funny word for it, but that second fetch is there and it's been there for a very long time.

Yeah, that's okay. That's not important; I think the solution and the things we have to care about are more important. So I'll show the fix I created for it. The fix is very simple. I just created a boolean flag: if we are cloning the repository for the first time, it's true, and we skip the second fetch, since the clone already performed the first fetch. This is the solution I proposed at the time I was writing the proposal.

Now, the things we have to consider when we're fixing this issue, when we're avoiding the second fetch. The first thing to consider is the possible options we provide with the clone behaviors and the fetch behaviors, that is, the decoration of the fetch command and the decoration of the clone command. The second is the repository data: when I'm avoiding the second fetch, I have to make sure there is no loss of repository data. For possible repository data loss, the only option of concern is honoring the refspec on initial clone. For that, there are some scenarios I've mentioned in the PR: what could happen when we are using honor refspec on initial clone and when we are not. If we're using honor refspec on initial clone, if our refspec is narrow and the second fetch is being skipped, we have to make sure that... wait a second, I think I'm confused here. First of all, let's talk about when we are not using honor refspec.
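The boolean-flag fix described above can be modelled with a small self-contained sketch. The class and method names here are illustrative stand-ins, not the git plugin's real API:

```java
// Schematic model of the fix: the plugin's "clone" is really init + fetch,
// so a flag records that the initial fetch already ran and the redundant
// follow-up fetch can be skipped. (All names are hypothetical.)
public class CheckoutSketch {
    private boolean clonedWithInitialFetch = false;
    int fetchCount = 0;

    void cloneRepository() {      // models CloneCommand: git init + git fetch
        fetchCount++;             // the fetch embedded in the clone
        clonedWithInitialFetch = true;
    }

    void fetch() {                // models the follow-up fetch in checkout
        if (clonedWithInitialFetch) {
            clonedWithInitialFetch = false;
            return;               // skip the fetch right after a fresh clone
        }
        fetchCount++;             // later fetches still run normally
    }

    public static void main(String[] args) {
        CheckoutSketch c = new CheckoutSketch();
        c.cloneRepository();
        c.fetch();                              // skipped: clone already fetched
        System.out.println(c.fetchCount);       // prints 1
        c.fetch();                              // a later incremental fetch
        System.out.println(c.fetchCount);       // prints 2
    }
}
```

The point of the model is only the control flow: one network fetch for a fresh clone instead of two, with no change to subsequent fetches.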
So if you're not using honor refspec, the git clone API basically assumes it's going to fetch all the branches; even if I've given a refspec, it's going to fetch all the branches. And that happens because we couldn't break an existing use case related to the Git plugin, as far as I remember, Mark. That's why it was done that way. So then we are not honoring the refspec, and what we'll have is a wide refspec, and wide means we will be fetching all the branches. So even if we're avoiding the second fetch, it doesn't matter: we have all the branches, so there will be no repository data loss.

The only scenario where we could lose some data is when we have a narrow refspec, where I'm fetching just a particular branch in the first call. Since I'm avoiding the second call, which before the fix would have fetched all the branches, I could possibly lose some of the branches or some of the data. So this is the first thing we have to consider when we're creating the solution, and when we're creating the automated tests we have to test whether this is happening or not. Questions? Because I think I'm not giving a very good description. Do you have questions on this particular scenario? Then I'll explain the next scenario I figured out.

Yeah, I think I've understood it. What you've highlighted is that this is much more complicated than expected, or that this one already had a complication. I'm comfortable with this one; continue with your next.

Okay, so the next one, I think, is the one I was completely perplexed by until I saw what you learned. Okay, so after that, what I understood is that one of the unit tests was failing because of my fix, and that was because of a behavior, an additional extension we provide, called clean before checkout.
Because of that unit test, I understood that when I'm avoiding the second fetch, I'm also avoiding some of the behaviors we provide through the fetch command. So to make sure I'm not missing anything, I created an exhaustive list of the possible behaviors we have: which are related to the clone, which is git init plus fetch, and which are related to the fetch command. After creating that list, I listed all the options we could possibly have. What I understood was that with the fetch command we provide clean before checkout, we provide pruning stale branches, and we provide pruning stale tags. Apart from these, the clone command and the fetch command have common options.

The point of highlighting this is that if I want to avoid the second fetch, I need to add these behaviors to the first call; if I don't, I will be missing some of the additional behaviors we provide. Those behaviors are clean before checkout and pruning, both branches and tags. So my solution to this problem is basically to add features to the clone API, to provide all the features that are missed while we're skipping the second fetch call. And this is, in my view, a good solution because we're not breaking any functionality or compatibility: I'm not changing the fetch command in any sense; I would just have to add a few things to the clone command and the clone API. I also checked that we are not using this clone API anywhere except for cloning the repository for the first time in this step, and in test cases.
So this is why I chose this solution, and to implement it there are a few things we'd have to do. First, the clean before checkout option: I'd have to add it to the clone step, and I think that's pretty simple, because we just have to perform a git clean for this option; it basically cleans the current working directory, that's what git clean does. So to migrate this option to the first clone call, I wouldn't have to change the clone command itself; I'd have to implement the clone command in the clean before checkout option. I have done that, and I've raised a PR for it; I'll post the link on the Gitter channel.

After doing that, the second concern I had was whether there's a difference between the advanced clone options provided by the git fetch API and the git clone API. What I found were differences that might not concern us here, but I still wrote them down. The first is the refspecs: in the case of git fetch, we have to add the refspecs while we're calling git fetch, and in the case of git clone we're not doing that. The second is that we expand the references using the environment in the clone command, but we're not doing that in the fetch command. I assumed that was done thinking the first call is always going to expand it, so the second wouldn't need to; that's why it was not there for the decorated fetch command.

Actually, I think that's just a bug in the fetch decoration, because it should have been resolved in both places, right? I think we actually have a bug report that specifically says it should have expanded variable references in the fetch in addition to the clone. Okay, I'll look into it. Is that an existing issue, you're saying?
I believe there is, yes. There are several issues around places where the plugin does not correctly expand environment variables, and I think refspecs is one of them. There may even be a pending PR to fix the expansion of refspecs. Yeah, I raised that PR, Mark. Oh, that was you? Sorry, Rishabh. That's okay.

So the next thing we have here is prune. The issue with pruning is that pruning is related to git fetch; it's not provided in git clone. Although in git clone we are basically doing git init plus git fetch, so logically it won't make a difference, if I add pruning as an option to the clone command and the clone API, I assume that makes it a little confusing. The only issue I have with shifting it, or adding it to that API, is that clone doesn't provide pruning, and I would be adding it as an option just to accommodate this particular issue. That's the concern I have with shifting it. For clean before checkout I think it's pretty easy, but for pruning, either branches or tags, that's the issue I have.

So these are the scenarios I've explored so far. I want to know, first of all, if there are questions on what I've said, and secondly, am I missing things? Where should I look more deeply?

I haven't detected anything that you're missing at all, and I'm delighted with the analysis you've done. I think adding those prune options is exactly the right thing to do, because the thing the git client plugin calls the clone command is a conceptual thing more than it is a specific promise to call git clone. It's never been able to call git clone: when the earlier maintainers attempted to call git clone, they very quickly realized, oh, I can't define this, and I can't define this, and
they reverted back to init plus fetch. Now, there have certainly been requests asking, please call git clone, but in order to do that we would have to lose functionality, and that's not really palatable right now. Therefore I think your proposal to add capabilities to the clone command makes good sense.

Okay, so just to put one point here, Mark: are we taking into consideration any future changes that git fetch might have? It's not that frequent, git is not changing that frequently, but in case any new functionality comes, we'll have to put that in our code as well, right? If there are additional features... and there are certainly examples of that already, things like large file support, which might well be better done inside the clone command than where it is currently done. But I don't think what Rishabh is proposing harms our ability to add future extensions; he's just adding extensions, in a very reasonable way, to the clone command. Fran, does that seem reasonable to you?

Yes, I don't think we're going to lose any capability, but I admit I wanted to ask about the possible performance impact on the clone command.

Okay, is your concern that the extensions being added may harm the performance of the clone? Yes, if I understood Rishabh's proposal properly, we are moving some features to the clone command, so we are doing the prune there, and we are doing that just to try to avoid the second fetch command. So I was wondering how that can impact the performance. Whether it's exactly the same performance, or something that is acceptable, I don't know. Maybe.

Yeah, good question. So today, the thing that the git client calls the clone command is actually an implementation of init plus fetch. It's then followed
by another call to fetch. So fetch first, and then it's followed by a second call to fetch, and then the extensions that have been added, the cleans and so on. I think what Rishabh is suggesting is that his goal is still, in terms of git operations, just to remove the redundant fetch but persist in calling the extra extensions. The prune today is implemented as a separate call to git, a git prune, or various techniques. Rishabh, did I express that fairly? Do you want to give more clarification?

I think my aim is, and to answer Fran's question, maybe: if we are performing git init plus git fetch, and if we are adding something that is already being done in the next fetch, why would it be a performance concern? I cannot understand that.

Yeah, the hope, and again, there are subtleties in command-line git that I don't expect, but I think your logic is fair, Rishabh: it probably won't affect performance, and we should measure it just to be sure. Maybe, yeah, I can create a benchmark to test what Fran is saying. I think that would be a good thing; more practice for benchmarking is great. So yeah, I'll do that. And yes, I think the main aim is to avoid the second fetch and not lose any kind of use cases, any possible use cases we have; to do it the safest way possible.

After this, I think the most necessary thing is creating automated test cases, the before and after scenarios. What I was thinking was, since I have the fix branch, I was thinking of cutting branches from that particular branch for each of the additions I'm going to do, rather than doing it all in a single branch. The reason is so that we can discuss the changes individually; for example, Fran has a concern related to, maybe, pruning or some other option, so we can discuss them
individually, and I'll have individual test cases for each of the changes in the clone command. So I was thinking of doing that; that was my plan.

I like that, and I like that very much: incremental sounds really good. I'm really grateful for what you've discovered, because it was a subtlety that I had completely missed in my code review. I simply could not explain why the test was failing with your initial change. It's like, what do you mean? All we did was take out one fetch; it can't possibly have stopped cleaning. And yet what your analysis showed was: oh yes, it did stop cleaning, and it stopped cleaning because the extension is not registered in the first call; it's registered only in the second call.

Yeah, so I think it was great that we already had good unit tests for it. If we did not have the unit tests, I would never have discovered it. I'd say it differently: it's pathetic that we only had one test that caught that failure, when there were three or more extensions that should have caught it. You found at least three cases that should have been detected, and only one of them failed. Yeah, I also had the question of why pruning stale branches or pruning stale tags did not fail; we should have tests for those, right? Right. I would add them while I'm creating those changes. Excellent.

So I also wanted to discuss one more thing. I think we're almost at time, so I would ask the other mentors: would you be willing to remain online? This feels like a very effective session; I've got at least another 30 minutes that I could give. If other mentors are available, we'd be delighted to have you stay, because it feels like we're on a very productive discussion topic. Good for me. Okay, great, let's continue; let's take up to another 30 minutes. Be right back. Okay.

Okay, Mark, so the next thing I wanted to discuss was not related to this particular issue; it was related to the design document, and why I could not, as of
this moment, have a consolidated design document. During the community bonding period we were discovering things, so I felt the document was more of an investigation than a consolidated design. Now that we actually have a benchmark running in the Jenkins CI, and some discoveries related to this issue, I think I have a lot to put in the design document: all the discoveries and results we have related to the benchmarks and the profiling. I just wanted to say that I'm going to consolidate all of that and release the document within one or two days, so that people can look at the approach, so mentors can look at the approach I've taken for the benchmarks and for the things I'm doing here. Because we have seen the benchmark results, but the technique, the way I'm doing things, I think there's a need for us to review that. So that was one of the concerns.

The other is that I have to work on implementing the performance improvements as an opt-in behavior; that's something I have to figure out a way to do. And another thing I have to do is testing the geometry realizer plugin; that's something I still haven't tried, so I'm going to do that this week.

And since we have more time, I'm just going to look at my notes for what we can discuss. So, I ran a benchmark to show the performance when we are using the double fetch and when we are not, and I showed this result to Mark during my proposal discussions; not sure if you remember it, Mark. So, how I created this benchmark, let's first discuss that. I have four tests. The initial two tests: the first one is just calling one fetch command, and the second one is also calling one fetch command; it's a kind of baseline test to create a reference point. With the first one I'm just
fetching the master branch, and with the second one I am fetching all the branches, so it's a kind of narrow refspec versus wide refspec comparison. After that, the next two tests highlight how much time it costs us if we use the fetch operation twice. That is done with honor refspec, meaning the first fetch call is narrow in that test and the second is wide; in the other test both are wide, because we're not honoring the refspec for the initial clone. That's the option we have, right?

So the tests, I think, clearly show that once we're using the double fetch call, the execution time is almost doubled in every case. Though with large repositories it should have doubled considerably, because as far as I remember, Mark told me that with small repositories this is not a concern, but with large repositories the double fetch issue is a much bigger concern. This benchmark I created while I was writing my proposal, so maybe there might be some issue with the way I've created the benchmarks, such that we're not able to see the differences.

I'm also skeptical about JVM optimization: if I'm performing the same operation twice, would the JVM optimize it, so that when we're measuring the second call, it would reduce the time for that call? To that question, I would assume that JMH is doing its very best to give you preheated environments, so that the JVM is already as cached, as optimized as it's going to be for the first run, and therefore the first run and the second run would have comparable results. I think your earlier numbers supported that as well: the preheat phase seemed to be enough that the actual runs were quite stable. But the only concern I
have is with the large repositories: the time should double, or something like that, but it's not, actually.

I would not have expected it to double on large repositories. Git should be quite efficient at realizing there's no work to do on the second fetch. So the second fetch should actually have been a no-op. But it's not a no-op, and the fact that it's not a no-op is what has astonished a number of our users: wow, you asked to fetch again, and I had to spend another minute getting that fetch on my big repository. Omkar, go ahead, you said you had a question.

Yep. Rishabh, you told us that in the first one, when you did the clone, it just pulled the master branch, right? And in the second one it pulled all the existing branches, right? Yes. So how many branches were there in total, just for information? Okay, so I'm testing these benchmarks on four repositories, and with each repository the number of branches increases. For the first one it's just one branch; for the second one, as far as I remember, it's around six; then it's around 10 to 15 for the third one; and the last one has the most, I don't remember exactly the number of branches. I think I can show you the number of branches for each of the repositories.

Okay, so one more thing. What I suspect is that one of the factors is definitely the size of the repo, and the second factor can be the differences there are between the branches: like one branch has 100 commits and the second branch has 200 different commits. Pulling that together will cost git a lot. Whereas if one repo is heavy enough but there are no differences between the branches, that will be, I think, lightweight, and git might be optimizing it. Okay, so what you're saying is that I also have to consider the commit history, the structure of the repository, as well. Maybe, yep.
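Returning to the JVM-optimization worry above: JMH's annotation-driven warm-up is what makes the first and second measured calls comparable. A rough skeleton of such a benchmark class might look like the following; it requires the org.openjdk.jmh dependency, and the class name, method names, and iteration counts are illustrative placeholders, not the project's actual benchmark code:

```java
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

// Sketch only: each fork gets dedicated warm-up iterations before
// measurement, so JIT compilation and caches settle before timing starts.
@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 3)        // discarded runs that let the JVM heat up
@Measurement(iterations = 5)   // only these iterations are reported
@Fork(1)                       // fresh JVM so earlier benchmarks can't leak heat
public class RedundantFetchBenchmark {

    @Benchmark
    public void fetchOnce() {
        // would run the single-fetch checkout here
    }

    @Benchmark
    public void fetchTwice() {
        // would run the clone-plus-redundant-fetch checkout here
    }
}
```

With this shape, a faster second fetch in the results reflects git's own behavior rather than JVM warm-up effects.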
Yeah, that might be one of the factors. And when I was creating the benchmarks, that was the initial assumption, so I took very structured repositories: when I'm increasing the size, I'm also increasing the number of commits and the number of branches. Actually, if I can open the design doc... let me just find the link to the document. Yeah, okay, I have it here; I'll show you the structure of the repositories. Okay. Yeah, so these are the four repositories, and this is the structure: the number of branches increases considerably as the size increases, and the number of commits increases as well. Okay, I think we can take that factor into consideration: in the second one you have 14 branches, so we can create some differences in the commits in those 14 branches and try once.

Well, so I'm less concerned about variation there. I think you've got good sanity checks here, Rishabh. Yes, there may be differences, but whatever set of differences we choose, there are going to be examples out in the real world which are radically different from that set. So I'm sort of disagreeing in this case with Omkar and saying I wouldn't spend your time right now worrying about varying the number of branches. Rather, we accept that we have empirical data from users saying the second fetch is not free, and now we keep measuring to see what we see in our benchmark environment: is the second fetch free? Omkar, are you okay with that as my shameless argument, my shameless proposal?

That is totally fine. What I was considering is that he was saying he didn't notice the significant difference he had expected in those graphs, so for generating that difference intentionally, I was suggesting this as one of the factors that could impact it, right?

Right. And Rishabh, if we get to the point where we want to evaluate Omkar's, the permutation he's
describing, I actually have a poster child, a near-perfect example repository that I use all the time, called jenkins-bugs. It's filled with 150 or more independent branches that all evolve independently, and it is a complete mess in terms of its content, and it's intentionally a mess. So we have a repository, and it's about 40 or 50 megabytes right now, so it's in the middle: between coreutils and chiral in terms of its current size, and it just keeps growing. Okay, jenkins-bugs, that is right. Do not do anything with it right now, but if we get to the point where we need to evaluate something that is pathological, that is really terrible, I have a really terrible example. We will do that one before we do the Linux kernel. All right, the Linux kernel is even further down than that. Okay.

Okay, Mark, so I think after this, what I would like to say is that the tasks I would perform after this discussion are: I would raise three different PRs for the addition of the behaviors I'm going to, not shift, but copy from the fetch command to the clone command and the clone APIs, with the automated tests included in the PRs. Apart from that, I would have a benchmark to test the differences in performance Fran was talking about. So for the next week, those are the two tasks I have in terms of coding; the third I can possibly have is researching how I can implement the opt-in feature for the performance improvements we're going to do. So yeah, any concerns related to the tasks I have?

Yeah, I like those. Would you be interested in presenting to the Platform Special Interest Group a week from tomorrow, with whatever results you have at that point? Sure, sure, Mark, yes. I think I'll have the benchmarks related to this issue, and I think we would have benchmarks related to
the infrastructure, all the benchmarks we have from the infrastructure runs, so maybe I can discuss that. Okay, and if you'd like, I could discuss that, yeah. I would love it if you don't mind giving a 10 or 15 minute, or a 5 or 10 minute, summary of: here's what we've learned since the last time. I think it's been about four weeks since your last talk at the Platform SIG, so it would be a great excuse to share your results up to that point, to highlight progress, to indicate problems, et cetera. Okay. When is that meeting, Mark? It happens each Thursday at 6 a.m. Mountain Time. So, Justin, it's almost unbearably early for you, it's 5 a.m. your local time, and I'm sorry for that, but that's when the contributors to that particular special interest group could meet. We do it very early so that we fit the European schedule, and one of our colleagues from Broadcom who really likes to get up early in the morning, even though he lives in Phoenix, Arizona. Time zones are hard, right? I get it.

Okay, so I think that's it from my side; that is what I wanted to discuss. Also, I'm going to look into the conditional logic in the Jenkinsfile, the thing we discussed; that's something I'm going to look at first. Well, and Rishabh, that one has a sense of urgency for me, because I'm preparing the next release of the git client plugin, and there are times when it's slowing down my progress. So if that one could get more time from you, I would be very grateful. I'll do it first. Thank you.

Yeah, okay. So, a short meeting, I think. Unless the other mentors have other questions or comments, I think we can call this done. No, I mean, I think you did a great job with your analysis. If you need help with the conditional logic, like Mark said, reach out to folks. That was great. Okay, okay, ciao. All right, thanks, everybody. Bye.