I'm Mark Waite. We're recording; this is a Google Summer of Code office hours session for the Jenkins Git plugins. Thanks very much for being here. We want the bulk of this session to be question and answer: you ask a question, I'll give an answer, or I'll declare that I don't know the answer, and that's okay too. Before we get to the questions, each of us will introduce ourselves: name, where you're from, where you are in school, those things, so that we get to know each other. Then I will do a brief overview of the two project ideas. Those are not the only things that could be project plans, but they are the two that I had in mind. Then we can talk about questions that you have. Let's go first. I'm Mark Waite. I am a Jenkins contributor. I maintain the Git plugin and the Git client plugin and have done so for a number of years, so you understand my biases. I started maintaining the Git plugin because I felt a little grumpy about other people breaking the Git plugin with their changes. The way I got involved was I started writing a bunch of tests and submitting pull requests, then wrote more tests and submitted more pull requests, until the current maintainers said, hey, you know, we're kind of tired of this, should we just make him a maintainer? And they did, and so I became a maintainer, and as I kept maintaining, they faded into the background to do other things, until I was the primary maintainer. And yes, I'm still fixated on tests. I still care very deeply about not breaking things, and that means that I will bias towards not taking changes rather than taking changes that don't have tests.
If a change is proposed and it has no tests, then unless it's a very compelling change, I'm unlikely to write the tests myself, because I expect an author to write the tests as part of the exercise of writing the change. I've spent some years in programming, but I spent about 20 or 25 years managing, so I'm also imperfect and still learning about programming. So don't be surprised if you learn something and I learn something in the interaction; that's perfectly fine. I've shaken my head personally, dismayed at some things. For instance, Rishabh had submitted something, and I realized, oh, this is a longtime bug that I had left in the tests and had not fixed. It's really embarrassing, but I'm going to be embarrassed and just admit I made a mistake. So let's go ahead. Rishabh, do you want to introduce yourself next? Uh-oh, your video's off and I don't have a microphone for you, so let's go on. And I don't know how to pronounce your names. Yash? Yeah, you're correct. Go ahead, could you do the next one, please?

So, my name is Yash Jain, and currently I'm pursuing my master's at San Diego State University. I'm from India, but currently I'm in San Diego, in the U.S. I usually get asked why I decided to pursue a master's: I was looking to increase my knowledge, so I came here for a different set of new challenges. It's been good so far. It's been six months for me in the U.S., and I've faced a lot of new challenges that I hadn't faced back in India; that was like six months of a complete rollercoaster for me. So yeah, that's all about me.

Thank you, Yash. Thanks very much, and that's great. I understand the weather in San Diego is pretty much perfect all the time. I have colleagues who live in San Diego, and it's spectacular. Good choice, excellent location. Sumit, would you like to introduce yourself?
Yeah, so I'm Sumit. I'm currently pursuing a bachelor's in control engineering in New Delhi. My degree is in control engineering, but I realized my interest was in computer programming, so I took a lot of electives, tried to build my path towards computer programming, and did internships, and slowly I think I've been able to make that shift. I have been contributing to Jenkins since, I think, December. A few of my changes were in Jenkins core, mostly, and I love the community so far. The community is absolutely amazing; everybody is so welcoming. And I really liked your story about how you started contributing. It was really nice.

Excellent, thank you. Well, so you and I have a shared history in one part then. My degree is actually in mechanical engineering, and as I was graduating, I had to tell the university recruiter I had no desire to do mechanical engineering; I wanted to play with computers for the rest of my life. It turned out that that company wanted someone who wanted to play with computers with a degree in mechanical engineering. A degree in control engineering is a good choice; there are lots of places we do software. Rishabh, do you want to go next?

Hey, yeah. What is happening? I'm sorry, I left the meeting because of an internet connection issue.

We're doing introductions. Tell us your name, your background, where you are in school, where you're physically located, and some of your interests.

So I'm Rishabh Boudhalia, and I've been studying computer science engineering. Basically it's a dual degree, computer science engineering and an MBA, a five-year course at Thapar Institute of Engineering and Technology in Patiala, India. Right now I'm based in Noida, where I recently completed an internship at a big data analytics company. As for my interests: I started open source contribution at the company where I was interning.
I was not aware of open source contribution. I always thought that whatever contributions we made were either a client request for the company or for the benefit of the company's software. But one of the mentors I had at the company motivated me to contribute one of the features we had in-house to the open source software we were using as well. That was really the moment where I started to understand the benefit: the way I code changed drastically, because I learned how the open source community helps you improve the way you code and your style of coding. So that was the point I started open source contribution. In February, I started contributing to Jenkins, in the Git plugin and Git client plugin. I started associating with you, and I've been really impressed with the Jenkins community, the way they've been teaching me and the way you've been helping me contribute to this community.

Great, thank you. Thanks very much. All right. So I think the next topic should be a quick review of the project ideas. There are two, and I'm going to share my screen for this because I think it may help to remind me and remind everybody: here are the things that we had in mind. Let me find the Google Summer of Code page, bring it onto my screen, and share it. While I'm getting that brought up, one of the reminders that Oleg Nenashev suggested to me was to be sure that all student submissions provide all the sections that are mandated by the Google Summer of Code outline. Be very careful as you're preparing your proposals. Rishabh, I haven't reviewed your proposal yet to look for that specifically, but I will be looking for it within the next one or two days.
But be sure it includes all the sections, because they expect every section to be there, that you've read the details, and that you've followed the description rigorously. Don't miss the opportunity to use those hints. And as a reviewer, this is my first year being a mentor, so I will tend to make mistakes as well. You'll have to watch for yourself to assure that your proposal is in as good a condition as possible as you're getting it ready.

Okay. I did follow the guideline, but I will check the proposal again.

Excellent, that's what I would hope; that's great to hear. So I'm going to share my screen now. Here we go. The two project ideas that I had offered were Git plugin performance improvements and Git repository caching on agents. Let's look at plugin performance improvements first. The idea here is that the Git client plugin has two implementations inside of it: one that uses command-line Git, and one that uses JGit, a Java-based implementation of Git. Command-line Git must be invoked from Java by forking a new process: creating the process, starting it, communicating with it, and getting its information back. And on Windows especially, there have been times in the past, at least, where the cost of starting and running a process has been quite a bit higher than on Linux. So my thought was there may be real benefits in the cases where the cost of the call to command-line Git is overshadowed by the cost of starting the process; JGit might be faster there. And so the idea was to take the Java Microbenchmark Harness (JMH) and use it to compare the command-line Git implementation and the JGit implementation.
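The process-startup cost described above can be seen even without JMH. The sketch below, which just repeats a trivial git operation in separate processes, is not a rigorous benchmark (JMH exists precisely to handle warmup, iterations, and statistics properly); it only illustrates the fork-and-exec pattern that a CLI-based implementation pays for on every call and that an in-process JGit call avoids:

```shell
# Each iteration below pays a full process fork+exec, which is the overhead
# an in-process JGit call avoids. Run this under `time` (or, properly, port
# the comparison to a JMH benchmark) to watch the per-process cost add up.
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial commit"

# 50 separate git processes, each doing one trivial operation:
for i in $(seq 50); do
  git -C "$repo" rev-parse HEAD >/dev/null
done
```

On Linux the loop is usually fast; the interesting comparison is the same loop on Windows, where process creation is historically more expensive, which is exactly the case where JGit may win for small operations.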
Now, in addition to that, there's one big, glaring, known issue: we've got a case where we do two fetch calls when it should be sufficient to do one call to git fetch. That was the first example, but it's not quite microbenchmarking; it's just an optimization that needs to be done. The project page also talks about performance comparison using JMH, some quick-start ideas, some newbie-friendly issues that are available, and who we are. So before we do any more: are there questions on this one specifically that you'd like to ask and discuss before we go to the next one?

I actually have some questions, but they are related to my proposal; I've proposed solutions and implementations. Would you like to discuss that right now? I think it's going to be maybe a 10- or 15-minute discussion, so I don't want to overshadow other people.

That's a fair question. I would like to get to your questions, Rishabh, but I would offer: let's look at the high level before we talk about specific details. We will get to your questions, absolutely; I think that's the right thing to do for this session, but implementation details I wasn't sure we should do here before I review the other project. Okay. So Yash, no questions from you on this particular topic? No. Okay, great.

All right, then let me do the next overview and then we'll take open questions on both of them. Git repository caching on agents starts from the realization that a single workspace on a Jenkins agent is probably mostly a copy of other workspaces that are on the same agent, particularly with multibranch jobs, for instance, where we use multibranch jobs to test the Jenkins Git plugin.
The master branch, the stable 3.x branch, and every pull request are all derived from the same basic repository, and cloning that full copy of the repository for every workspace seems wasteful. There are things we might do to avoid a full clone every time we need one, because there are probably existing copies of that repository somewhere on that disk already that we could use as a reference repository. When Git uses a reference repository, it allows the local copy to be updated from local objects instead of always copying the objects over the network from the remote. There's a pull request that was proposed a few years ago, pull request 502, which offers one variant of this, but I think there are several different ways this could be done. It could be done, for instance, by following a technique where we always cache things on the local agent in some central cache. It could be done by looking on the local agent, through an index of repositories that are in workspaces, for an existing workspace to reuse. There are several different ways this one could be done, and your ideas are welcomed and encouraged.

All right, so we've finished the introductions. Now I would propose, let's go with questions on the projects. Yash, would you like to go first? And I'm going to stop sharing so that we can see each other. Go ahead, Yash.

So, going through and doing a bit of analysis on the code base and checking out the Mercurial plugin: what I see has been done there is using a central cache on the master and using that as a mechanism to update all the local copies. So I was thinking, if we have a reference repository, like the clone feature, would there be overlap between both of these?
Would there be a sort of overlap between these two functionalities: using the reference repository in advanced clone behavior, and having this caching functionality as well?

Well, do you see any overlap happening? There certainly is overlap potential, and those are two very good concepts. The Mercurial plugin: I love its concept, and the person who wrote it is brilliant. Jesse Glick's work on it is absolutely brilliant. The concept that came to me was: if we use the cache on the master and copy it to the agent, the agent and the master are probably much nearer to each other than the agent is to the central Git repository, right? So there's benefit there. However, the copy from master to agent is probably still not as fast as a copy on the local disk of the agent. So the thought I had was, okay, we might use the copy from master to agent as the beginning: prime the cache by going from master to agent, then ask the remote repository, okay, give me the real objects, and that will populate them. So one approach would be master to agent, and then go ask the remote repository to give us the latest objects.

So in that case, if we have, say, 15 jobs on the same machine, then we'd be making 15 different remote requests to our remote repo. Instead, I was more keen on following the model of how the Mercurial plugin has done it: updating the master first. If you see a request that has changes and needs a new copy, it would update the master cache first, then update the local cache in parallel and use it from there. Then all the local agents that have that cache would have been updated.
So the rest of the 14 jobs can use that cache instead of making another request. I was wondering if there would be any concurrency issues hidden behind that which I'm not able to visualize.

I like the way you're thinking. Let me see if I can say it back to you to be sure that I've understood it. I think your vision was: if there are many jobs running on an agent, they would each make a request to the central cache on the master, saying, I want this repository at this location, but they're actually asking the master. The master then performs the request to the actual remote repository and then delivers to the many requesters. That seems viable. You would then have a single reader from the master to the remote repository that populates the cache on the master, and many readers on the agents. And since there's only one real repository, the master repository off on GitHub or Bitbucket, that seems like a very reasonable way to consider approaching it.

But I was hoping that this doesn't break the underlying architecture, because if we look at how the reference repository is used in the existing advanced clone behavior, we already have the functionality of giving a reference repository on every agent. So I was hoping it doesn't break this, or overlap with it somewhere.

I don't think it would break it, because the reference repository concept is built right into Git, and Git itself handles those reference repositories; we're just using a facility that Git already has. Now, it may not give you the maximum disk savings or the maximum data transfer savings if you don't use a reference repository on the agent.
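The reference-repository mechanism being discussed is native Git, so a small sketch can show the whole idea end to end. All paths below are throwaway temp directories standing in for the real remote, the agent-side cache, and a job workspace; none of them reflect actual Jenkins layout:

```shell
# Sketch of Git's native reference-repository mechanism: a workspace clone
# borrows objects from a cache on the same disk instead of re-fetching them.
work=$(mktemp -d)

# Stand-in for the central repository (GitHub/Bitbucket):
git init -q --bare "$work/remote.git"
src=$(mktemp -d)
git -C "$src" init -q
git -C "$src" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "seed commit"
git -C "$src" push -q "$work/remote.git" HEAD:refs/heads/master

# The agent-side cache: a full local copy of the remote.
git clone -q --mirror "$work/remote.git" "$work/cache.git"

# A workspace clone that references the cache: objects already present in
# cache.git are linked locally instead of copied over the "network".
git clone -q --reference "$work/cache.git" "$work/remote.git" "$work/workspace"

# Git records the borrowed object store in the alternates file:
cat "$work/workspace/.git/objects/info/alternates"
```

The design point Mark makes follows directly from this: because the borrowing happens inside Git itself via the alternates file, a plugin-level cache and an agent-side reference repository compose rather than conflict.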
So the one-to-one-to-many that you described, where Git repository to master was one chain of requests, followed by many requests to the agent, will tend to copy all of the objects many times to the agent. You could consider one-to-one-to-one-to-many, where you say: Git repository to master is one request; master to agent cache repository is one request; and then those agent jobs use a reference repository on the agent. That further minimizes the amount of data stored on the local agent and the amount of network traffic between the master and the agent. Now, I don't know at what point you'd say, wow, that's too complicated, I'm not going to bother, because your concern about concurrency issues is valid. One of PR 502's blocking points was that when I did interactive testing with it, it had concurrency issues that I did not know how to solve. All I saw was that it had concurrency problems, and those problems were serious enough that I would not release it to production. And that was just me; I know how to stress the thing, but I am certainly not a thousand people using it. I was just one person running a set of tests, and I was able to show concurrency problems in that PR just by my testing.

So I was wondering what would make a good proposal: laying out this architecture of remote to master, master to agent, then agent to the other job instances running in parallel on that particular agent; laying out the architecture, or writing some template code to display what you're trying to achieve.
So for me, architecturally, this one might fit with the way the Git plugin checks for updates on the central repository, the GitHub or Bitbucket repository. It uses git ls-remote, naturally, as its technique to check, or it will check with a local workspace. That happens on the master, not on the agents, so that method call might be the exact place to start with a small prototype that says: if I detect a change, I'm going to pull that change into a local cache on the master. That might be one approach: git ls-remote is the command; find the places where git ls-remote is called, where a remote listing is done. The other matching architectural concept is that the multibranch pipeline in the Git plugin already has a concept of caches on the master. So between ls-remote and the caches that are already there on the master, it may be that you'd say: every time the master does an ls-remote, it should consider whether it needs to pull down the recent changes instead of just doing the ls-remote. That might be one approach: somebody polled the remote and changes were detected, so bring them in.

Yeah. So then we'd need to communicate that change, which right now we don't have at the agent level.

Because the master invokes all of the tasks on agents, I would suspect what you would want to do is know where the caches are on the master and tell the agent: ask this location on the master for a clone first. So when the master is about to send a request to the agent which says, go perform this fetch, the instruction would still be go perform this fetch, but instead of fetching from the actual central repository, fetch from the master.
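The ls-remote polling step mentioned above is cheap precisely because it transfers no objects. A minimal sketch, using a throwaway local repository as a stand-in for the real remote:

```shell
# Sketch of the cheap polling step: `git ls-remote` lists a remote's refs
# and the commits they point at without cloning or fetching any objects.
remote=$(mktemp -d)
git -C "$remote" init -q
git -C "$remote" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "seed commit"

# No clone, no fetch: just "<hash><TAB><refname>" lines.
git ls-remote "$remote"
```

A cache on the master could compare these hashes against the refs it already holds and fetch only when a ref has moved, which is the "detect a change, then pull it into the local cache" prototype described here.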
What I had in my mind was: how do we decide what to do if the master has an initial copy that was made as a shallow fetch? Would that cause any issue down the line?

So for the master caches, we would want to safety check that they are in fact full copies, right? We don't want a narrow refspec, and we don't want a shallow clone. It's not a checkout, so we don't have to worry about sparse checkout. But if it's shallow, then one of the actions should probably be to deepen it, right? Switch it from being shallow to not shallow. And if it's using a narrow refspec, we probably should, on the master, widen the refspec, even if the user specified a narrow one.

Okay. Another option that I had in mind was to make a bare repository at a location given by the user that would be used as the single source of references for any operation. That would override the sparse checkout, shallow copy, whatever the user had requested; it would clone the complete repository on the master at a particular location given by the user, and all the agents and everybody would know that this would always be the single point of source where they can pull to get the updates, instead of having it defined randomly or generated somewhere in the workspace directory of the master.

That's certainly also possible. However, and I could be wrong on this because that section of the code was actually implemented by Stephen Connolly with input from Jesse Glick, so I'm not as fluent in the multibranch code as I'd like to be, I think what you're describing is only a slight variation on the caches that are already in the Git plugin, sitting on the master.
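The two safety checks discussed above, deepening a shallow cache and widening a narrowed refspec, map onto concrete Git commands. A sketch, with throwaway temp repositories standing in for the real ones:

```shell
# Sketch of the two cache safety checks: (1) deepen a shallow cache into a
# full history, (2) widen a narrowed fetch refspec so the cache sees every
# branch, regardless of what the user's job configuration asked for.
work=$(mktemp -d)
git init -q "$work/src"
git -C "$work/src" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "first"
git -C "$work/src" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "second"

# A shallow cache holds only the most recent commit:
git clone -q --depth 1 "file://$work/src" "$work/cache"
git -C "$work/cache" rev-list --count HEAD    # prints 1

# Check 1: if the cache is shallow (.git/shallow exists), deepen it.
if [ -f "$work/cache/.git/shallow" ]; then
  git -C "$work/cache" fetch -q --unshallow
fi
git -C "$work/cache" rev-list --count HEAD    # prints 2

# Check 2: force a wide fetch refspec, even if a job configured a narrow one.
git -C "$work/cache" config remote.origin.fetch \
    '+refs/heads/*:refs/remotes/origin/*'
```

The `file://` URL matters in the sketch: plain local-path clones ignore `--depth`, so the transport form is needed to actually produce a shallow copy to deepen.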
So what you just described would just be changing the storage location of those caches that are already there. I'm not sure that will give you the gain you're hoping for; it's just altering the name of the cache directory.

I had this in mind to avoid the shallow copy and sparse checkout, but okay.

Right, and your point is valid. You may want to put that into the plan, saying: the intent is to safety check that the cached copies are not shallow and do not have a narrow refspec. And if it turns out you confirm that the depth and/or the refspec are under user control, you may have to say: I've got to create a new cache concept, a new cache on the master which is not under user control; users don't get to narrow the refspec, and they don't get to make it shallow. And then there's your idea of asking the user where to put it; alternatively, you could just create a new cache and give it a new directory name on the Jenkins master.

Yeah. So we have one level of check here that first checks whether we have a narrow refspec or something like that. Okay.

Rishabh? I have a question related to the discussion. When you said, Mark, that we could have a different cache which the user doesn't know about, wouldn't that create problems for users who have large repositories? Because the cache would keep growing and they would not know what is occupying their disk space. And if we're not maintaining the cache optimally, then it would probably become a problem of overuse of disk space for the user. Or is that a wrong assumption?

No, your assumption is valid; it certainly is a valid concern.
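A plugin-owned cache of the kind being discussed would plausibly be a bare mirror clone, which holds every ref but no working tree, so its disk cost is the object store alone. A sketch; the `$JENKINS_HOME/git-cache` location is made up for illustration and is not an actual Git plugin directory:

```shell
# Sketch of a plugin-owned cache directory as a bare mirror clone.
# "$JENKINS_HOME/git-cache" is a hypothetical location, not real layout.
work=$(mktemp -d)
JENKINS_HOME="$work/jenkins_home"   # stand-in for the real Jenkins home
git init -q "$work/remote"
git -C "$work/remote" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "seed commit"

# Create the cache once: bare + mirror (all refs, no checkout).
mkdir -p "$JENKINS_HOME/git-cache"
git clone -q --mirror "$work/remote" "$JENKINS_HOME/git-cache/repo.git"

# On each later poll, one command refreshes the cache for every consumer:
git -C "$JENKINS_HOME/git-cache/repo.git" remote update --prune >/dev/null
```

This sketch deliberately leaves out the two hard parts raised in the conversation: size accounting or eviction to address the disk-space concern, and locking around the `remote update` so concurrent jobs don't trip over each other, which is exactly the class of problem that blocked PR 502.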
A Jenkins master is quite commonly a large consumer of disk space, and that's an accepted reality. Now, let's take the example I had with a private employer. We didn't like it, we weren't terribly proud of it, but we had a Git repository that was 20 gigabytes. And with a 20-gigabyte Git repository, it mattered a lot where we put that thing, right? We had to make sure we never cloned it at anything other than shallow, we never did anything but use a narrow refspec, and it created all sorts of disk space limitations for us. So yes, we do have to be aware of that. As an example, the base test repository for the repository caching should probably be the Linux kernel. It's an excellent choice, and it starts at a gigabyte, and it's got great history. You figure Linus, when he created Git, created it so he could do kernel work, and the Linux kernel has one of the highest volumes of change of any project anywhere in the world. So the Linux kernel is the poster child of big repositories with lots and lots of commits. It reminds us that it's not unreasonable to have a one- or two-gigabyte repository maintained by people who are very serious about using Git. Now, with the 20-gig repository that I had, we were actually not serious about using Git; we had people checking large binaries into it. Linus does not typically check large binaries into the kernel repository; the Linux kernel is big just because it's got a lot of changes.

Okay. So I would like to share my screen to discuss the implementation. Is that okay with everyone? Yes, please, go ahead. So the first thing I would like to discuss... oh, one second, I'm not sure why it's hanging. Yeah.
So the first question is about what I tried with benchmarking: I want to offer it as an option, an additional behavior, to the Git plugin user. The rationale behind this decision is that the Git plugin and Git client plugin both have a very broad audience, and for whatever performance changes we make as a result of the performance tests, we possibly cannot create test cases for every end-user use case, precisely because the audience is so broad. So what I thought was that initially we could have it as an additional behavior; I've created an implementation and a prototype. And after maybe a release or two, once we know from user feedback that replacing command-line Git with JGit is actually giving them a considerable boost in performance, then maybe we could shift it from an additional behavior to something which happens internally, which they don't know about, as a mandatory feature. It makes sense that a performance improvement should not be a concern for the user, but since this is a GSoC project and I, the person implementing the changes, don't have considerable experience with this plugin, I would want to have it as a safe feature first and then maybe consolidate it after some user feedback. I'm not sure if this is the right approach.

You described brilliantly the conservative approach that I've taken. I don't like introducing new features without an escape hatch; there must be a way to get out of the thing if I've made some mistake in the implementation, and a way to understand it. This is very good. There are also techniques in the community that could allow us to gather data from users, if you're interested in that.
There are ways to actually do telemetry, where users could optionally report back to us automatically, for a fixed period, on their experience. So your technique is not only good, it's very well suited.

Okay, so I would just briefly explain how I'm going to implement the additional behavior. It's going to be a GitSCMExtension class, an implementation which I call the performance improvement option. What it will basically do is decorate the global environment variables: basically add one flag, a Git performance Boolean flag. If a user chooses to enable performance, then wherever we have modified the code based on the results of our benchmarking study, we will have checks: if this flag is true, we'll select the implementation which is performing better according to us, and if the Boolean says no, then we'll use the default code path. It's basically how every decorate function works. And I've shown, in four short steps, how I'm implementing it; once you review the proposal, you'll see the implementation in more detail. Do you think this is the right way to do it, or would you have some concerns or criticism regarding this approach?

I think this is a fine way to do it. I'm not sure I would use an environment variable, because the mere existence of the decorator in this case is already a flag. I believe there are other decorators already implemented, like wipe workspace or shallow clone, which don't use an environment variable to record the state of the decorator, the true-or-falseness of it; they just use an internal variable.
Was there something specifically motivating you to use the environment variable? Is it that you want it available to the user inside the workspace when they're running a shell step? Tell me more about the environment variable choice.

Okay, I think it was the first thing which came into my mind; I thought it was the easiest thing to do. I thought that this variable would be shared everywhere, so I would not have issues. Of course, if I had an internal variable, that would also be shared everywhere, so I wouldn't have that problem. But at the time when I was thinking about the solution, I thought that I could access the environment variable everywhere: wherever I need a check to selectively use JGit or command-line Git, I would just access the environment variable and check the flag.

I see, so it's a way of transmitting that state without making every place that references it aware of the specific extension.
Okay. I'm not sure that the code you insert will be any less aware of the git performance flag than it would be of an internal variable, a field on the class. Either way, every time you want to ask a question of the extension, you're going to have to check whether the extension is there, so I think the overhead for you will be about the same. In this case, storing it as an environment variable does have the benefit that inside the shell steps of a pipeline, or inside the job steps of a freestyle job, the variable would be available, and users could see whether it was switching back and forth between JGit and git. So there is some communication-with-the-user benefit in choosing an environment variable. I'm not sure that's healthy, because we may not want the user to know which one we're using; we may want to hide that from them. But I've often made the mistake of thinking I should hide things from users, and later come back and made them visible.

So would you recommend using an internal variable instead of this approach with the environment variable?

My personal tendency would have been to use an internal variable inside the decorator instead of an environment variable, because my first thought was not towards telling the user about it, but rather towards being sure the code can get to it easily, because then it's a getter and a setter on the decorator. And again, you'll get a much better answer during the implementation phase than I can offer right now, because I'm just offering my guess. A plan does not require that you have the code; if I understand correctly, a plan does not require that you're proposing final code. Rather, it's proof that you've thought carefully about these steps, enough to realize, oh, I could do it this way or that way. It doesn't have to be one thing or the other; it's rather, here are my proposed steps.

Okay, that's good. The second thing I would like to ask: I think you'll see it once you read the proposal, because I performed performance benchmarking on git fetch, and I'm not sure the results I'm getting are correct. I used a 320 MB repository for git fetch, compared CLI git with JGit, and saw more than one minute's difference in execution time. Is that correct?

It is quite believable. Consider the investment that has been made in the code that is git, versus JGit. The community behind git is dramatically larger than the community behind JGit; there's no kidding ourselves that they are not very differently sized communities. The community behind git includes people who have worked on Linux file systems, at the kernel level, for a very long time, and Linus chose things in the initial implementation that were in fact very well tuned to the Linux file system. So if command-line git is dramatically faster than JGit, I am not the least surprised, particularly on large repositories. Now, if it's the other direction, where you find that on large repositories JGit is significantly faster than command-line git, that is a very interesting result, and I would wonder what could motivate it, because the people who work on git are so committed to fast that it's very hard to write a second implementation better than what the git implementers did.

I did find one observation which is interesting to me. When I was performing this git fetch, for repositories whose size is very low, such as a 34 KB repository I have, JGit was performing better than git in terms of average execution time, and for a 4 MB repository I also saw JGit performing better. Since this is kind of anomalous behavior, I wanted to check whether the finding is right. I was a little apprehensive because the JMH framework applies JVM warm-up before it runs the performance benchmark, so I had a doubt whether the warm-up was actually giving JGit this performance boost. To confirm, I tried a different mode of benchmarking, called cold-start benchmarking, which basically means I don't warm up the JVM at all; I start counting the execution time right from the start of the git fetch test I've written. And I found that with no warm-up, JGit was slower; JGit was not faster. So I could only conclude that if the JVM is sufficiently warmed up, JGit can perform faster than git, under the condition that the repository is maybe below 5 MB, or on the order of kilobytes, or one or two megabytes; I'm not sure, I haven't tested it that much. But if the JVM is not warmed up, JGit does not perform better than git. And I'm not sure how, in the real code, I would get to know whether the JVM is warmed up; I would assume the JVM is warmed up by the time I reach the git fetch part. So is it a fair assumption that JGit would perform better than git under a certain repository size?

What you just described is exactly the kind of sensitivity analysis
that I was hoping for, and what you described aligns very well with what I assumed the result would be. Here is what happens in your test: there's a cost to fork a new process, that cost is relatively constant, and on small repositories it may actually dominate the cost of the operation. Now, you used an operation which is a network operation, so it introduces network variability and network slowdown in addition, and yet you still saw that the cost of the fork, and the cost of communicating between processes, was a significant portion of the total cost up to a certain threshold. For me, that may argue that in choosing these tune-ups we should use some form of local estimate of the size of the remote repository.

Yeah, I was also thinking about that. I don't know how you would do that local estimate of sizes.

It seems like if you've got a local copy somewhere in a cache, there's an easy estimate, or you could use heuristics; you can imagine all sorts of guesses as to how big the remote repository is, to tune which implementation to use.

So that might be a feature we could implement: actually check the size, and then enable the better-performing implementation. It cannot be just an environment-variable check; there have to be more checks to enable the implementation that would perform better.

I think so. Now, you did not mention anything about platform as a variable here, and I would assume platform should be included as one of the variables you check for sensitivity.

I agree. Actually, I checked it on two platforms: macOS, my local system, and one EC2 machine, which was a Linux instance. I could not do it on Windows. Maybe before I submit I'll add one more study.

I don't think you need to do it before you submit; just put it into your plan that the plan will do it. And let me offer two more platforms that you should consider as part of your plan: IBM System 390, so IBM mainframe, and IBM PowerPC 64. Maybe a fourth is ARM 64. The reason I'm offering those is that they are three places where the Jenkins platform special interest group is right now looking at adding support, and they have very different characteristics. The IBM mainframe is actually a completely different way of looking at things, so that's very interesting. And ARM 64 is my Raspberry Pi, and its file system has completely different behavior than the file system you have on your Mac, for instance.

But we won't run Jenkins on a Raspberry Pi?

I run Jenkins agents on a Raspberry Pi all the time. Absolutely.

I don't understand. Why would you do that?

If I need to evaluate tests or code that I'm intending to target to a Raspberry Pi, the best way to do it is to run the agent on a Raspberry Pi and run the tests right on the Pi.

Okay, so I'll include these platforms in my proposal.

Yes. Someone I know, my son actually, works for a robotics company, and they use an NVIDIA Jetson, which is an embedded device with a GPU on the board, and I can see very much putting a Jenkins agent on that device.

Would you do that?

Oh yeah, I fully intend to. I haven't hooked that one up with him yet, but I've got several Raspberry Pis already running.

Okay, I didn't know about that. One for me as well, then.

You're on the right track; you're doing exactly the right thing in thinking about what the axes of performance evaluation are, and what the sensitivity is along each axis. Repository size is clearly one. The architecture or operating system of the computer seems like another, particularly with Windows, where the cost to fork a process, at least at one time, was much higher on Windows than on Linux.

Yeah. Apart from platform and repository size, I also wondered: would the JVM parameters affect the performance of JGit?

That's a very good insight. They certainly could; I had not thought of that, and I think it's a valid thing to check. I know that the CloudBees support team has published recommended guidelines, based on their experience, for the best JVM settings for the Jenkins master. So JVM parameters is a very good item to evaluate.

Okay, I have added that as well. The next question I had: the git double-fetch issue is one of the known performance issues, and the fix I provided was basically a flag that avoided the second fetch. Didn't you say that you had additional tests to check that that solution does not create any kind of loss of information? Since I've included the solution in this proposal, would you give me some more pointers so that I can test the solution more thoroughly? Or would you recommend I look for another solution, maybe some kind of argument matcher, basically a class which matches the arguments: we have a clone command initially, and then we check whether we have the same clone command; if we have the same commands, we avoid the second fetch, and if not, we do something else.

I am happy to share the things I created to do my initial testing. It's not complete, and it was more about me doing interactive testing, not automated testing; I have a strong personal bias towards first exploring something interactively and then expressing it as automation. But I'm happy to share the setup I use for my interactive testing, because it's just a Docker image. It is a public GitHub repository that defines a full and complete Jenkins with a number of interesting jobs in it, and some of those jobs exist exactly to be checks for this specific case, the double fetch. I'll post where that is in the Gitter channel so you can see it and see how I do it.

Sure, I'd love to see that Docker image. I think the last thing I want to ask is about the same issue. I also wrote a micro-benchmark test for the solution I implemented for the redundant fetch. I had two baseline tests, each with two git fetches: one with a narrow refspec and one with a wider refspec. Then I removed one git fetch, to show what happens with just one git fetch, again with a narrow refspec and with a wide refspec, and compared those four tests to see whether there is a reasonable increase in performance when we remove one git fetch. What I found was that for three of the repositories I chose, on the order of 4 KB, 4 MB, and I think 40 or 50 MB, removing the second git fetch reduced the execution time by almost 50%. I was a little apprehensive about that: would that actually happen? Initially, when I was looking at the issue, I thought that if I just removed one git fetch I would improve the performance by roughly a factor of two. Is this correct?

Correct is a hard thing to say. Is it what you observed? Yes. Is it believable to me? Yes.
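The argument-matcher idea raised a moment ago, remembering what the previous fetch asked for and skipping an identical repeat, could be sketched roughly as follows. The class and method names are hypothetical illustrations, not git plugin API.

```java
import java.util.List;
import java.util.Objects;

/**
 * Sketch of the "argument matcher" idea: remember what the first fetch
 * asked for, and skip a second fetch whose remote URL and refspecs are
 * identical, since it cannot bring anything new into the workspace.
 * Names here are illustrative, not existing git-plugin classes.
 */
public class FetchDeduplicator {
    private String lastUrl;
    private List<String> lastRefspecs;

    /** Returns true if this (url, refspecs) pair still needs a real fetch. */
    boolean shouldFetch(String url, List<String> refspecs) {
        if (Objects.equals(url, lastUrl) && Objects.equals(refspecs, lastRefspecs)) {
            return false; // identical to the previous fetch: redundant
        }
        lastUrl = url;
        lastRefspecs = refspecs;
        return true;
    }

    public static void main(String[] args) {
        FetchDeduplicator d = new FetchDeduplicator();
        List<String> narrow = List.of("+refs/heads/master:refs/remotes/origin/master");
        System.out.println(d.shouldFetch("https://example.com/repo.git", narrow)); // true
        System.out.println(d.shouldFetch("https://example.com/repo.git", narrow)); // false
    }
}
```

A fetch with a different refspec, for example a wider one, would still go through, so the narrow-versus-wide distinction in the benchmark above is preserved by this check.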
Yeah. But one thing I did notice as well: for the large repository, on the order of 324 MB, the tests did not give me a very remarkable difference between the four tests. That was a little confusing to me, because I think this should not depend on the size of the repository if I'm removing one git fetch.

Oh no, it very much does depend on the size of the repository; it does, and it should. Anyway, I'm not sure I'm reading your graph correctly. On the graph, is it the group of four bars... maybe zoom closer so I can see the axes. Or just describe it; you don't have to zoom in.

The bars are basically the repositories I used.

And each of the groupings is whether there is a redundant fetch or not?

Yeah, the first two tests include double fetches and the last two tests don't have the redundant fetch.

Wait a second. So the topmost rows are without the double fetch and the bottommost rows are with the double...

No, I'll give an example. The first test, the benchmark baseline with the narrow refspec, has both git fetches, and for the first repository, which was 4 KB, the average execution time was 79 milliseconds per operation. For the same narrow refspec with the redundant git fetch removed, I saw an average execution time of 162 milliseconds per operation. Okay, I think I have confused myself here; it's the reverse. The first two tests don't have the second git fetch, they just have one git fetch command, and the last two tests have both git fetch commands. That is why the time is increased: both of those tests have both git fetches.

Got it. So the way I'm interpreting the data is that the topmost group of four and the second group of four are probably both without a redundant fetch, and the next group of four and the bottommost group of four are both with a redundant fetch.

Yeah.

I would have expected the results that are shown for the tiny, small, and medium-sized repositories; let's call those the dark blue, the less blue, and the little-bit-less blue, but not the green one. Those results seem expected, but of course those are the sizes of repositories that aren't the interesting example here; it's the 40 megabyte repository that's getting interesting. And I can't explain why the removal of the redundant fetch did not dramatically improve the... oh no, wait a second, I sure can. Let's discuss. Fetching a 40 megabyte repository is probably dominated, on the first fetch, by data transfer of the objects; it's getting 40 megabytes of data. So it may be that we need you to tune this benchmark, because most fetches into Jenkins git workspaces are not the first fetch. What you measured is the populate case: starting from an empty directory and filling it up. That is an interesting case, but it's not the 90 percent case. Ninety percent of the cases, or 50 percent at least, are incrementally bringing new changes into an existing workspace. So that may be something you need to consider in your proposal: add the additional attribute of first fetch versus updating an existing workspace.

Sure, I'll change that.

Great.

I think those are all the questions I had. Thank you.

All right, thank you. We have run out of time. Sumit, I didn't ask you for any questions. Are there any questions you wanted to ask, Sumit, before we close?
Actually, I was just an observer, because I have another project that I'm focusing on right now. I'm just gathering knowledge, because who knows, something might click for me.

Great. All right, well, thanks everyone. I'm going to end the recording. I will upload the recording to the Jenkins YouTube channel and post a link to it in the Gitter chat. Thanks very much.

Thank you. You too. Bye-bye.