Okay, go ahead. So what I did was I created a Python script that, for a particular organization, fetches all the repository URLs. Then I wrote a simple Java program that runs git ls-remote against each repository, gives me the full reference map, and takes its size, that is, the number of references. I then mapped that against the size of the repository, which I obtained from the GitHub API. That is how I connected the two. The Jenkins organization has more than 1000 repositories, so rather than eyeballing the data, I thought it would be more beneficial to use a metric to find out whether there is a genuine relation between the two things we're looking at: the first variable is the number of references for a particular repository, and the second is the size of the repository. The experiment aims to find out whether or not there is a positive correlation between them. There's a metric for this called the Pearson correlation coefficient. It gives a value between -1 and 1; from what I've read, if it's greater than 0.5, the relation can be considered significant. For the Jenkins organization I got 0.2. Then I tried the Netflix organization; I picked a few others more or less at random. I assumed Netflix, being as widely used as Jenkins, would have large repositories whose references would vary in an interesting way. Same with Facebook; they also have some great open-source products. With all of them, the Pearson correlation came out very low.
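The correlation check described above can be sketched as follows. This is a minimal illustration, not the actual experiment script; the (ref count, size) pairs below are made up, since the real data came from roughly a thousand repositories.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Hypothetical (ref_count, size_in_kb) pairs for illustration only.
refs = [1000, 12, 340, 25, 7, 560]
sizes = [2048, 90000, 1500, 400, 250000, 3000]
print(pearson(refs, sizes))
```

A value of r near 1 (or -1) would indicate a strong linear relationship; values like the 0.2 observed for the Jenkins organization indicate almost none.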
I have also posted a link to the data if you want to see the raw numbers. What I could see was that, for example, a repository of only one or two MB could have a thousand references. That basically says we cannot possibly use this as a heuristic; it would lead to totally inconsistent results. You can look at the raw data if you want, and to make it a little easier, I also plotted the number of references against the size on a graph. This is a scatter plot of all the points for the Jenkins organization, on a log scale rather than a linear one to make it clearer. Two things in this graph tell you the correlation is not significant. The first is the shape of the scatter: if the relationship were linear in any sense, the points would cluster around the fitted line you see here, giving a shape where we could say there is a linear relationship between the two variables. The second is something that is calculated automatically: R-squared, the coefficient of determination. It is similar in spirit to the Pearson correlation coefficient and is measured from zero to one. For our case the value was 0.05, which is far less than what we would want. So from all of these observations I would say we should not use this heuristic; I don't think it would give us any kind of benefit in predicting the size of the repository.
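The R-squared figure mentioned above can be reproduced with a small sketch. For a simple least-squares line fit, R-squared equals the square of Pearson's r, and since the plot used log scales, the fit is done on log-transformed values. The data pairs are again hypothetical.

```python
import math

def r_squared(xs, ys):
    # Coefficient of determination of a least-squares line fit; for
    # simple linear regression it equals the square of Pearson's r.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov * cov / (var_x * var_y)

# Fit on log-transformed values, matching the log-scale scatter plot.
refs = [1000, 12, 340, 25, 7, 560]
sizes = [2048, 90000, 1500, 400, 250000, 3000]
log_refs = [math.log10(v) for v in refs]
log_sizes = [math.log10(v) for v in sizes]
print(r_squared(log_refs, log_sizes))
```

An R-squared of 0.05 means the fitted line explains only about 5% of the variance in repository size, which is why the heuristic was rejected.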
One of the questions I had was: were we actually measuring whether the number of refs influences the speed of fetching? Here you're saying the number of refs would give us some deterministic measure of how big the repo is, but did we actually see whether the number of refs impacted performance in a similar manner? What you're describing comes under the benchmarking experiments, where we analyze whether performance is affected by the structure of the repository or by parameters like the number of refs or the size of the objects. This experiment was constructed solely to find out whether the heuristic we wanted to use would actually approximate the size. From previous benchmark experiments I had also seen that the number of references did not correlate positively with the size, but that was with just four repositories, so I wasn't personally sure. I thought: let's take a large number, say 1000 repositories rather than 100 or 200, and see what the data shows us. That is what this experiment aims to do. What you're describing comes under benchmarking, and I plan to do that; Omkar has already taken a great first step towards it, and we discussed it on the Gitter channel. We will pursue that to find out whether the number of refs has a significant effect on the performance of git fetch. My apologies, I just realized that was something in our next phase as well. But this is fantastic; this is really good data regardless. I should have started with that.
Okay, so that graph you just presented brilliantly answers the question I had: could I see the raw data? You just showed the raw data with that graph, and to me it looks very clear. There is no way we should trust the output of ls-remote as any kind of approximation of the actual size of the repository; there's just not enough relationship between the two to trust that data. Better to have no heuristic using ls-remote than to use one that is wrong so frequently. Excellent, well done. Thank you. So, after finding this on Sunday, the question I had was about the estimator we have now. It has only two possible heuristics. The first is that we look for a cached local repository in the git directory; if we have that, it's the best thing we can have. The second was to expose an extension point for the plugins that provide git SCM services. With those two heuristics, my concern is that both are conditional: it's not guaranteed that we will be able to tell the size of the repository. If the GitHub plugin, which depends on the git plugin, has not implemented our extension, we will not be able to use the GitHub API to determine the size. The API is a sure-shot way to know the size, but if we expose it as an extension, it depends on plugins implementing that extension; only then can we use that heuristic. So whatever heuristics we have are conditional, in the sense that they might work and we might report the size accurately, but there is an equal probability that we will not be able to use the estimator API at all.
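The GitHub API heuristic mentioned above can be sketched roughly like this. The helper names are mine, not the plugin's; the only API fact relied on is that GitHub's repository resource reports a "size" field in kilobytes.

```python
import json
import urllib.request

GITHUB_REPO_API = "https://api.github.com/repos/{owner}/{name}"

def parse_size_kb(payload):
    # GitHub's repository resource reports "size" in kilobytes.
    return json.loads(payload)["size"]

def github_repo_size_kb(owner, name):
    # Live network call; works unauthenticated for public repositories,
    # subject to GitHub's rate limits.
    url = GITHUB_REPO_API.format(owner=owner, name=name)
    with urllib.request.urlopen(url) as resp:
        return parse_size_kb(resp.read().decode("utf-8"))
```

This is exactly the kind of provider-specific call that would sit behind the proposed extension point: it is reliable when available, but only plugins that know the hosting service can implement it.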
That is something I'm actually a little upset about: why doesn't git itself provide anything that doesn't depend on a service provider or the local cache? So I started to look into the git internals, the plumbing commands git provides, and it has been a confusing and long road. I was looking into how git fetch is implemented, and then I had the idea to look into the JGit source code, because they've implemented it all in Java and I know a little Java, so that's the best way for me to read it. I started looking through the classes and found a class called PackConfig. My thought was that we could have a heuristic that estimates the size of the pack object, the compressed objects; if we know that, we have a lower bound on the size of the repository, which is something when you have no size information at all. PackConfig is basically the configuration JGit uses when it is packing objects. But what I saw, and had to admit to myself, is that to use that PackConfig class, JGit basically downloads the packed objects first and assumes they exist in a local repository. That disappointed me, and it leads me to the observation that it might not be possible for us to look at a remote git server, inspect the pack object, and get its size. As far as I can understand how git is written, we are meant to download the pack object first and then do whatever we want with it; it is not possible to look at its size first and then decide whether to download it.
I was also looking at the transfer protocols git provides, but I was not able to figure it out either way. I looked at Stack Overflow and a lot of other internet resources, but people have not found any consolidated way to do this. So I'm actually confused: the heuristic in itself sounds interesting, but I'm not sure it's even possible. I think trying to get the details of the .pack object is somewhat similar to the earlier approach where we tried git count-objects. If it's possible to get remote details with that command, then it would be possible here too; it's worth checking the similarity between the two. For git count-objects, I forget: did we point it at a remote repository or a local one? We were counting the local repository, I think. Then it would not help; we need something that points at the remote repository and gives us the answer. Yes. What I'm trying to say is that there is some similarity between the pack objects that count reports and the .pack you're looking into; if it provided some functionality to look into the remote, that would be similar to what you're looking for. My question is that even git count-objects assumes we have the repository on our local system, but what we want is to look at the remote server and get the size without cloning the repository. So I am not sure this would give us the answer. Yeah, I was taking Omkar's comments to mean that git count-objects is more evidence that what we were thinking we could do is probably not feasible.
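The contrast discussed above, between what we can and cannot ask a remote, can be sketched like this. The helper name is mine; the only git facts relied on are that `git ls-remote` works against a remote URL without cloning, while `git count-objects` only inspects a repository already on local disk.

```python
import subprocess

def count_remote_refs(url):
    # `git ls-remote <url>` talks to a remote without cloning; each
    # output line is "<sha>\t<refname>", so the line count is the
    # number of advertised references.
    out = subprocess.run(
        ["git", "ls-remote", url],
        capture_output=True, text=True, check=True,
    ).stdout
    return len(out.splitlines())

# By contrast, `git count-objects -v` only inspects the object database
# of a repository already present on local disk; there is no remote
# variant, which is why it cannot replace a size query to the server.
```

So ref counts are cheap to get remotely (which is what made them attractive as a heuristic), but object counts and pack sizes are not exposed by the protocol at all.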
Right, there isn't a way to ask a remote repository "give me your size", except through an API call like GitHub's. Git itself had no interest in that, and I can understand why: Linus didn't set out to create a competitor to other source control management systems, he wanted to solve his own problem, so he didn't put things into the protocol he didn't need. So I think that's what happened with the exploration of another heuristic. What I was going to say is, I wonder if that would be a reason to explore this: the first time we pull a git repo, it's going to take whatever it takes. But after that, we would have a determination of the size of that repo, and we could potentially track it over time, whether it has grown or shrunk. That's an optimization on top, but I wonder if that would be a way to go about it, or if you've thought about anything like that. So what I was assuming was that the cache would be that place: if we've cloned the repository once, the local repository would be the cache. I haven't looked into the code enough to be 100% sure, but I assume so. Then I would have a place where I know the size of the repository, and for subsequent uses of the plugin we could optimize the operations. Did I understand you correctly, Justin? Yeah, I think it could be done either with a cache or, sorry, I think Oleg was saying there is maybe a git cache that hangs around on the master, the main instance.
Yeah, so something like that could potentially be a way of doing it, or potentially even working off one of the agents' git repos and determining the size after you pull. I'm not sure how much impact that has on performance either, but something like one of those kinds of approaches. Okay, I can look into it. So what you're saying is that, on a Jenkins agent, I clone the repository just to check its size? No, I'm saying that as a user, when I clone a repository, most of that work happens on the agent, if you have more than just the master, the main instance. So you would use whichever disk has your git repository after you've done your clone. If that cache is reliable, then maybe that's the right way to go; I'm just thinking of read scenarios, and I know very little about the cache, so perhaps Mark knows. So that cache on the master is used by multibranch projects, and users that use multibranch will tend to have those caches already populated. I think it's a good excuse to encourage people: if you're doing this work, we think multibranch is the way to go anyway; use multibranch and you'll already get the benefit of this heuristic. So my hunch is that most of the time the heuristic will be satisfied, because the cache is found. Some users may come to us and say, "oh no, I'm only using a freestyle project." Sorry, you won't get the benefit until you've cloned at least once. And for me the fallback is still command line git, and command line git is the best performer on large repositories anyway, right?
And therefore the places where we can gain a benefit by switching to JGit look like the smaller repositories, where we might be faster, but the total savings from switching from command line git to JGit on small repos is incrementally much, much less; they didn't take long to clone in the first place. Yes, exactly. I just wanted to ask: is this the cache you're talking about locking, the cache in the multibranch project? That's the one. I'm not even sure you need to acquire a lock, because all you want to do is read directory contents, and if the result is inconsistent or imperfect, you just don't care; you're only trying to get a quick approximation. You just get the git cache entry, which tells you a directory, and now you can do file-system-level access to that directory and count up its contents and the size of its contents. Okay, I'll look into it. The last thing, which I already discussed before the meeting officially started, is how I'm creating, not so much designing as cloning, the GitSCMTelescope for our needs: the size estimator class. Before creating the class, I started to look at the SCM API, to see at what level I am going to use the API we are creating and what it should provide. Is it just a boolean, a decision that says use JGit or not, or something more than that? It's mostly exploration right now; I haven't even created a prototype. But I was thinking I should first understand the level above GitSCM, where the builders work, because that's where the class is going to be used, so I might as well understand the overall process as much as I can and then write the prototype.
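The lock-free cache sizing described above can be sketched as follows. This is an illustration of the approach, not the plugin's code: walk the cache directory, sum file sizes, and tolerate files disappearing mid-walk since no lock is held.

```python
import os

def cache_size_bytes(cache_dir):
    # Sum the sizes of every file under the cache directory.  No lock is
    # taken: an inconsistent or imperfect snapshot is acceptable because
    # we only need a rough size estimate.
    total = 0
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # file removed or unreadable while we walked
    return total
```

The try/except is what makes skipping the lock safe: a concurrently deleted file just drops out of the total instead of raising.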
So this is what I've done with the git size estimator class so far; that's it right now. The takeaway I have is that we have two heuristics right now and we should work with those. Should I not look for something else? I've invested a lot of time looking into the git internals and JGit, so should I explore that further, or work with the heuristics we have? For me anyway, unless the other mentors have a different opinion, those two heuristics are already going to solve many, many cases, and with no heuristic the fallback is still okay. Our fallback decision, use command line git, is not a bad decision. It would only be disastrous if the heuristic were wrong in the other direction: if it told us that the two-gigabyte Linux kernel should be downloaded by JGit, that would be a disaster. And that's an unlikely outcome, given the non-correlation you found between our estimators and repository size. Okay, so I'll be updating the tracker. Yes, Mark? I had a minor business item: I'm going to be out next week. My son, in his mid-20s, is getting married, so I will be unavailable a week from Friday, because we'll be in the middle of all sorts of things in a neighboring state, and I will probably be a bit unavailable a week from today. So we may need to have someone else host the Zoom session. Rishabh, if you want, you can create your own Zoom account; they do free sessions up to 40 minutes. That would limit the session, but you could host it that way. I apologize I can't, but my son's getting married and I'm delighted. That's good, congratulations! Congrats. He's a grown-up now. It's wonderful; we're delighted with that. It's a great thing.
I'll host the meeting with my personal Zoom account. That's okay? Great, thank you. Okay, so to confirm: I will be here this Friday and will attend our session. I will probably not attend next Wednesday, and I will definitely not attend next Friday, but I will be here this Friday. Okay, thanks a lot. All right. Thank you, guys.