Hello, welcome everyone. I think we can start now. This is the second deep dive of the Create team, and we are going to talk about diffs and commenting on diffs. The first one was with Tiago, and it was great; we have the links on the issue in the Create team tracker. So let's get started.

Today we're going to have a look at what exactly a Git diff is at the level of Git itself, just a brief overview; where we present diffs on GitLab, with a quick demo; and an overview of how diffs are stored and how we fetch and present them, both on GitLab and on Gitaly. We are going to look at the standard comparison view, the diffs on merge requests, and the diffs for comments on merge requests or on commits. So, a little bit of what we are going to do today: the demo, the workflows, the tables that we use to store the diffs, the caching layers, and a brief code dive. Feel free to ask any question at any time. It's no problem to interrupt me, but we are also going to have a Q&A at the end.

So, what exactly is Git diff? Git diff is a Git function that takes two data sets and outputs the difference between them. We can pass commits, ranges, files. This structure shows that a commit points to a tree. Let's say we have a README that points to a blob, and a lib folder that points to another tree. That's basically how the objects are structured in Git, and we can use those SHAs with Git diff itself. It's a very powerful tool. It's worth keeping in mind that we expose the most popular parts of Git diff on GitLab, through the UI and by wrapping the client, but we don't expose everything. On the left is basically what Git diff looks like on the terminal: a simple difference between files.
In this case, we show the header here and the hunks, which we can call Git diff hunks, each with its own header. And on the right side is how we present that on the GitLab UI, so we can see that it's a little bit improved.

Let's have a quick demo. This is the repository compare page. I'm not seeing the chat right now, so let me just check if there's anything. Okay, there's nothing. So we have the repository compare here. What we can do here is select the source branch, or a revision, a commit itself, and the target branch, click Compare, and see basically what Git diff does. It's interesting to note that here we are not touching the database at all, so nothing is persisted. We can't leave comments, and obviously we don't have a merge request. We just have a few files being shown with the differences between them. We can also view the blobs of the files from here, which is nice. And we show a button here saying that we already have a merge request for this difference: we check the source and target and find that a merge request was already created for them.

So here's the merge request UI. As you probably already know, we have the changes here, which is mostly the same thing as the comparison. But here we have the nice functionality of leaving comments anywhere, exactly because we can now unfold the lines here; it's a new feature. And we can leave comments here as well. Any comment that we leave here, we track here as well. If you think about it, this part of the code is just a little bit of context, not the whole diff file as we present it here. We are going to see a little bit of how that happens behind the scenes. So let's go back here. All right, so let's start with... let me move this here. Okay.
So let's start with the workflow of the standard comparison view diffs. Compared to the other workflows it's a little bit simpler, because we don't persist anything; we don't actually need to persist anything right now. For fetching, we just submit a request to Gitaly. There's an RPC that we use to fetch diffs that's common, we use it almost everywhere: the CommitDiff RPC. We are going to take a look at it at the end. We send the limits and the references that we are going to diff. It's interesting to think about limiting, because we can't just fetch everything; we can't handle every possible size. Imagine that we take the first revision of GitLab and the latest revision, let's say master: we're going to have a huge number of diff files. So we need to limit this somehow. Before, we were limiting in the GitLab CE code base itself, and we started moving these limits to Gitaly. So now we are limiting the diff at the source, whereas before we were fetching, putting everything in memory, and then limiting in GitLab CE. It's better to have a service that already does this, and it's written in Go, so it's faster.

For the presentation part, on the GitLab CE side we fetch those diff files as a string; if you think about it, a diff is basically a string. We put this in the diff file, the high-level object, and we parse this string, and the output of that is the diff line objects. We are going to have a quick look at what exactly the diff line object is. Something to have in mind is that we are not loading asynchronously on the comparison view. We are loading everything at once, in a plain HTML request. We are actually planning to move this forward; I think we have an issue for this, and I'll link it when I have it. We are probably thinking of loading this asynchronously.
That's something that we already have for the merge request. For the merge request, let's separate this into three parts, storage, fetching and presentation, because here we do have storage. There are two main tables to have in mind when working with merge request diffs: merge request diffs and merge request diff files. At a high level, a merge request diff is basically a version. For each push that is sent to a merge request, we create a new merge request diff, and this merge request diff points to merge request diff files. So every version creates all of its merge request diff files. That's interesting: on one hand we have everything stored and tracked in the database, and we don't have to go to Gitaly to fetch it. The downside is that we bloat the database a little bit by doing this, and we are planning to move this outside, using object storage for instance. We have an issue for this. I think there's a question. All right, Nick just linked the issue. We have the link here as well, so thanks. Oops.

So basically the workflow is: when a push is received on the source branch, we create the merge request version, as I already said. We have this class that creates the merge request diffs and clears the cache, the highlighting cache. I will talk a little bit about why we cache the highlighting, but we erase the cache here. Just a quick overview: we use Rouge, a Ruby gem. The input is the plain diff and the output is the HTML that we need to render. If you think about doing this on the fly, every time a request comes in, it's a really heavy process. We started doing this caching for plain merge request diffs and after that for the comment diffs, but we'll get to that later on.
As I said, we started worrying about the size of these tables, which is getting way too high. Nick is already working on this issue, which is going to start using object storage, and probably we are going to keep an external reference in the database. I still don't know. But let's see.

So, the fetching workflow: basically, instead of going to Gitaly we go to the database. We are going to have a quick look in the code dive after this, but there's an interesting class to look at. And we refresh the cache: the cache is written when we load the diffs, because we expire the cache after seven days. So when we load for the first time after that, we just rewrite the cache as well. Part of the process also fetches the diff stats. And this part is interesting, just a second. This part is interesting because we used to parse the actual string to get the additions and deletions. The file path we already know, but we had to parse the string. I think this is a bit of a burden, because we have to load it into memory and parse everything, which is CPU bound, so it's not that good. So we left this job to an RPC that we just request, and under the hood it uses the diff stats flag. It does the job for us and just returns the additions and deletions. It's nice.

There's another point: I showed that we can expand the diff and leave a comment. We use Gitlab::Diff::LinesUnfolder to inject lines, because if you think about what the diff file is, it's basically a small part of the file. When we expand it on the UI, we are requesting blobs and just merging in blob lines, and we have to re-merge them when we handle the comment. So the idea is that if someone leaves a comment on an expanded part, we have to merge on the fly to get these lines from the blob itself. So the diff gets bigger because we are merging in the blob lines.
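As an illustration of the diff stats step described above (counting additions and deletions by parsing the raw diff string), here is a minimal sketch in Python rather than GitLab's Ruby. It is not GitLab's code: `count_diff_stats` is a hypothetical helper, and it assumes a unified diff for a single file. In the workflow described above this work is delegated to an RPC instead of being done in the application.

```python
from typing import Tuple


def count_diff_stats(raw_diff: str) -> Tuple[int, int]:
    """Return (additions, deletions) for a single-file unified diff string."""
    additions = 0
    deletions = 0
    in_hunk = False
    for line in raw_diff.splitlines():
        if line.startswith("@@"):
            # A hunk header: everything after it is hunk content.
            in_hunk = True
            continue
        if not in_hunk:
            # Skip the file header lines ("--- a/..." and "+++ b/...").
            continue
        if line.startswith("+"):
            additions += 1
        elif line.startswith("-"):
            deletions += 1
    return additions, deletions


SAMPLE_DIFF = (
    "--- a/README\n"
    "+++ b/README\n"
    "@@ -1,3 +1,4 @@\n"
    " title\n"
    "-old line\n"
    "+new line\n"
    "+another\n"
    " context\n"
)
```

The whole string has to be held in memory and scanned line by line, which is the CPU-bound cost mentioned above.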
So that's kind of a hacky way, but basically the only way to do that.

For the presentation workflow, unlike the standard comparison view, we load the diffs asynchronously, and we use the diff file entity to serialize them, basically as JSON. If you think about the size that diffs can reach, we started seeing that a few diffs were getting way too big. The actual JSON, after processing everything, was eight megabytes for a few diffs, sometimes more than that. So we are thinking about strategies. The easiest one would be checking and making sure that the front end uses everything that is being serialized, because sometimes we reuse serializers and end up sending more than we actually need. That's CPU being wasted, and the user also has to download the whole JSON in the end. So it's not good; we need to make sure that we are actually using everything. The second part, and there's an issue linked here, is that we are discussing a smarter way to load the diffs. Today, on the first load, we show the merge request widget and only then start loading the diffs. So it takes some time, depending on the size of the diffs. So we started thinking: we can load at least the first part of the diff, two or three diff files, and then sequentially batch the requests for the next ones. That way, when users first load the page, they can see at least something, which will probably lead to a better UX. There's some interesting discussion there if someone wants to take a look.

So, the caching layers, as we already talked about a little bit. At the time that we created this merge request diff files table, we had the mission of being able to load the merge request diffs after the merge request was merged. At the time, I think we didn't have the keep-around refs, so we started losing some references.
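The batched loading idea discussed above (a small first batch so the user sees something quickly, then sequential batches for the rest) could be sketched like this in Python. This is only an illustration of the proposal, not an existing GitLab API; `batch_diff_files` and its default sizes are assumptions.

```python
from typing import Iterator, List, Sequence


def batch_diff_files(
    paths: Sequence[str], first_batch: int = 3, batch_size: int = 20
) -> Iterator[List[str]]:
    """Yield a small first batch of diff file paths, then fixed-size batches."""
    head = list(paths[:first_batch])
    if head:
        yield head
    # The remaining files are requested sequentially, batch by batch.
    for start in range(first_batch, len(paths), batch_size):
        yield list(paths[start:start + batch_size])


example_paths = [f"file_{i}.rb" for i in range(8)]
example_batches = list(batch_diff_files(example_paths, first_batch=2, batch_size=3))
```

Each yielded batch would correspond to one request from the front end, so the first response stays small regardless of how large the whole diff is.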
So we needed some way to keep track of the diffs after the merge request was merged, and we started persisting the merge request diff files. As a side effect, we started seeing some performance improvement. On the Redis side, as I said before, we were doing the highlighting with Rouge on the fly. As I said, it's a slow task to do within a request, so we started caching this part. We are going to see a little bit of that in the code dive.

I think the last part is the discussion diffs. If you think about what we present there, it's just a small portion of the diff. We are going to look at this also as storage, fetching and presentation. First, the most important columns that we have on notes. There's the line code: that's basically the file path as a SHA, plus the old line and the new line. It's used to link to the diff itself: when we leave a note there, we can see that there's a link, and that's the anchor that we use. And there are position, original position and change position, which we are going to talk about in a moment. The second table is note diff files. If you think about it, we could be using the merge request diff files, because we already have all versions of them in the database, right? But the problem is that we started deleting the merge request diff files after the merge request was merged, keeping only the latest version to show the diffs quicker. We started deleting to avoid some bloat in the database, but it wasn't enough, so we are working on this right now. The main positions that we have are the original position, the position itself and the change position, which we store as serialized objects, as we can see down below here.
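The line code described above (a SHA of the file path combined with the old and new line numbers) can be sketched as follows in Python. The hash function (SHA-1), the underscore separator and the old-then-new ordering are assumptions made for illustration; `line_code` is a hypothetical helper, not GitLab's implementation.

```python
import hashlib


def line_code(file_path: str, old_line: int, new_line: int) -> str:
    """Build an anchor-friendly identifier for one line of one file's diff."""
    path_digest = hashlib.sha1(file_path.encode("utf-8")).hexdigest()
    # Assumed layout: <sha of path>_<old line>_<new line>
    return f"{path_digest}_{old_line}_{new_line}"


example_code = line_code("README.md", 10, 12)
```

Because the path is hashed, the identifier contains only hexadecimal characters, digits and underscores, which makes it safe to use as an HTML anchor.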
The original position represents the original state. Imagine leaving a comment on the diffs tab: that's the object that we won't change, so it points to the original version. We use that on the discussion tab: if you think about it, the diff never changes on the discussion tab, because that's the state it was in at the time you left the comment. The original position is responsible for this. The position basically follows the merge request version, unless the exact line where the comment was left was changed. And the change position tracks when the exact line that was commented on was changed; that's where the name change position comes from. So we store exactly the point in time at which it changed. And that's what we use here to present, for instance, version eight of the diff. That links to the diff itself, so we can see the difference between versions. There is an issue for actually presenting the difference rather than just linking. I think I have the link, or I'll just leave the link in the comments here.

So, the storage workflow. I think the main takeaway here is that we don't persist everything. We persist the actual diff string from the topmost position of the diff down to the commented line. You may ask why we need everything from the top to the commented line. That's because we need the header information, and the header information normally sits at the topmost position.
So it just cuts the diff there and persists it. It would be possible to cut just the context that we need and rewrite the header on the string, but that would be a little bit more complex, so we just cut everything that we need, persist it right away, and use it for the whole process that already works for parsing the diff and creating diff lines, which I already explained a little bit.

When the diff is updated, that is, when a push is sent, we use Gitlab::Diff::PositionTracer to trace where exactly the position should be on the next diff. Imagine that a push is sent. We already have the comment there on the diffs tab, but we need to know where exactly the comment needs to be in the next version. If the exact line that was commented on was changed, it's outdated, so it won't go to the latest version. If the line just changed position, let's say it was on line 10 and it moved to line 12, we can track with the diff position tracer that we need to present it on that exact line in the next version. So that's the guy that does that.

For the caching layers, we have Postgres again, where we persist the diff hunk. The reason we started persisting this is that we noticed we were requesting Gitaly all the time. Imagine a merge request with 100 comments: we would need to fetch from Gitaly something like 100 times, depending on how many revisions we are requesting. We started seeing that it was getting a little bit out of control, and we noticed that this small amount of persistence would improve performance a lot. And the same problem that we had with the actual diffs of the merge request, we started seeing on the discussion tab as well.
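To make the position tracing idea concrete (a commented line that moves from line 10 to line 12 is followed, while a commented line that was itself changed is not), here is a small Python sketch built on the standard library's `difflib.SequenceMatcher`. This is a swapped-in technique for illustration only: it is not how Gitlab::Diff::PositionTracer is implemented, and `trace_line` is a hypothetical helper.

```python
from difflib import SequenceMatcher
from typing import Optional, Sequence


def trace_line(
    old_lines: Sequence[str], new_lines: Sequence[str], old_line_number: int
) -> Optional[int]:
    """Return the 1-based line number in new_lines of the line that was at
    old_line_number in old_lines, or None if that exact line was changed."""
    matcher = SequenceMatcher(a=list(old_lines), b=list(new_lines), autojunk=False)
    index = old_line_number - 1
    for block in matcher.get_matching_blocks():
        # Each block says: old[a : a + size] is identical to new[b : b + size].
        if block.a <= index < block.a + block.size:
            return block.b + (index - block.a) + 1
    return None


old_version = [f"line {i}" for i in range(1, 21)]
# A push that adds two lines at the top of the file.
pushed_version = ["new 1", "new 2"] + old_version
# A push that rewrites line 10 itself.
rewritten_version = old_version[:9] + ["line ten, rewritten"] + old_version[10:]
```

With these inputs, `trace_line(old_version, pushed_version, 10)` follows the comment to its new line, while `trace_line(old_version, rewritten_version, 10)` returns `None`, which corresponds to the case where the comment stays on the older version.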
We started seeing that processing the highlighting, the actual HTML, was taking a lot of time, so we used basically the same strategy.

I think we have some time for questions right now before the code dive. The question is: could we do away with the keep-around refs now that we have the merge request diff files? Yeah, that's a good question. Actually, yeah, I think we could. We would probably see performance getting a little bit worse, but that's doable.

I don't think we could, because if the merge request is merged and at some point the actual commits that made up the original diff are not in any branch anymore, they get cleaned up. If we want to view those blobs, or if we want to expand some of the hidden diff lines, we still need access to the underlying blob, which might get cleaned up if we don't have a keep-around ref. Okay, so that's why I think so, though if I'm mistaken, I'd love to see an issue about this.

Sorry, Nick. We just keep generating more and more of these refs, and they are a bit scary.

Yeah, absolutely. There has to be a way, and I think the Gitaly team was looking at this, at least years ago already, at deduplicating them or reducing the number of keep-around refs to the minimum set we actually need to cover all those commits, because obviously that's the goal. There are a lot of redundant ones that we should be able to remove. I don't know if anyone's looking at that; maybe we should be.

I think I created an issue a long time ago asking whether we actually need to persist everything. Could we just request Gitaly all the time? Because that's not that bad sometimes, and it would reduce the complexity of having to keep everything in the database, which would reduce the problems with table sizes. I'll link the issue that I created some time ago; it would be nice to discuss this there as well. So let's have a quick look at the code.
I have a few files here. Let's see. The first one is MergeRequests::ReloadDiffsService. As you may guess, it's used when we receive a push. It does a few things here. It creates the merge request diff and the merge request diff files as well. It clears the cache, because we need to get the latest. And it triggers the update of the positions: we need to know where the comments will be, so it just triggers the positions update.

Now something lower level. It would be nice to actually present the Gitaly part, just one piece that I separated here. This is the main RPC that we use to fetch the diffs, and it's pretty straightforward. We receive the left commit and the right commit, and we receive the paths as well. Let's look at the args here, where we are actually calling Git itself. Here we can also see whether we're enforcing limits; that comes from the CE, the GitLab part, which is actually the client. So this checks that, if we are enforcing limits, we actually use the limits, so we set everything here. In this part we iterate over the diff. I think it returns a stream of diffs; these are the objects that are being returned, lots of them. Yeah, it checks that the diff patch size is less than what we can send, so we need to stream it. That's interesting, a little bit of Go code that we use a lot. So let's go back.

As you've seen on the Gitaly side, we use the limits sent from the client, that's the GitLab CE code base itself. We set them here. This class is Gitlab::Git::DiffCollection, basically the class that wraps what's coming from Gitaly. There's some logic here that we can take a look at really quickly. We can see that it sets the overflow flag when it overflows, that is, when it exceeds a certain limit.
I think part of this is already done on the Gitaly side, so we actually need to delete some code here, I think. For instance, the check actually comes from Gitaly: we can see the overflow marker here. So we just set overflow to true and we just stop there. So it carries some of the limiting logic as well. So this is the lower-level wrapper, let's say. On the right side here we also have a lower-level object that fetches from Gitaly, so it also fetches here based on this between method. It's also interesting to know this class. Let me just close this. Yeah, that's interesting. Just a second. All right.

I've opened this because it's interesting to see how the diff file collection works. This is the diff file collection for the merge request diff. If you take a look at the base class, we always call diffable.raw_diffs. So what exactly is a diffable? A diffable is basically a compare object, which I showed before, or a merge request diff: things that we can call raw_diffs on. Each object has its own way to fetch diffs. If you pass the compare, we are going to fetch from the repository itself, because its raw_diffs goes to the repository. If we pass the merge request diff and you look at its implementation of raw_diffs, it actually goes to the database instead. So there's a nice abstraction here: we have the file collection for merge request diffs and the file collection for compare. Each of those subclasses receives the diffable object, which knows exactly how to fetch these diffs. So it's a nice overview. For instance, on the merge request diff, since we use the cache, diff_files just uses the highlight cache to decorate every diff file. Decorate itself doesn't request Redis; it just takes what's memoized and assigns it to highlighted_diff_lines.
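The diffable abstraction described above can be sketched in Python as follows: a collection only relies on its diffable answering `raw_diffs`, while a compare object goes to the repository and a merge request diff goes to stored rows. The class names mirror the ones mentioned in the talk, but the bodies are illustrative stand-ins: the repository is modelled as a plain dictionary and the database rows as a list, both of which are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol, Tuple


class Diffable(Protocol):
    def raw_diffs(self) -> List[str]: ...


@dataclass
class DiffFile:
    raw: str


class Compare:
    """Fetches raw diffs from the repository (here: a dict keyed by ref pair)."""

    def __init__(self, repository: Dict[Tuple[str, str], List[str]], base: str, head: str):
        self.repository = repository
        self.base = base
        self.head = head

    def raw_diffs(self) -> List[str]:
        return list(self.repository.get((self.base, self.head), []))


class MergeRequestDiff:
    """Fetches raw diffs from persisted rows (here: a list of stored strings)."""

    def __init__(self, stored_diff_files: List[str]):
        self.stored_diff_files = stored_diff_files

    def raw_diffs(self) -> List[str]:
        return list(self.stored_diff_files)


class DiffFileCollection:
    """Wraps any diffable; it never needs to know where the diffs come from."""

    def __init__(self, diffable: Diffable):
        self.diffable = diffable

    def diff_files(self) -> List[DiffFile]:
        return [DiffFile(raw) for raw in self.diffable.raw_diffs()]


example_repository = {("master", "feature"): ["diff --git a/README b/README"]}
from_compare = DiffFileCollection(Compare(example_repository, "master", "feature"))
from_merge_request = DiffFileCollection(
    MergeRequestDiff(["diff --git a/lib/a.rb b/lib/a.rb", "diff --git a/lib/b.rb b/lib/b.rb"])
)
```

The design choice is that swapping the storage (repository, database, or later object storage) only changes the diffable, not the collection or anything that renders it.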
We can take a quick look at this here. Yeah. If you take a look, it just assigns the highlighted diff lines: we iterate over the content and create instances of diff lines based on what we had in the cache.

So we have the diff file here. Actually, there's a lot going on here, but let's take a look at diff_lines so that we can understand a little bit of the rendering flow. When we call the lines, we use the parser on the raw diff itself, that's the string; we pass the iterator here, the raw diff each_line. And we can see here that we parse the actual string, so those lines are actually an array of strings. And here we can see that we check whether we found the header. Since we know where the header is, we know exactly which line comes next, so we can create multiple lines. That's the core of rendering the diff lines, because we serialize those lines, and the front end just takes everything that's already there and renders it. In theory, hopefully.

Then there is a slightly lower-level wrapper of the diff. It knows a little bit more about the limits themselves, so it has some constants for the sizes that we should use as limits. It also knows that we should request Gitaly, the repository itself, so it has a few methods that are interesting for the fetching part. So it's not as high level; it has things that are more related to fetching. Let's see. Yeah, it memoizes a few things, like too large and collapsed for instance, so when this object is passed to the diff file, that's already memoized. Let's say this logic is already prepared, and we serialize these attributes. So the logic of knowing exactly how to memoize those attributes should be here, for the most part.
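The parsing flow described above (iterate over the raw diff line by line, detect the hunk header, then number the following lines from it) can be sketched in Python like this. It is a simplified illustration, not GitLab's parser: `DiffLine`, its field names and the `kind` labels are assumptions chosen for the example.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional

# Matches hunk headers such as "@@ -1,3 +1,4 @@" and captures both start lines.
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,\d+)? \+(\d+)(?:,\d+)? @@")


@dataclass
class DiffLine:
    text: str
    kind: Optional[str]  # "match" (hunk header), "new", "old", or None for context
    old_pos: int
    new_pos: int


def parse_diff(lines: Iterable[str]) -> Iterator[DiffLine]:
    old_pos = new_pos = 0
    in_hunk = False
    for raw in lines:
        header = HUNK_HEADER.match(raw)
        if header:
            # The header tells us where numbering starts on each side.
            old_pos, new_pos = int(header.group(1)), int(header.group(2))
            in_hunk = True
            yield DiffLine(raw, "match", old_pos, new_pos)
        elif not in_hunk:
            continue  # file header lines before the first hunk
        elif raw.startswith("+"):
            yield DiffLine(raw, "new", old_pos, new_pos)
            new_pos += 1
        elif raw.startswith("-"):
            yield DiffLine(raw, "old", old_pos, new_pos)
            old_pos += 1
        elif raw.startswith("\\"):
            continue  # "\ No newline at end of file"
        else:
            yield DiffLine(raw, None, old_pos, new_pos)
            old_pos += 1
            new_pos += 1


raw_diff_lines = [
    "--- a/README",
    "+++ b/README",
    "@@ -1,3 +1,4 @@",
    " title",
    "-old line",
    "+new line",
    "+another",
    " context",
]
parsed_lines = list(parse_diff(raw_diff_lines))
```

This also shows why the header matters for stored comment hunks: without it there is no starting point for the old and new line numbers.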
So here's the line that we created from the cache, which I just showed. It takes a few attributes here, and if you look at the cache again, we call init_from_hash, and that's what it's taking here. It knows how to respond to added, removed, meta and changed, so it knows what exactly the line represents. That's basically what the object should do. And yeah, we already saw that.

This is the entity, the serializer itself, for each diff file, so it just knows how to serialize. For the diff lines it calls the diff file, and what it's calling is the diff file that I just showed. Hopefully we are going to remove these parallel diff lines and use just the diff lines for the serializer. I think we have a merge request to remove the parallel lines and do this in the front end. It's good because we remove a few extra megabytes, maybe. Okay, nice. At least we are making it a little bit faster, but it's interesting to note that the front end will be doing a little bit more work.

But yeah, let's see. So, the diff note. The diff note is the persistence layer. We have a few validations on the original position, the position and the line code. I think the main part to show here is the fetching of the diff file. If we take a look, it's pretty straightforward: we check whether we have the persisted note diff file, which I just showed, we have a table for it, and we just create the instance of the diff file from it. If we don't have the persisted note diff file, because it's an old merge request and it wasn't created yet, we try to fetch it from the merge request diff files of the merge request itself. Because we keep the latest ones, sometimes you can fetch this from the database instead of going through the least performant path, which is fetching from the repository itself.
As I said, we do some magic to unfold lines, with the lines unfolder that we can see here, this LinesUnfolder class. The whole point is that it takes the position and the diff file, and it knows exactly which lines we need to fetch from the blob to present the block of code that the user just commented on. So the return value, the unfolded diff lines, is the diff lines merged with the extra blob lines that we fetched. Just a little bit of magic, but it makes sense.

Okay, so this part here is Gitlab::DiscussionsDiff::FileCollection. The idea here is to wrap all the discussion diffs. Imagine the discussions that we have on a merge request. This guy just takes all the truncated diff files and knows how to load the highlighting from the cache and how to find a file by ID. So it's basically a wrapper for the discussion diffs. It also calls the highlight cache that we can see here. At this point you might be asking yourself why we have two different classes for the same caching strategy. The idea is that for the discussion diffs, we could erase the cache when someone deletes a comment. So it's a little bit different from the actual diff of the merge request. Here we are able to delete the cache for just one diff file, unlike the other cache, where we write everything at once. We do have separate keys in this cache. We use Redis at a slightly lower level, but we are also able to pass multiple keys and write everything at once. For now we are not erasing individual diff files, just writing and reading everything, and we rely on the expiration time, which is seven days. But if we want to start erasing more quickly, we could implement that really easily with this guy.
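The caching strategy described above (one key per discussion diff file, multi-key reads and writes, a seven-day expiration, and the option to clear a single file) can be sketched in Python with an in-memory dictionary standing in for Redis. The class name, method names and key format are hypothetical and only illustrate the shape of the idea.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

SEVEN_DAYS = 7 * 24 * 60 * 60


class DiscussionHighlightCache:
    """In-memory stand-in for a Redis-backed cache with one key per diff file."""

    def __init__(self, ttl: int = SEVEN_DAYS, clock: Callable[[], float] = time.time):
        self._ttl = ttl
        self._clock = clock
        self._store: Dict[str, Tuple[float, List[str]]] = {}

    @staticmethod
    def _key(diff_file_id: int) -> str:
        return f"highlighted-diff-file:{diff_file_id}"  # hypothetical key format

    def write_multiple(self, highlighted: Dict[int, List[str]]) -> None:
        # Write every diff file at once, each under its own key.
        expires_at = self._clock() + self._ttl
        for diff_file_id, lines in highlighted.items():
            self._store[self._key(diff_file_id)] = (expires_at, lines)

    def read_multiple(self, diff_file_ids: List[int]) -> List[Optional[List[str]]]:
        # Missing or expired entries come back as None, in request order.
        now = self._clock()
        result: List[Optional[List[str]]] = []
        for diff_file_id in diff_file_ids:
            entry = self._store.get(self._key(diff_file_id))
            result.append(entry[1] if entry and entry[0] > now else None)
        return result

    def clear(self, diff_file_id: int) -> None:
        # Possible only because each diff file has its own key.
        self._store.pop(self._key(diff_file_id), None)
```

A caller would treat a `None` entry as a cache miss, highlight that diff file again, and write it back with `write_multiple`.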
So, again the lines unfolder, which we already talked about. Then this one is a really big and somewhat complex class. The idea is to take the old refs and the new refs at the point when the push is received, and to work out where exactly the comment needs to be after the new push. I think the output here is a result; others may know it a little bit better, but basically it's the new position and whether it's outdated or not. It returns what we need to know to persist the position. The last class here is basically the serializer for the discussion. It just knows that we need to take the truncated diff lines and return them to the front end. So it's pretty straightforward. I think that's it for the code dive. And I think that's the last question part. Does anyone want to ask anything?

Everyone's confident that if I assign them a merge request diff feature next month, they'll be fine?

I have a quick question. Do we handle image diffs, and if so, how do you feel it fits into the current architecture?

Sorry, could you repeat?

Do we ever handle image diffs, like a side by side on a merge request, showing what changed in the images or something? I don't know if we currently do that, and if we don't, how do you see it fitting into the current architecture?

Yeah, that's a good question. We do have something; let me take a look here at the diff position. We were talking about the position itself. What we do today is store the width and height of the image and where exactly the comment was made, the position, so X and Y. As for having a side by side, I'm not sure right now, but I can come back with a good answer for this. Sorry.

So, the image diff commenting functionality.
I think by default we have that option of switching between old and new, and the onion skin kind of view, where you can drag the line and see the images turn into one another. We support commenting there too, and that's what I was pointing out with those X and Y coordinates, which we use instead of new line and old line when you're commenting on a changed image. Everything related to image diff commenting is implemented on the front end. Only the diff notes, when we store them, of course need a diff position to know what they are attached to, and in this case that's an X and Y rather than an old line and new line. The file path of course applies to images too, but with X and Y instead of old line and new line.

From the perspective of the architecture, a diff is a collection of diff files, and each diff file has diff lines. If it's a text file, the diff file will have a number of diff lines, but if it's an image diff file, there is just the one. The diff file implements a viewer method that returns a viewer, which I think I've already mentioned as well; in the case of text diffs it will be a text diff viewer, and in the case of images it will be an image diff viewer. This is also communicated to the front end, so the front end knows which diff rendering and diff comment rendering logic to branch into. The image diff viewer doesn't implement much logic here, because it includes ClientSide on line six, which means that it leaves the responsibility of actually rendering this diff to the front end. And the front end knows how to allow commenting there.

Yeah, that's right. Thanks for that. So I think that's it. I hope everyone enjoyed it. See you at the next deep dive; that's soon. You can take a look at the issue, there are a few scheduled already, so that's cool. I think the next one's going to be on Git LFS in about a month by fun.
If no one else takes a spot in between, that is. So that's it. Thanks a lot. Thanks.