Hey, Stan. Thanks for joining. I wanted to talk with you a bit about GitLab Geo and where the architecture is going. GitLab Geo, for the people watching, is our product that replicates GitLab across geographies, and it works by keeping track in a database of what has been replicated, the most important thing being repositories. I've seen some problems in the past. Well, the product is getting a lot better. We're using it for our own migration, and I think there's been a ton of bug fixes, so we're doing really well there. But I've seen some comments about lock files and about the cursor, and I was wondering what was behind that. Can you maybe explain a bit why we have lock files that need to be cleaned up?

Lock files? I'm not sure what context that is. Maybe we don't have lock files.

It might have been in the release post.

I think what we talk about there is that when you do something on a Git repository, Git will create a lock file, basically saying, hey, don't touch this reference because I'm doing something with it right now. What happens on some customers' instances is that these files get created but never get cleaned up, so when Git tries to do something later, it says, hey, I'm stuck, I can't do anything about it.

So that's not really a Geo problem.

It's not a Geo problem, but it bites us, because we've seen customers say, hey, I can't replicate this thing anymore because this lock file is staying around.

Because of the lock file, Geo can't replicate, and the customer can't do anything either, but it shows up in Geo first.

Exactly. And we see this on gitlab.com too. Gitaly has a thing that basically says, hey, this lock has been stale for longer than X amount of hours, it's no good, we're just going to remove it. So we're now taking advantage of that: if there's a failure, we try cleaning it up, and hopefully that resolves the problem.

That makes so much sense. Okay, cool. Now I get it: they're Git lock files. I was thinking, oh, Geo has lock files, we've got a problem here.

Yeah, yeah, yeah.

Then the other thing I saw is that we're cleaning up or deleting part of the logs.

Well, we keep a log of the things we've done, or the things we still have to do. For example, every time you push a new commit to GitLab, we log an event in the database to say this repository has been updated. That table gets really big over time, right? It's one of the biggest tables we have on GitLab, because every time somebody pushes we get a new row, and that can balloon very quickly into millions and millions of rows. So at some point you do want to clean it up, because it's a waste of disk space and it's going to make maintenance a lot harder. So we have a thing in Geo that periodically goes and says, hey, can I clean this up, and goes ahead and purges data from there.

Why does it make maintenance harder?

Just that you now have a bigger table: inserts are going to take longer, you've got to back it up. Everything about database maintenance comes into play here.

That makes sense. I was hoping, oh well, databases nowadays can handle everything. They can certainly handle the few pushes we throw at them. But I understand that maybe we're talking about tens of millions of rows.

Yeah, we're talking about millions of rows over time. It's not a trivial amount.

Makes sense.
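To make the lock-file cleanup concrete, here is a minimal sketch of the kind of housekeeping described above: removing Git lock files that have not been touched for longer than some threshold. It is only an illustration of the idea, not Gitaly's implementation, and the repository path and threshold are placeholders.

```python
import time
from pathlib import Path

# Illustrative threshold: "stale for longer than X amount of hours".
STALE_AFTER_SECONDS = 12 * 3600


def remove_stale_lock_files(repo_path: str) -> list[str]:
    """Remove *.lock files under a bare repository that look abandoned.

    A live Git operation normally holds its lock only for seconds, so a
    lock file that has not been modified for STALE_AFTER_SECONDS is
    treated as left over from a crashed or interrupted process.
    """
    removed = []
    now = time.time()
    for lock in Path(repo_path).rglob("*.lock"):
        try:
            if now - lock.stat().st_mtime > STALE_AFTER_SECONDS:
                lock.unlink()
                removed.append(str(lock))
        except FileNotFoundError:
            # Another process (e.g. Git itself) already cleaned it up.
            continue
    return removed


if __name__ == "__main__":
    # Hypothetical repository path, purely for illustration.
    print(remove_stale_lock_files("/var/opt/gitlab/git-data/repositories/group/project.git"))
```

The point in the conversation is simply that Geo can now trigger this kind of cleanup when a sync fails, instead of staying stuck behind a stale lock.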
In an ideal world you'd use something like TimescaleDB for that events table, so you can quickly drop old data, but that's probably an optimization we don't need yet. We can just delete those records, and then at least the B-tree doesn't get as deep.

Okay, that's cool. Then we had the thing with the cursor. I think the situation is that we set a cursor every now and then to mark how far we've caught up, right? That always seemed like kind of a fragile system to me. There's some problem with out-of-order records; let's go into that later, but I'm surprised it works at all. Isn't there some situation where we have a failure or something and the cursor can't move? Or do we then just say, hey, this thing failed, we'll still move the cursor on?

We've got a process that just sends these out to a background job, and it's not a reliable queue that makes sure we've actually processed that thing successfully, so we do have failures on these events. The idea, the hope, was that we schedule these as Sidekiq jobs, and Sidekiq has a retry mechanism, so you can say, hey, retry this job X amount of times, or infinitely, and it has a backoff mechanism too. So we kind of put the onus of that retry mechanism on Sidekiq.

And then we had the problem, I think, where we moved the cursor along and then records appeared behind the cursor.

That's right, because Postgres doesn't guarantee that every primary key ID actually shows up in sequence; it's really just a unique ID. When you say, hey, I'm going to create a new row, Postgres gives you back the next sequential ID, but that doesn't mean it's in the database yet; nobody can see it until it's actually committed. So we get this leaky bucket problem: let's say we ask for all events above ID two; ten might show up, but three through nine might only show up later, at a different time. It doesn't happen that often, but in a concurrent system it's a problem.

How do you think about fixing that?

The short-term way of fixing it, one way to fix it, is just to delay looking at the events for some time. That's the quick thing we're starting to look at.

Wait for a minute and... Wow.

We haven't actually committed to that. We did some experiments yesterday that say, hey, that helps, but it doesn't completely solve the problem. Another short-term bandaid is to just keep track of these gaps: hey, I processed one through ten, but I'm missing nine, so at some later point go back and check whether nine shows up, and then resume. That's, again, a bandaid; it's not great. The way you want to solve this is with a more reliable queue. If you had a queue, you could just worry about taking an event, processing it, and dropping it when you're done. So we're talking with the database team about how we can use Postgres in a way that's more of a real queue system. Can we build a queue to do this in Postgres? And it's a hard problem, because we're talking about a replicated queue, right? The secondaries only have read access to the database, so I can't just write ten events into a table and have the secondary remove them from that table when it's done. So you need to replicate that data somehow. For example, Postgres has this thing called logical replication, where you can say, hey, I want this table copied over to my secondaries, and then the secondaries can do whatever they want with it.
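Here is a minimal sketch, under assumed names, of the gap-tracking band-aid described above: advance a cursor over event IDs, but remember any IDs that have not shown up yet, so a row that commits late can still be processed instead of being silently skipped. It is an illustration of the idea, not Geo's actual code.

```python
import time


class EventCursor:
    """Track a high-water mark over event IDs while remembering gaps."""

    # Hypothetical: give straggling transactions ten minutes to appear.
    GAP_RETENTION_SECONDS = 600

    def __init__(self, last_id: int = 0):
        self.last_id = last_id
        self.gaps: dict[int, float] = {}  # missing id -> when it was first noticed

    def observe(self, visible_ids: list[int]) -> list[int]:
        """Record newly visible event IDs and return the ones ready to process."""
        now = time.time()
        ready = []
        for event_id in sorted(visible_ids):
            if event_id in self.gaps:
                # A straggler finally committed and became visible.
                del self.gaps[event_id]
                ready.append(event_id)
            elif event_id > self.last_id:
                # Any skipped IDs between the cursor and this event are gaps.
                for missing in range(self.last_id + 1, event_id):
                    self.gaps.setdefault(missing, now)
                self.last_id = event_id
                ready.append(event_id)
        # Give up on gaps that never appear (e.g. rolled-back transactions).
        self.gaps = {i: t for i, t in self.gaps.items()
                     if now - t < self.GAP_RETENTION_SECONDS}
        return ready
```

For example, observe([2, 10]) returns [2, 10] and records 1 and 3 through 9 as gaps; a later observe([5]) returns [5]. The delay window mentioned in the conversation is complementary: waiting before reading new events gives most in-flight transactions time to commit, so fewer gaps appear in the first place.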
So we'd either have to implement that ourselves, use that Postgres logical replication feature, or switch to some real queue to handle this for us.

That makes sense. I saw your suggestions there, and I certainly hope we don't have to introduce another piece of technology here. Just a question: I think because the database is replicated, we cannot do a join between this replicated piece of the database and some other table where we keep our state?

We can do that, and that's why we have this foreign data wrapper, because it allows you to join across different databases. So we do use that to some extent, but it's complicated, because now you're talking about two different databases. If we could have one database that the Geo secondaries work off, you could basically be doing joins within that database: you'd hopefully have the event table in that one database, and we could do away with this whole dual-database problem that we have right now.

Crazy idea, and I'm sure you've considered it and it's not practical, but what about the secondaries reporting their status to the primary? So the primary says, oh, this is now replicated on that secondary.

I think we talked about that a while ago. I think the issue is that you've got millions of repositories. What would the primary do with that information?

The problem is a bit that you can have multiple secondaries, so you really don't want to change the schema and add a column per secondary. You'd need to use something like hstore, and then I'm not sure how efficient your lookups are anymore after that.

Are you saying the primary would generate an event for every secondary and say, hey, delete this from your table when you're done with it? Or, I'll figure it out, I'll remove it once I get a report back that the secondary is done with it?

Kind of, yeah. So actually note the status: there can be multiple secondaries, so you generate a record saying there was a change, this needs to be replicated, and then in that record you store, oh, secondary one replicated this, secondary two replicated this, et cetera.

Yeah, I mean, as you add more nodes, the primary is obviously going to do a lot of work, right? The primary is going to be receiving all these requests back from the secondaries and updating its own tables, and it might become a tough thing to scale, because you add two or three nodes and that puts a lot of load on the primary.

That makes sense. Do we have people running multiple secondaries?

Yeah, I saw one customer running four secondaries.

Yeah. And the primary is always going to be the bottleneck, so we really can't have that. Okay, that makes sense. The secondaries, I always think of them as pretty over-provisioned. How hard is it to just copy those records coming off the primary into their own database table, and then you can start modifying them?

Yeah, that's basically implementing logical replication ourselves, and then you have the same issue of how do I know what has already been copied, right?

And that was, we used to send giant sets of IDs. That obviously doesn't scale, and then, oh, you could use a cursor, but you have the same cursor problem.

Yeah, exactly. I thought about that too; it's the same problem.
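The foreign data wrapper mentioned above is what lets one Postgres database run joins against tables that physically live in another database. Below is a rough sketch of that setup using postgres_fdw; every connection detail, database, table, and column name here is an assumption made up for illustration, not necessarily Geo's actual configuration.

```python
import psycopg2

# Hypothetical scenario: a Geo tracking database joining against a "projects"
# table exposed from the replicated GitLab database through postgres_fdw.
conn = psycopg2.connect(dbname="gitlabhq_geo_production", user="gitlab_geo")
conn.autocommit = True

setup_statements = [
    "CREATE EXTENSION IF NOT EXISTS postgres_fdw",
    """CREATE SERVER IF NOT EXISTS gitlab_replica
           FOREIGN DATA WRAPPER postgres_fdw
           OPTIONS (host 'localhost', dbname 'gitlabhq_production')""",
    """CREATE USER MAPPING IF NOT EXISTS FOR CURRENT_USER
           SERVER gitlab_replica OPTIONS (user 'gitlab_geo', password 'secret')""",
    "CREATE SCHEMA IF NOT EXISTS gitlab_replica",
    """IMPORT FOREIGN SCHEMA public LIMIT TO (projects)
           FROM SERVER gitlab_replica INTO gitlab_replica""",
]

with conn.cursor() as cur:
    for statement in setup_statements:
        cur.execute(statement)

    # One query can now join the local tracking table with a foreign table
    # that actually lives in the other database.
    cur.execute("""
        SELECT p.id, r.last_repository_synced_at
          FROM gitlab_replica.projects AS p
          JOIN project_registry AS r ON r.project_id = p.id
         LIMIT 10
    """)
    print(cur.fetchall())
```

The complication pointed out in the conversation is visible even in this toy version: you now administer two databases, and every query has to know which side each table lives on.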
So we can either implement that replication ourselves, with the same problems, or we let Postgres solve it for us.

Yeah, and we have an issue to investigate that: if we did Postgres logical replication, what are the pitfalls? Is it actually going to help, or is it going to cause more problems?

Yeah, makes sense.

Yeah, and it's hard, because the question we have to answer is: which records of this table haven't I seen yet?

Yeah.

Without modifying the table.

Exactly. And we already do this sort of thing: when we try to figure out which repositories we're missing from the primary, we do a lot of these big anti-joins, as they're called. I have this in my table, you have that in your table, what am I missing, right? And as you get closer and closer to being fully synced, that takes longer and longer, because you don't find gaps anymore, so the database has to do a ton of work. So we're already facing that kind of problem today.

Yeah, and Bloom filters and stuff like that don't help with any of this.

Yeah. I see the problem here. The best bet is still probably logical replication.

Cool, so the problem is clear to me. Thanks for that. And when you have a solution, I'd be very happy if you ping me so I can have a look at it, because this is very interesting. I'm wondering what you'll come up with. What Geo already has is that if you clone something over HTTP, you can push back to the same secondary server, and I think you'll be redirected or something. Go ahead.

Yeah, it'll be redirected. The client will be told to go push to the primary: it will automatically just redirect you there, and the primary will receive the push.

That's awesome. When do you think we'll have something for SSH, and what do you think the solution will look like?

Yeah, so we've actually done some cool work that demonstrates that this works, and we're trying to get it into the next release. It's still being reviewed right now, but essentially, when you push over SSH to the secondary, it says, oh, I can't do this, I'm not actually the primary. So it'll open up an HTTP connection to the primary behind the scenes and basically proxy that SSH Git traffic to the primary and back, because you can't open another SSH connection on top of an SSH connection; there are some security issues with that. So that seems to be working. We're trying to get that in for 11.2. It's a big change, so it may take some review cycles, but I've seen the demo. It works at a proof-of-concept level, but there may be some more iterations there.

That's very exciting. I'm very excited that you found a solution there. I've seen this approach before in the industry, so I think this is the way to go; I think you figured out the right solution to this. I think after that, the next big thing is disaster recovery. Is that so? Is that the next big chunk we'll work on?

Well, I guess, what do you mean by disaster recovery? I mean, in a sense, Geo already kind of is a disaster recovery solution. There's a lot more we can do with checks and verifications and kind of automatic recovery: if, for example, something is out of sync, we should detect that something's not quite right and then go recover from that. So there's a lot of robustness, additional checks, that we need to implement.

Cool.
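To picture the anti-join described above, here is a sketch reusing the illustrative names from the previous snippet: list the projects the primary-side table knows about that have no tracking row on the secondary yet. The table and column names remain assumptions for the example, not Geo's exact schema.

```python
import psycopg2

# Same hypothetical setup as before: "gitlab_replica.projects" is the foreign
# table exposing the primary's data, "project_registry" is the local tracking
# table on the secondary. The LEFT JOIN ... IS NULL shape is the anti-join.
conn = psycopg2.connect(dbname="gitlabhq_geo_production", user="gitlab_geo")

MISSING_PROJECTS_SQL = """
    SELECT p.id
      FROM gitlab_replica.projects AS p
      LEFT JOIN project_registry AS r ON r.project_id = p.id
     WHERE r.project_id IS NULL      -- no tracking row: never synced
     ORDER BY p.id
     LIMIT %(batch_size)s
"""

with conn, conn.cursor() as cur:
    cur.execute(MISSING_PROJECTS_SQL, {"batch_size": 1000})
    missing = [row[0] for row in cur.fetchall()]
    print(f"{len(missing)} projects still need an initial sync")
```

The load problem described in the conversation follows from the shape of the query: as the secondary approaches a full sync, the database still has to walk both sides only to confirm that almost nothing is missing.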
What's the next big thing after the SSH push?

Yeah, I think it's going to be not only solving this leaky bucket problem we talked about earlier, but also a lot of user interface improvements. Right now it's hard to see what's going on: if you've got 10,000 repositories that are not quite synced, it's kind of hard to see; you've got to use the API to generate this list and then figure it out. So we'll have a UI that's much more friendly for administrators to go in and say, hey, this is the list, and I can either re-sync them all right now, or do it manually, or just wait for Geo to do it. And there are, like I said, performance improvements: this whole anti-join problem I talked about earlier, hey, how do I find out what's missing, puts a lot of load on the database right now. So if we can solve that along with the leaky bucket problem, Geo will be even more performant. So there are a lot of improvements under the hood that have to be done over the next couple of months.

Yep, and we're doing this because we're running it on gitlab.com, which is way bigger than any of our customer installations, so they get really good performance and we're at the leading edge.

Absolutely. Most of our customers don't deal with five million projects, half of which are being updated constantly.

Wow, that many of our projects are active? That's surprising to me.

That was surprising to me too, because when I did the numbers a couple of months ago I asked, well, how many of them actually received a push in the last 24 hours, and it was something like 40 to 50% of them. It might be mirrors, but we have a lot of activity on repositories.

Wow. Because we've got, I think, 500,000 active users or something, out of like two and a half million registered. So that's only like 20%, and then half of the repos are active. It's super surprising.

It is surprising to me. I'll run the numbers again, but it was surprising when I saw it; I assumed it was like 10% or less.

So we've got five million repos, two and a half million active.

That was back when it was four million, so the numbers have gone up since then.

So you think five million repos, three million active?

It's possible. I've got to do the numbers again, I'm not sure, and my numbers could have been wrong, but we now track the last updated date of the repos, so we have a pretty good sense of when we got a push from them.

That's interesting. Well, ping me if you get those numbers. That's interesting, and thank you very much for the explanation. It's very interesting stuff. And I'm super glad that we shipped Geo and that we're using it for the gitlab.com move between clouds.

Me too, yeah.

The next question is: after the move, what do we do? I'd love to still have that resource available, but that's an expensive proposition. I think we should do it. I think we need some disaster recovery too, and I expect we'll also need to be able to move GitLab between regions, and this will help with that.

Great. Awesome. Thanks. Thank you.