Do you want to let me know when you start recording? I think we're going, so go ahead. Okay, so let me start the presentation.

Hey everyone, my name is Tiago. I'm a backend engineer on the Create team, and I want to welcome you to the first deep dive of the Create team, where we are going to be talking about the pull mirroring system. So let's jump right into it. The purpose of this talk is for me to share my knowledge of the pull mirroring feature with the entire GitLab team and to make this deep dive session a reference for everyone who might need to work with pull mirroring in the future. Here's the table of contents; I'm going to give you a couple of seconds to skim through it. Moving on: we will not talk about mirroring through SSH, push mirroring, or bi-directional mirroring, due to time constraints.

So, what is pull mirroring? It's a feature that is available on the GitLab Starter and Bronze tiers. Basically, it automatically pulls the changes from an external repository into a project in GitLab. We make an effort to keep a healthy mirror synchronized with the external repository at least every 30 minutes, and a user is also able to update more often by using a functionality that we call "update now". We also make an attempt to handle common failure scenarios gracefully. It's very useful for teams that have a canonical version of their code in an external code hosting service and want to have a secondary version hosted on either GitLab.com or their own instance. For example, say users have their code hosted on an external code hosting service and they want to leverage our CI service: they can set up a pull mirror that is kept in sync and runs all the pipelines that were configured for that project, which is a really common use case for this feature.

So, let's quickly do a demo. Right from the start we have a really basic project that I created on GitLab.com, and I have a bare-bones project that I created on my own instance. We're just going to quickly copy the HTTPS URL from this project, go to Settings > Repository, and then to "Mirroring repositories", and right from the start you see a lot of options that are made available to you. You can just paste the URL, and this URL can use HTTPS, SSH, or Git; all of them are supported. You can also choose the direction of the mirror, either pull or push. In this case we want to synchronize into my own local instance, not push towards the external one, so we're going to select pull. The authentication method will not be necessary because we are dealing with public repositories; private ones would probably require authentication. Then we have a couple of other options. The mirror user is going to be the user that is the author of every event that is created due to the mirroring. There is also "Overwrite diverged branches": when both sides have a branch that has diverged, this option overwrites that branch instead of just failing and not mirroring at all. "Trigger pipelines for mirror updates" does what I already mentioned. And with "Only mirror protected branches", based on the specific set of rules you put in place for protected branches, you mirror only those branches; it's really useful, for instance, for something that we call bi-directional mirroring, which I won't get into on this call.
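By the way, on an EE instance you can configure the same pull mirror through the Projects API instead of the UI. Here's a minimal sketch, assuming project ID 42 and a token in the GITLAB_TOKEN environment variable; double-check the attribute names against the API docs for your version:

```ruby
require "net/http"
require "uri"
require "json"

# Hypothetical values: adjust the host and project ID for your instance.
uri = URI("https://gitlab.example.com/api/v4/projects/42")

request = Net::HTTP::Put.new(uri)
request["PRIVATE-TOKEN"] = ENV.fetch("GITLAB_TOKEN")
request["Content-Type"] = "application/json"
request.body = {
  mirror: true,                                 # enable pull mirroring
  import_url: "https://gitlab.com/some-group/some-project.git",
  mirror_trigger_builds: true,                  # "Trigger pipelines for mirror updates"
  only_mirror_protected_branches: false,
  mirror_overwrites_diverged_branches: false    # "Overwrite diverged branches"
}.to_json

response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
  http.request(request)
end

puts response.code # 200 means the mirror settings were accepted
```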
So let's go ahead and create the mirror repository. As you can see, it created a mirror for us, and if we refresh, it gets updated straight away since this is a really small project. There you go, it updated 13 seconds ago, and if you go to the main page of the project you can see that we now have the contents of the external repository. Another cool feature: if I go to the main repository and make a change to one file, then go back to my local instance, to the mirroring settings again, there's this "update now" button that we make available, so we can click it to update the mirror as soon as possible. Let's click that now.

"What does that do exactly, just set the next execution time to now?"

Yeah, exactly, it sets the next execution timestamp, so the project will get picked up by the scheduler next. I will get into more detail further down the line, but basically it jumps ahead of everyone in the queue, so that it's the next project being picked up. And if we go to the dashboard you can quickly see that the project has been updated.

So, on with the presentation. There are a couple of key factors that make this feature work: the capacity, the transitions of states, the state management, and determining when a mirror update will be attempted again. I also want to share some key metrics to give you a sense of the scale at which this operates on GitLab.com: we have over 50,000 mirrors being updated within the last 30 minutes, so it's a pretty large scale.

Let's first talk about one of the key factors, the capacity. It's basically a Redis set that contains the IDs of all the projects that are currently waiting to get updated or are already being updated. The total capacity is a fixed number that can be configured by any GitLab instance administrator, and it is basically used as a way of limiting the size of the Sidekiq queue, so that you don't schedule too many projects at once; it's essentially a bottleneck that we need to have. If you want your mirrors updating as frequently as possible, I suggest that you set the capacity a bit higher than the concurrency that Sidekiq has configured; that will ensure that Sidekiq always has jobs to perform, and it will in essence translate into more frequent updates. On the other hand, if you want less frequent updates and you don't want Sidekiq performing jobs all the time, you can lower the capacity, and instead of the actual Sidekiq concurrency being the bottleneck, you will instead have the capacity as the bottleneck.
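To make that concrete, here's a minimal sketch of the idea behind the capacity set; the key name, the fixed maximum, and the method names are illustrative, not the exact ones from the codebase:

```ruby
require "redis"

# Minimal sketch of a capacity-limited scheduling set, assuming a local Redis.
module MirrorCapacity
  KEY = "mirror:capacity:project_ids" # hypothetical key name
  MAX_CAPACITY = 100                  # configured by the instance administrator

  def self.redis
    @redis ||= Redis.new
  end

  # Called when a mirror transitions to scheduled/started.
  def self.increment(project_id)
    redis.sadd(KEY, project_id)
  end

  # Called when a mirror transitions to finished/failed.
  def self.decrement(project_id)
    redis.srem(KEY, project_id)
  end

  # How many more mirrors the scheduler may enqueue right now.
  def self.available
    MAX_CAPACITY - redis.scard(KEY)
  end
end
```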
"Why would we pick one over the other?"

What do you mean?

"Why would we have the capacity be the bottleneck instead of Sidekiq?"

Basically, as I mentioned, we have a configured concurrency for Sidekiq, and the Sidekiq queue is managed by the capacity that we set. If we make the capacity lower while the concurrency is higher, Sidekiq will pick up every job and we will run out of things to run, instead of always having project IDs ready to be scheduled at a moment's notice. Does that make sense?

"So ideally they're the same, or similar, just to keep Sidekiq busy?"

Ideally the capacity will be a bit higher, just to keep Sidekiq busy. If you make it way too high, like 5 million, it won't make a difference, because we don't have that many projects; it will only translate into a bigger Sidekiq queue, which is not what you want. Usually we recommend setting the capacity, in the case of GitLab.com, to about twice the concurrency that Sidekiq can take on at any time, so there's always a full queue of new jobs waiting for Sidekiq to pick up. If you lower the capacity to the exact concurrency Sidekiq can handle at a time, then once a new spot opens up in Sidekiq we first have to wait for the UpdateAllMirrorsWorker, which you'll see in a bit, to come alive again, detect that there's another gap in capacity, schedule another job, and only then can it be picked up by Sidekiq, which means we lose precious seconds during which we could be updating another mirror. So the capacity is used to always keep that queue filled up enough that there's always something new for Sidekiq to do, instead of having to wait for a Sidekiq slot to open again before we schedule something. But you'll see the code in a bit; I know Tiago is going to cover it.

Yeah. So now let's talk a bit about state transitions, and by the way, I didn't say this at the beginning, but feel free to interrupt me at any moment if you have any questions, because I'm not currently checking the chat.

The way mirroring works is divided into five states: none, scheduled, started, finished, and failed, and each of them is responsible for performing a specific task. For instance, the scheduled state is responsible for actually scheduling the worker that will perform the update; the started state marks in the database the point where the update job actually started; the finished state marks the time when the mirror successfully finished and sets the time when it will get updated again; and the failed state does the same, but instead marks the time when the mirror finished unsuccessfully. This is also really useful information, as we'll see later on, to find mirrors that are in states inconsistent with the overall system, for example a mirror that is in the started state but doesn't have a Sidekiq job running attached to it.

There's also the state management. The mirroring system has three focal points that track the progress of a mirror: the database, Sidekiq, and the Redis capacity set. The database holds all the information such as the status, the job IDs, and the timestamps at which the mirror started and finished. Sidekiq provides all the information about the mirroring queue, and based on that we are able to know whether a specific job has finished or not. And the Redis capacity set, as I mentioned previously, tells us whether a project is waiting to be updated or is already being updated. Of course, the objective is to make the system as self-healing as possible, but there are some scenarios where the system doesn't recover yet and we have to perform an operation by hand; we are planning on improving that soon.
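To give you a feel for those five states, here's a simplified sketch using the state_machines gem, which is what the codebase uses for this; the events and callbacks are condensed, so don't read it as the real model:

```ruby
# Gemfile: gem "state_machines-activerecord"
class ProjectImportState < ApplicationRecord
  self.table_name = "project_mirror_data"

  # Simplified sketch of the five mirroring states and their transitions.
  state_machine :status, initial: :none do
    event :schedule do
      transition [:none, :finished, :failed] => :scheduled
    end

    event :start do
      transition scheduled: :started
    end

    event :finish do
      transition started: :finished
    end

    event :fail_op do
      transition [:scheduled, :started] => :failed
    end

    # Scheduling kicks off the actual update job.
    after_transition [:none, :finished, :failed] => :scheduled do |state, _transition|
      state.add_import_job
    end
  end

  # Hypothetical helper: enqueue the Sidekiq job and remember its jid,
  # so we can later check whether the job is still running.
  def add_import_job
    update(jid: RepositoryUpdateMirrorWorker.perform_async(project_id))
  end
end
```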
For example, if a project in the database says that it has started, we can take the job ID saved in the database and match it with the Sidekiq mirroring queue to see whether it's finished already. If we see that the job has finished already and the project is still in the started state, it means that Sidekiq basically finished the job but wasn't able to communicate with the database in order to transition the state to either finished or failed, and this is the kind of case that we pick up and handle, as we will see later on.

Another thing is knowing when a mirror should get scheduled again. For this we developed a formula that takes into account the time it took to update the mirror the last time, and the number of retries that were necessary to actually get a successful update out of it. Going into more detail on the formula: we currently have this backoff period and the jitter; this is a bit of legacy that's currently in our codebase, but the objective was to spread the load a bit more. As you see further down, there are the minimum delay and the maximum delay, which bound the timestamps. The main thing I want to focus on is the retry count, which is a huge factor in when we are actually going to update the mirror next. Usually a mirror finishes its update in two seconds, it's really fast, and in that case we are going to have a 30 minute delay, which is the minimum delay that we can have. And as I mentioned, we want to penalize mirrors that fail often, so they don't run as frequently as the healthy mirrors; we always want to prioritize healthy mirrors. If a mirror for any reason reaches the maximum number of retries, we put it into a specific state called "hard failed", where it will no longer get updated; instead we ask the user to actually solve the problem by hand.
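Here's an illustrative Ruby sketch of that calculation; the constants and the exact shape of the base delay are simplified, so treat the numbers as assumptions rather than the production values:

```ruby
MIN_DELAY = 30 * 60        # 30 minutes, the floor mentioned above (in seconds)
MAX_DELAY = 5 * 60 * 60    # illustrative ceiling
BACKOFF_PERIOD = 24        # legacy backoff, in seconds
JITTER = 6                 # legacy jitter, in seconds

# last_update_duration: how long the previous update took, in seconds.
# retry_count: consecutive failed attempts so far (0 for a healthy mirror).
def next_execution_delay(last_update_duration, retry_count)
  # Spread the load a bit, and scale with how heavy the last update was.
  base_delay = BACKOFF_PERIOD + rand(JITTER) + last_update_duration

  # Penalize failing mirrors: each retry pushes the next attempt further out.
  retry_factor = [1, retry_count].max
  delay = base_delay * retry_factor

  # Bound the result between the minimum and maximum delays.
  delay.clamp(MIN_DELAY, MAX_DELAY)
end

# A healthy mirror that updated in 2 seconds lands on the 30-minute floor:
next_execution_delay(2, 0) # => 1800
```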
So this is a basic rundown of the workflow. The scheduler wakes up every minute, it's a cron job, and it picks up all the mirrors whose next execution is due for an update, that is, with a timestamp earlier than the current time. It schedules mirrors until there is no more capacity available at the moment. After each mirror update starts, it fetches all the changes from the provided remote URL and updates the respective branches and tags with the new information. After updating the mirror we always remove that project from the capacity set and set the next execution timestamp. Then there are two outcomes, finish or fail: if it finishes, it clears the retry counter, basically sets it to 0 again; if it fails, the counter gets incremented, and if it surpasses a certain limit, as I said previously, the mirror gets hard failed.

This is the basic architecture of both the state transitions and the workers and how they communicate. You can see that we always start in the none state, transition into scheduled, then started, and then the mirror can either fail or finish, and from there it will eventually go back into the scheduled state again. In the lifecycle, the UpdateAllMirrorsWorker gets rescheduled every minute to wake up; it only communicates with the projects database, requests the mirrors that are ready to be updated, and schedules a ProjectImportScheduleWorker for every single mirror. Then that schedule worker, as we will see in the codebase, schedules a RepositoryUpdateMirrorWorker, which is the only purpose of that job, and the update mirror worker actually performs the work of updating the branches and the tags and all that. So far, are there any questions? I'm not able to see the chat for some reason. Are there any questions? So let's move on and do a quick code guide. Oh, OK, now I can see the chat. So let's do a quick code guide.

I have all the files that are important for the whole mirroring system open in our Web IDE. The first one I want to touch on is the ProjectImportState model. The first thing you can see is that the table name is actually project_mirror_data, and this is a bit of legacy: previously this table only served mirrors, but we have since migrated imports and forks to this logic as well, and every project has a project import state. The reason we decided to separate the mirror, import, and fork logic into its own model was that the projects table was getting too big, so reads and writes were getting a bit too expensive, and we could make this a bit more lightweight as well.

The next thing I want to show you in this file is this transition here: the transition from the finished or failed state to scheduled will actually perform this method called add_import_job, which we can see here. In add_import_job you can see that if it's an import and the repository doesn't exist, it will actually schedule an import or a fork, but otherwise it schedules a repository update, which is the actual job that performs the update operation. We can also see the CE part: the super call basically invokes this portion here, which handles either an import or a fork. This returns a job ID, and we update the current project with that job ID; we make sure that when the job finishes it gets cleared, and otherwise, if the project is in a started state, we use this job ID to check that the job is still performing well.

Moving on through ProjectImportState: in this portion, as I mentioned, you can see that we decrement the capacity when we transition into a failed or finished state, and we increment the capacity when we go into the scheduled or started state. Going from started to failed or finished, you only see two differences: we increment the retry count on failure, as opposed to resetting the retry count when we finish, and we set the next execution timestamp; on failure we only set last_update_at, while when we actually finish we also set last_successful_update_at to mark the success.

Here I would also like to show you the method that sets the next execution timestamp; it's also in the slides. You can see that we use this method called base_delay, which implements the formula I showed you, basically using the current timestamp minus the time the last update started at, in other words the duration of the last update. On the Project model there is also a scope, which is really useful for debugging as well, but it's also used for scheduling: it returns the mirrors that are ready to synchronize, and it does this by taking a timestamp, we simply provide the time we are at now, and fetching the projects whose next execution timestamp is earlier than that.

Another module that is really important, it's really the core of the entire feature, is the Gitlab::Mirror module. It has the logic for scheduling cron jobs, managing the capacity, checking the available capacity, and so on. You can see here that we configure the cron job of the scheduling worker, called UpdateAllMirrorsWorker, and it runs every minute. We can also see the logic for incrementing and decrementing the capacity, which checks the available capacity right now on the Redis set.
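Before we look at the actual workers, here's a condensed sketch of how the scope and the cron-scheduled worker fit together; the names mirror the real ones, but the bodies are simplified, so treat it as a sketch rather than the production code:

```ruby
# app/models/project.rb (sketch)
class Project < ApplicationRecord
  # Mirrors whose next execution time has already passed.
  scope :mirrors_to_sync, ->(freeze_at) {
    joins(:import_state)
      .where(mirror: true)
      .where("project_mirror_data.next_execution_timestamp <= ?", freeze_at)
  }
end

# Cron-scheduled every minute: grab due mirrors and fill the capacity.
class UpdateAllMirrorsWorker
  include Sidekiq::Worker

  BATCH_SIZE = 500 # hypothetical batch size

  def perform
    capacity = MirrorCapacity.available # see the Redis sketch earlier
    return if capacity <= 0

    Project.mirrors_to_sync(Time.current)
           .find_in_batches(batch_size: BATCH_SIZE) do |batch|
      # Take only as many projects as the remaining capacity allows.
      batch.first(capacity).each do |project|
        ProjectImportScheduleWorker.perform_async(project.id)
        capacity -= 1
      end

      break if capacity <= 0
    end
  end
end
```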
So now let's move on to the workers, how things get scheduled, and get a better notion of how the state transitions work. In UpdateAllMirrorsWorker, the only task is to grab the projects that are ready for an update, schedule them, and fill the capacity with them. You can see here that we save the available capacity into a variable, we fetch the projects that are ready to be synchronized here, and we do it in batches, taking as many projects as the remaining capacity allows. At the end we just perform a bulk schedule of ProjectImportScheduleWorker for those projects.

After that, we transition into the ProjectImportScheduleWorker, which is a really simple worker: its only task is to transition the project into the scheduled state, which in turn is responsible for scheduling the RepositoryUpdateMirrorWorker, the one that actually performs the work, which we are going to see right now. The RepositoryUpdateMirrorWorker basically transitions the project from the scheduled to the started state and then calls the UpdateMirrorService, which is the one that fetches the remote and updates the branches and tags from the external repository. At the end it will finish or fail, and we handle those in these methods: we either transition to the finished state, or mark as failed, and this method is actually responsible for failing the mirror; it sets the error message and so on.

The UpdateMirrorService, the one called by the RepositoryUpdateMirrorWorker, is a really basic service: it basically just fetches the remote and then updates the tags and the branches. There's a bit more logic because of the features we mentioned, only mirroring protected branches and overwriting diverged branches, so you can see the logic here: for "only mirror protected branches", whether you should skip the branch or not, and the way we handle diverged branches because of that feature as well.

The other really important worker that we have is also a cron job, called StuckImportJobsWorker. It runs every 15 minutes; don't get misled by the 15 hours you see here, that is just the timeout mark for all the jobs this worker watches over, so you can be sure that every mirror job still attempting to update past 15 hours gets timed out. We are planning on decreasing this number, by the way, but right now that's what we have. One thing you would want to know: this portion here, which handles projects without a job ID, covers a situation that is no longer possible, and I'm working on removing it from the codebase, so this method is no longer used. But this portion here is actually really important: we grab all the projects that are in the scheduled or started state and check their job IDs, and with those job IDs we use this method called completed_jids. This is Sidekiq-based internal tooling that keeps track of when a job has finished, and we can use it to check whether the jobs for specific mirrors have already ended. If they ended and we are still in a scheduled or started state, we should mark those projects as failed here, and that's basically it.
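Here's a condensed sketch of that stuck-job detection; Gitlab::SidekiqStatus.completed_jids is the internal tooling I mentioned, while the query and the fail call are simplified:

```ruby
# Cron-scheduled sketch in the spirit of StuckImportJobsWorker.
class StuckMirrorJobsWorker
  include Sidekiq::Worker

  def perform
    # Mirrors that claim to be in flight, together with the jid they stored.
    states = ProjectImportState.where(status: %w[scheduled started])
                               .where.not(jid: nil)

    # Gitlab::SidekiqStatus knows which of these jobs already finished.
    completed = Gitlab::SidekiqStatus.completed_jids(states.pluck(:jid))

    # Any record still scheduled/started whose job is done never reported
    # back to the database, so we fail it to unblock the mirror.
    states.where(jid: completed).find_each do |state|
      state.fail_op # transitions to failed, frees capacity, bumps retry info
    end
  end
end
```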
So yeah, back to the slide presentation. Are there any questions? I know this is a lot of information to take in, so please bear with me.

Let's start diving a bit more into the troubleshooting. Here is an example of an unhealthy mirroring system. Looking at, for instance, the project mirror updates graph over here and the "mirror updates at least 10 seconds late" graph over here, you can see this huge mountain of projects, around 15,000 projects, that are not getting updated at the right time. Coincidentally, the ProjectImportScheduleWorker and the RepositoryUpdateMirrorWorker are not performing any work during those times, or at least very minimal work. In this specific case, what happened was that the StuckImportJobsWorker was itself timing out, so we weren't able to actually mark projects as failed. That's also why, after the situation got resolved, you see this huge spike in the project mirror updates over here while the scheduled/started capacity stays the same: we were not able to mark those failed projects as failed, once again because of the StuckImportJobsWorker. There's a link as well if you want to look at these graphs in a bit more detail; they are really interesting.

For troubleshooting, what I usually do is always refer back to the ProjectImportState model, the project_mirror_data table, and check the last error message, the retry count, the job ID, and the last update timestamps. It is really useful to graph that information in a spreadsheet or something and find patterns in it; that was really useful when I was developing this feature. You can also check the available capacity by using the Gitlab::Mirror module; it has a lot of functionality that is useful, and you can take a lot of inspiration from that module for debugging purposes. You can combine this with Project.mirrors_to_sync(Time.now) to see whether there actually are projects to fill the capacity while for some reason we are not filling it, which might indicate that the scheduler is not working correctly, something that happened in the early days of this feature.

You can also check the status of the workers for each mirror in the scheduled or started state. As I mentioned, we hold the job ID in the database for each project in a started state, and with those job IDs you can use Gitlab::SidekiqStatus.job_status, pass it the job IDs, and it will basically tell you whether they are completed or not. You can also retrieve the project IDs that are currently in the Redis set by using the SMEMBERS command; it's useful to detect projects that could be stuck. For instance, a finished or a failed project should never be in the capacity set, it should always have been removed, so if for some reason a project transitioned into a failed or finished state and is still there, it might indicate that the service is not communicating well with Redis.

Now, on to clearing data inconsistencies. The whole point of us having this distributed model of knowledge, let's say, is that we want to make the system self-healing and keep this data as synchronized as possible. For instance, when the database is inconsistent with Sidekiq, say a project is in the started state in the database but Sidekiq already finished the job, we can schedule the stuck import jobs worker, which will look at the job IDs maintained by Gitlab::SidekiqStatus; usually this happens when the actual update worker times out for some reason. You can also have the capacity set being inconsistent with the database and Sidekiq, for instance if the project is finished but the project ID is still present in the capacity set, which was the scenario I mentioned previously. Right now there's no real self-healing way to solve this; it has to be fixed by hand, by removing the project ID from the capacity set. You can do that by running decrement_capacity and feeding it the project ID, or, if it's something that is occupying the entire capacity, it might be worth just deleting the whole set and starting from scratch. That, though, should be done with precautions, because if you remove all the project IDs from the capacity set, you are basically telling Sidekiq that it can schedule more jobs, possibly overloading it unnecessarily.
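In a Rails console, the checks and the by-hand cleanup I just described look roughly like this; the Gitlab::Mirror and Gitlab::SidekiqStatus calls exist in the codebase, but the capacity key name here is an assumption:

```ruby
# Rails console troubleshooting sketch for pull mirroring.

# 1. Is there room in the capacity, and are there mirrors waiting?
Gitlab::Mirror.available_capacity            # free slots right now
Project.mirrors_to_sync(Time.now).count      # mirrors that are due
# A large second number while capacity stays free suggests the scheduler
# is not working correctly.

# 2. Check whether the jobs for scheduled/started mirrors are still running.
jids = ProjectImportState.where(status: %w[scheduled started]).pluck(:jid)
Gitlab::SidekiqStatus.job_status(jids)       # booleans, true = still running

# 3. Inspect which project IDs currently occupy the capacity set.
Gitlab::Redis::SharedState.with do |redis|
  redis.smembers("MIRROR_PULL_CAPACITY")     # hypothetical key name
end

# 4. By-hand cleanup: remove a finished/failed project left behind.
project_id = 42                              # the stuck project
Gitlab::Mirror.decrement_capacity(project_id)

# Or, with precautions, clear the whole set and start from scratch:
Gitlab::Redis::SharedState.with do |redis|
  redis.del("MIRROR_PULL_CAPACITY")
end
```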
So now let me show you what a healthy mirroring system looks like, and you can see that it's a huge difference. You can see that the ProjectImportScheduleWorker and the RepositoryUpdateMirrorWorker are consistently performing jobs, and the scheduled/started capacity isn't stale, so it's not stuck with projects. The two most important graphs that we have, the project mirror updates overview and the "mirror updates at least 10 seconds late", are really healthy: the number of projects whose update starts more than 10 seconds late, relative to the actual scheduled time, is minimal, really minimal.

These are also some really useful resources that I use all the time: the documentation for the whole pull mirroring system; the state machine gem's documentation, which is also really useful because the whole system relies on the state machine working properly; some guidelines for troubleshooting pull mirroring made by the infrastructure team, and I'm planning on doing a couple of updates on those guidelines; and there's also an issue about dynamically determining the mirror update interval based on the total number of mirrors, because we want to get rid of the lower bound and the upper bound on scheduling times and instead make them dynamic, based on the information that the system has. There are, again, the pull mirroring graphs, which have a lot of information. And please, if you have any questions about the mirroring system, feel free to ask in the team channel at any time.

Are there any questions?

"I think you said that the mirroring by default runs every 30 minutes. Is that a value that can be adjusted at all, or is that set at 30?"
I don't think so; the minimum delay right now is set to 30 minutes. That's actually a really good question. I don't know if we should change it, because, as I mentioned, we are trying to make it dynamic, getting rid of the minimum delay and the maximum delay entirely. I'm definitely going to discuss this.

Yeah, right now the minimum delay is not configurable, but like Tiago said, the intention is, and pretty much always has been, to have that automatically adjust itself to the actual Sidekiq capacity and the actual load, so that it never tries to update more than it can actually handle. That also means that if the amount of mirrors grows but you never adjust your Sidekiq capacity, the time between two updates per mirror would automatically go up, so as not to overload the system too much. But we didn't have time to implement that: at some point we needed to fix a dire situation, because we had far more mirrors than we were expecting and we were not hitting any kind of reasonable update frequency on those, so we implemented that hard-coded 30-minute limit, and we still haven't made or found the time to switch to the dynamic approach we had originally wanted. So right now it's hard-coded to 30 minutes, but on GitLab.com 30 minutes is pretty much the ideal value, so even if we implemented the dynamic approach it would land on that 30-minute frequency. On a smaller instance, though, the update rate could be much higher. You would still want some kind of lower bound there, so that you don't try to update the repo every five seconds just because it happens to be the only mirror in the system and Sidekiq can handle it; of course that's not somewhere you want to go either. But the idea is not for this to ever be configurable, as in "just pick a number and GitLab will try", but more for it to be "GitLab will do its best within the resources you've given it", because ultimately GitLab has the information available to figure out what that number needs to be so that it doesn't overload the system but still uses the capacity and the resources to the fullest possible extent.

And we also make the update now functionality available precisely because we knew that there was going to be a delay, and if for any reason a customer actually wants to update a bit sooner, they can use that update now button. Tiago mentioned earlier that someone can choose to host their project somewhere else but then use GitLab just for CI/CD, and we have an API endpoint that effectively does the same as what the update now button does. So you can call an API endpoint that will immediately trigger an update, which, like Tiago mentioned earlier, effectively sets the next execution time, which means that the next pass of the UpdateAllMirrorsWorker will pick it up to be updated immediately. And when you set your GitLab project up to be a CI/CD project for a project on GitHub, for example, we use a GitHub webhook, which is called every time someone pushes to GitHub, to automatically trigger the update now endpoint on the GitLab side of things, so the mirror syncs immediately and the pipeline runs immediately when someone pushes to GitHub. We are actually smart enough to also report that pipeline status back to GitHub. So that's really a way of using GitLab CI just like you would any other standard CI solution, like Circle or Travis, but we've built it on top of this mirroring functionality, with the update now feature hooked up to that GitHub webhook.
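For reference, that endpoint is the EE pull mirror trigger, POST /projects/:id/mirror/pull. Here's a minimal sketch of calling it, assuming project ID 42 and a token in GITLAB_TOKEN:

```ruby
require "net/http"
require "uri"

# Hypothetical host and project ID; the path is the EE pull mirror trigger.
uri = URI("https://gitlab.example.com/api/v4/projects/42/mirror/pull")

request = Net::HTTP::Post.new(uri)
request["PRIVATE-TOKEN"] = ENV.fetch("GITLAB_TOKEN")

response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
  http.request(request)
end

# A successful call simply bumps next_execution_timestamp, so the scheduler
# picks this mirror up on its next one-minute tick.
puts response.code
```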
"That was very helpful, thank you. And can you review again how you determine when a mirror is hung? Can you just go through that again?"

Sure. The most common case when a mirror is hung is that, for some reason, the update worker... let me go back to the architecture, it might help. In this diagram that I showed earlier, you can see that the repository update is the actual job responsible for updating the project with the information we pulled from the external repository, and it can happen that there is a timeout, or for some reason the job crashes, and we don't actually get to transition the project into a finished or failed state. So the project stays in the started state, and it still holds that job ID, but the job already finished. What we do to catch these scenarios is the StuckImportJobsWorker, which I showed earlier: it basically catches all the projects that are in the started or scheduled state and grabs their job IDs. So, let me show you again, we grab the job IDs for those projects and we call, OK, it's here, we find the job IDs that are already completed, and if a job is completed it means we have a project that should be marked as failed, because it shouldn't be in started anymore: the job finished, so at some point the communication with the database to transition the project into the next state didn't happen. Does that make sense?

"OK, that helped, thank you."

Again, if you have any other questions you can drop by the team channel at any time; I can help whenever necessary.

"OK, I appreciate that. I may have some clients that will have questions like that, so I appreciate that, thank you."

So, are there any other questions? OK, if there are no other questions, I want to thank you for listening to this talk, and I wish you all a great day. Bye, thank you.