Cool. Thanks, Ray. So welcome to the May 13th edition of our GitLab Runner open office hours. The intent of this call is really to talk about anything to do with the runner that's on anyone's mind. This is obviously perfectly timed and completely intentional with the current hackathon that's going on for GitLab. So if anyone's got an MR they've started, or is thinking about making a contribution to the runner, we'd be more than happy to talk about it, give some direction, give some feedback; we can do that live. And barring that, what we'll do is just talk about some things that are probably helpful to anyone going through that process, or considering doing that. The call is going to be recorded and then we'll share it to YouTube once the video processing goes through — it's not actually being live streamed. So if any of this is helpful, someone can come back and look at it again later. Anyone on the call, please feel free to jump in and unmute at any time and just ask a question. We'll probably pause and check in as we go, to try and pull some questions out if anyone has any.

Yeah, and I think today the plan to start off is to get Tomasz, who is one of our senior engineers on the team, to give — not actually a review, but a bit of a walkthrough of the lifecycle of a CI job as the runner sees it. Any questions, or anything else we missed going over? I don't actually have a script for this, because I'd probably do it a little bit differently every time. Nothing? OK. Tomasz, do you want to get started?

Sure. OK, so let me share my screen. We will do a little walkthrough through the sources of GitLab Runner and see how a CI job is handled by the runner, because after all, this is the main purpose of the runner: to execute all of our jobs. So the runner, as a process, when it's started — we know that it may have different executors that execute the job in different ways using different technologies, but there is a common part for any kind of runner deployment. This is the main process that goes through all the [[runners]] sections from the config.toml file and asks for jobs for each configured GitLab connection. And then, if a job is received, we start processing it.

So our main entry point is the function that we see here. It's the method that is started when you execute the gitlab-runner run command, or when you start the gitlab-runner process through a service manager, because the system service integration that we have under the hood also uses the run command. So after doing some initialization — starting the metrics server if it's configured, starting up the session server, and a few other things — we, at this line, start a goroutine. Let's go there. This is the main goroutine that triggers job requests. This is where the check_interval from the config.toml file is used and interpreted. So what's most important here is that we check how many [[runners]] sections — I mean these ones from the config.toml file — we have defined. I think every one minute the runner checks whether the config file was changed; if it was changed, we reload it, and then we start tracking the new entries, or we stop tracking the old ones that were removed. Anyway, at this moment we know that we have several runners that we want to proceed with. Let's say that in our example case, we have one runner process that was registered three times, each time in a separate project on our GitLab installation.
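To make that config reload idea concrete, here is a minimal sketch of the pattern in Go — the names and structure here are hypothetical, not the actual runner code — where the file's modification time is checked on an interval and the [[runners]] entries are reloaded only when it changes:

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// watchConfig periodically checks the config file's modification time and
// calls reload only when it changed, keeping the previous configuration on
// errors. This mirrors the "reload config.toml roughly every minute" idea.
func watchConfig(path string, interval time.Duration, reload func() error) {
	var lastModified time.Time
	for range time.Tick(interval) {
		info, err := os.Stat(path)
		if err != nil {
			continue // keep the current configuration if the file is unreadable
		}
		if info.ModTime().After(lastModified) {
			if err := reload(); err != nil {
				fmt.Println("config reload failed:", err)
				continue
			}
			lastModified = info.ModTime()
		}
	}
}

func main() {
	watchConfig("config.toml", time.Minute, func() error {
		fmt.Println("reloading [[runners]] entries from config.toml")
		return nil
	})
}
```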
So we have three project-level specific runners. With this function, we will trigger the request for each of these registered runners every check_interval seconds. The interesting part, which may be surprising to some people, is that the interval between starting requests for a specific [[runners]] entry is equal to check_interval. However, if we have more than one [[runners]] entry defined — if we have more than one runner registered in one config.toml file — then the overall number of requests will be bigger than one might think. With the three example runners I mentioned: if we define check_interval as nine seconds, then between two subsequent requests for project A — the runner that is registered for project A — we will get this nine-second waiting time. However, we will generate three different requests in those nine seconds, each one after three seconds, because this is what's computed here. And then, knowing what the sleeping time is — the pause between generating requests — we use the method named feedRunner to go forward.

So let's go and quickly see. There is nothing special happening here, except that we check if the runner is healthy. By "the runner is healthy" we mean that if we were receiving errors on network communication for a few subsequent requests, then we mark the runner as unhealthy, and we stop asking that GitLab instance for new jobs for some time. I don't remember now how long that is; we would need to dig in one of the files. However, this is the place where we check the healthy status of a specific runner, and we go forward only if the runner is marked as healthy. And this is a little Go magic about how you can pass data between different asynchronous goroutines. However, what we are interested in — how this ends — is the method named processRunners. This is a goroutine; we can see that it's started a little above. This is a goroutine that waits for a specific runner entry to be triggered to go forward. Not focusing much on what's happening here at this moment, let's go to the processRunner method.

And here the magic begins. First, we check on the executor provider whether we are even able to do anything. We will not focus on this right now; maybe at the end there will be a little time to explain the difference of acquire and release between, let's say, the docker executor and the shell executor. However, this is a quite important step. This is something that powers, for example, the GitLab.com shared runners that are using the docker+machine executor. If we get a positive response from the executor provider — that we have some capacity to execute jobs — we then call another acquire method, named acquire build, which checks us against the configured limit. So we have the global-level concurrent setting, which defines the maximum number of concurrent jobs that will be handled by the runner, no matter which [[runners]] entry they come from. But then for each of the entries we can define a specific limit, and this is the place where that is checked. So we check whether the specific runner entry that we are now trying to handle is allowed to execute any new job. Let's say that we can go ahead; we are still within the limit. Create session — this is a helper method that we use to start the debug session server if we have it configured. This is not in the scope of what we will be talking about today. And here it is: request the job. So let's dive into it.
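A rough sketch of that scheduling, assuming a hypothetical feedRunners helper (the real logic lives in the runner's multi-runner command, so treat the names here as illustrative): with check_interval set to 9 seconds and three [[runners]] entries, each entry is fed every 9 seconds, but a request goes out every 3 seconds overall.

```go
package main

import (
	"fmt"
	"time"
)

// feedRunners is a hypothetical stand-in for the runner's scheduling loop:
// it cycles over the [[runners]] entries and pauses check_interval divided by
// the number of entries between feeds, so each entry is still served every
// check_interval seconds overall.
func feedRunners(entries []string, checkInterval time.Duration, feed chan<- string) {
	pause := checkInterval / time.Duration(len(entries)) // 9s / 3 entries = 3s
	for {
		for _, entry := range entries {
			feed <- entry
			time.Sleep(pause)
		}
	}
}

func main() {
	feed := make(chan string)
	go feedRunners([]string{"project-a", "project-b", "project-c"}, 9*time.Second, feed)
	for entry := range feed {
		fmt.Printf("%s: requesting a job (each entry every 9s, one request every 3s overall)\n", entry)
	}
}
```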
And this method is the one that finally triggers the HTTP request to the GitLab API to get the job. So we've already passed several checks, and we still don't even know if there is a job waiting for us. First, we check yet another limit — the request_concurrency setting, I think, on the [[runners]] section — which is checked at this place. If we are still within the limit, if we are allowed to generate another request to the GitLab API, we call the network request job method. And this is finally the moment when we talk with GitLab. And then the magic happens on the GitLab side. GitLab checks what kind of runner is asking for a job, whether the runner is even authorized for this, whether the token matches. GitLab does the tag matching at this moment: it sees which runner asks for a job, it has a list of jobs that could be available for such a runner, and then it does all that magic, like checking the tags, checking the CI minutes, for example, if this is a shared runner on gitlab.com. And hopefully, at this moment, we get the job payload with all of the information about the job that we are interested in.

Speaking of which, we can go to the job response definition. If someone is interested in how the job request API payload is defined, then the common/network.go file and this structure is the starting point. This structure, and all of the structures that it embeds and composes, defines the full information about the job that we get. The runner is not aware of anything outside of what it gets here. The runner doesn't work in a pipeline context. The runner doesn't work in a project context. It doesn't know anything about the specific pipeline settings that you may have set in the .gitlab-ci.yml file. The runner cares only about the specific payload of the specific job that was received. So if something is here, we can handle it in some way; if something is not here, then we will not know about it.

Let's get back to our request job. So at this moment, we have the full information about the job. The next important thing that happens here is the network process job call. This is something that finally starts the trace handling. So this is, again, another goroutine that works asynchronously and handles updates of the job status and the job trace. We will get back to this place in a few moments. So we requested the job. And here, in this defer function, we can see the final thing that will happen with the job — checking whether we even have any errors that happened. This will be the final step of our job's lifetime, so let's keep it in mind for a moment. And here are some things that we do in the background. Having the job data payload, we create a common build object and we assign a few things to it; this is a place where we update some Prometheus metrics that you can export from the runner. And finally, we can call build run, which starts the job processing.

In build run, the most important call is, in fact, here. Because all of this is still preparation — setting some contexts, defining the build logger, which is the way we, let's say, multiplex the log messages so that they can be saved in the runner process log but also sent to the job trace. But at this moment we have the executor that we will use to handle the job — we should have it ready — and we can call the run method to proceed with the execution. And this place here is where the execution starts.
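As a side note on the job payload mentioned above: the authoritative definition lives in common/network.go, but an abbreviated, illustrative sketch of the kind of structure you'll find there might look like this (field names are simplified, not the exact ones):

```go
package main

import "fmt"

// JobVariable, Step, and JobResponse sketch the shape of the payload the
// runner receives; the real structs carry many more fields (image, services,
// artifacts, cache, dependencies, credentials, and so on).
type JobVariable struct {
	Key    string `json:"key"`
	Value  string `json:"value"`
	Masked bool   `json:"masked"`
}

type Step struct {
	Name   string   `json:"name"`
	Script []string `json:"script"`
}

type JobResponse struct {
	ID        int64         `json:"id"`
	Token     string        `json:"token"`
	Variables []JobVariable `json:"variables"`
	Steps     []Step        `json:"steps"`
	// ... artifacts, cache, image, services, and the rest of the payload.
}

func main() {
	job := JobResponse{
		ID:    42,
		Steps: []Step{{Name: "script", Script: []string{"make test"}}},
	}
	// Everything the runner knows about the job is inside this payload;
	// there is no pipeline- or project-level context beyond it.
	fmt.Printf("job %d with %d step(s)\n", job.ID, len(job.Steps))
}
```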
So after all of the preparation in the previous methods and here, we start the job execution in a separate goroutine. And then this is the place where we wait for the job to be finished. As you can see here, there are three possible ways the job execution can be interrupted. Maybe let's go in reverse order. The last one, but the one that is most important for all of us, is that the job finished. It could be finished with a failure, it could be finished with a positive result, but in any case, here we will get either an error or a nil representing that the job finished successfully. This can be immediately processed, and we can start getting back to finalize the job processing. The second way is a signal received on the system interrupt structure field. This is something that is propagated from multi.go — for example, when you start the runner in the foreground and you hit Ctrl+C, or when you send, I'm sorry, not the kill signal, the SIGTERM or SIGQUIT signal several times. At some moment the runner decides that an interrupt was requested and it starts propagating the signal down the goroutine stack, and this is one of those places. So when we receive this interrupt signal from the user, we just stop the job — and then there is another place where we send the information to the job script execution itself to be interrupted and stopped in whatever context it is executing. And the last thing is the context — the context that we pass here to the run method. At the moment when the context is done, we also finish job processing and then we try to interrupt the job. If we exit from this select in one of those two situations, then we send the cancel to the job, and we wait for the job to finish so we can finally handle all of the remaining steps.

However, we said that we start the job, but we didn't see how it is started. So let's go here. This method describes in what steps — as we call them — the job is executed. And what can we see here? We can see that we have prepare, get sources — this is the place where we either call git clone or git fetch and all of the things that happen around that; this is where the Git LFS commands are executed; this is where the submodules are handled. And since — I don't remember if it was 12.7 or 12.8 — when we introduced the step descriptions in the job trace, each of these steps is described in the job log. It might have been 12.9, or 12.8, or 12.5 — I don't remember exactly. Anyway, a little above you can see how each of these constants is mapped to the text that you can see in the job output right now. So everything that happens after getting sources from the Git repository, and before the next line that represents a step, is happening through this. Restore cache is the place where we try to restore the cache, either the local one or downloaded from a remote cache like GCS or S3. Download artifacts is the moment where we use the job payload to download all artifacts that were defined for this job to be downloaded. And as you can see from the construction of the error checking here, a failure of any of these steps — except one that I will point out in a moment — is something that stops the job processing. So if we have an error on the prepare step, then all of this will never happen, and we finally go back with the error taken from the prepare step.
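Going back to the wait at the start of this part, a minimal sketch of those three exit paths — job finished, user interrupt, context done — could look roughly like this (an illustrative shape with hypothetical names, not the real code):

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// waitForJob mirrors the idea described above: the job ends because it
// finished on its own, because the user interrupted the runner, or because
// the surrounding context was cancelled (for example a timeout).
func waitForJob(ctx context.Context, finished <-chan error, interrupted <-chan struct{}, cancelJob func()) error {
	select {
	case err := <-finished:
		return err // nil on success, otherwise the job failure
	case <-interrupted:
		cancelJob() // propagate the interrupt down to the executing script
		return errors.New("aborted: interrupt signal received")
	case <-ctx.Done():
		cancelJob()
		return ctx.Err()
	}
}

func main() {
	finished := make(chan error, 1)
	go func() {
		time.Sleep(100 * time.Millisecond)
		finished <- nil // pretend the script succeeded
	}()
	err := waitForJob(context.Background(), finished, make(chan struct{}), func() {})
	fmt.Println("job result:", err)
}
```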
This is the place that users are mostly interested in, because this is where before_script and script are executed. The important thing to know is that, while before_script and script are two separate entries in the .gitlab-ci.yml definition — and before_script can also be set at the global level, so we could say there are three different places in the .gitlab-ci.yml where these scripts are defined — in the job execution they are simply concatenated and executed together. So before_script and script share the same shell execution context. Anything that was prepared or exported in the before_script, anything that relies on the shell context, will also be available in the script.

And here we can see the after_script execution. This is one of the things that we know is a little confusing for users — like, why, if I prepare an SSH agent in my before_script, can't I access it in the after_script? This is because after_script is executed in a totally separate context. Why? There are two reasons. First, after_script was introduced as a way to do some cleanup no matter whether the main build script failed or not. And we can't put them in one shell execution script, because we fail the script immediately on the first detected command failure. So if something failed in the before_script, or in the things defined in script, we would never reach the parts defined by the after_script. The second reason is that if after_script fails, we don't want it to affect the final job result. You can see that this is the only execute stage call where we don't care about the error — in fact, to make it explicit, we should probably write it like that. We treat this after_script step as something that will always be executed, no matter whether script or before_script failed or not, and we also don't care about its result. So you can use it, you can do any cleanup you want, but it will not affect your job if it fails.

After finishing the script execution, we get back to a few predefined steps. One is cache archiving — so again, saving the cache to a local archive and additionally sending this local archive to a remote cache server if one is configured. Then we have uploading the artifacts, and here we can see that it's either uploaded on success or uploaded on failure; this is where we respect the when setting from the artifacts section. And something added quite recently: upload referees. Referees are a nice feature of the runner that we started experimenting with. This is something that, for example, allows us to request some Prometheus metrics pre-configured in the configuration file and upload them as another type of artifact. It's not popular yet — we are still experimenting with it — but there is a lot of power that we see in this small call.

And then, if we went through all of these executions, we have three possibilities. The first one is that we had an error before calling the artifacts upload; in that case, this is the error that we are mostly interested in. If the job failed, and we tried to upload the artifacts on failure and the artifacts upload also failed, then the job failure is more important for us, so we check it and send it up the call stack first. If we had not seen an error before calling the artifacts upload, then we return whatever the artifacts upload ended with, which may be an error or may be a success. So this is how the steps are defined.
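A tiny sketch of that error precedence, assuming hypothetical names: the error from the main steps wins over the artifact upload error, and only when the steps succeeded does the upload result decide the outcome.

```go
package main

import (
	"errors"
	"fmt"
)

// finishJob applies the precedence described above: a failure from the main
// steps is what we report, and only if those succeeded does the artifact
// upload result (nil or error) decide the outcome.
func finishJob(jobErr, uploadErr error) error {
	if jobErr != nil {
		return jobErr
	}
	return uploadErr
}

func main() {
	fmt.Println(finishJob(errors.New("script failure"), errors.New("artifact upload failed"))) // script failure
	fmt.Println(finishJob(nil, nil))                                                           // <nil>
}
```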
And let's take a quick look at the execute stage — what it does. So, skipping all of the logging, we check what shell is defined for the executor that we use, and we generate the shell script. This shell script contains things like setting up the variable exports for the variables that were defined. It defines the configuration that fails the script execution on the first command failure — which is handled differently for bash and differently for PowerShell, for example. This is where we set the configuration that enables the debug trace output, if the CI debug trace feature is used. And this is, of course, where at the end all of the script lines that the user defined are added. What is important to know is that from each line defined in the before_script, script, and after_script, we in fact generate two lines in the script: the first one is an echo that prints the command, and the second one is the command execution itself.

So after calling this, we have the script. We prepare the executor command structure that will next be used in the proper way by the different executors. We define whether this is a predefined command or not. The user script and the after_script steps are not the predefined ones — this is what the user has control over; everything else is predefined. This is used, for example, in the Docker executor, where we distinguish whether the script should be executed in the image defined by the user or in the helper image that we provide. And we build and execute the build section. This method here — executor run — is what passes the prepared and ready command to the executor to be finally executed in the final environment. What happens there is more magic, probably for another call.

From this place, what I would like to show is the executor and executor provider interfaces. Because we have five, six, seven — I don't remember the number now — we have several executors in our code base. Since version 12.2 or 12.1, we have the custom executor, which allows users to integrate their own execution methods with the runner workflow. And these two interfaces show us the general flow of preparing and using an executor. So here we can see, for example, the acquire that we started the story with — this tells the runner whether the executor is able to execute a job at this moment or not. And here we can see, for example, the run method, which gets the prepared executor command and then does the execution that we really care about. So at this moment, in this place, we have the job that is now running. If you're using the Kubernetes executor, this is the moment when the pods are being created and the job is being handled in the pods. If you're using VirtualBox, this is the moment when the runner starts connecting with VirtualBox, creating the virtual machine, and then trying to connect with it to execute the script in the virtual machine. Whatever this execution returns will then be handled as the job result. So if we get an error here, then — in a moment I will show at which place — we will mark the job as failed: either failed because of the script failure, or because of the job timeout, or failed because of something that went wrong at the runner level. If we don't get an error here, then this is where everyone is happy, because we mark the job as succeeded and we can go further with the pipeline.
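To illustrate the "two generated lines per user line" idea, here is a rough sketch of what such a generator could look like — this is not the runner's shells package, just the shape of the idea for a bash-like shell:

```go
package main

import (
	"fmt"
	"strings"
)

// generateScript turns the user's commands into a script where every command
// is preceded by an echo of that command, so the job log shows what is about
// to run before its output appears.
func generateScript(commands []string) string {
	var b strings.Builder
	b.WriteString("set -e\n") // fail on the first failing command (bash flavour)
	for _, cmd := range commands {
		fmt.Fprintf(&b, "echo %q\n", "$ "+cmd) // print the command to the job log
		b.WriteString(cmd + "\n")              // then actually execute it
	}
	return b.String()
}

func main() {
	fmt.Print(generateScript([]string{"bundle install", "bundle exec rspec"}))
}
```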
But before we go back to the place where we set the final state of the job, let's switch for a moment to the trace.go file, because we were here for a moment and I said we would get back to this. As I said a few minutes ago, the trace handling starts almost immediately after we get the job response from GitLab. And what's happening inside this part of the code is the watch loop — a watch loop that is executed constantly at some interval (I will say in a moment how this interval is defined) until the job is finished. What happens in this loop is the incremental update. As the incremental update, we first send the patch trace request — this sends only the new part of the job output that was received from the job script since the last patch request. And if the patch request succeeded, we send the touch job request to tell GitLab that the job is still working.

Let's start with the touch, because it's a little shorter. What's most important here is this part: we basically send the information that the job with the ID that we hold is in state running. We send it to GitLab, and what GitLab does at this moment is update the updated_at field of the job object. This prevents the mechanisms in GitLab from considering such a job as a stale one that should be cleaned up — that should be failed because something happened and the job is no longer being updated, nor was it finished in the proper way.

What's more interesting is the send patch method, where we of course send the trace patch itself. So again, another API call that just pushes a part of the output and ensures that the ordering is proper and that we don't mess up the output. But here is something that we implemented in GitLab, I think in 12.7, and then started supporting in GitLab Runner in 12.8: the update interval defined by GitLab. When we start the job and we open the job page in the GitLab UI, GitLab detects that the job is being watched, and after each patch trace request it sends us back how long the interval should be before we send another request. If we have the job page open, it will currently be every three seconds. If we close the page, or if we never even opened it because the job was started in the background by, let's say, a git push and we never opened the job in the UI, then GitLab will instruct the runner to send this request every 30 seconds. This was a huge improvement that we made a couple of releases ago. Just to show the scale on gitlab.com after we released this change: we updated our fleet of 10 runners and some of the users updated their runners — we had something like 15% of available runners updated — and this resulted in reducing the number of patch trace requests by half, from 40 to 20 million requests per day. So this is something that we were very happy about adding a few releases ago.

And this is what happens constantly: we watch the script execution in the executor, in the proper way for each of the executors, and through the build logger that I was showing earlier it connects to the trace object and pushes these updates constantly to GitLab, so you can see the job output updating more or less live.

Okay, so let's go back to the execution and start wrapping this up a little. So the execution happened through the executor, and we got some result. Going back through all of the calls, let's say that we didn't hit the context-done case, nor did we have the interrupt.
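A minimal sketch of that watch loop, with hypothetical names standing in for the real trace code: patch the new output, touch the job, and let GitLab dictate the next interval (roughly 3 seconds when the job page is being watched, 30 seconds otherwise).

```go
package main

import (
	"fmt"
	"time"
)

// coordinator stands in for the GitLab API client; both methods are stubs.
type coordinator struct{}

// patchTrace would send only the output produced since the last patch and
// return the interval GitLab asks us to wait before the next update.
func (coordinator) patchTrace(newOutput []byte) time.Duration { return 3 * time.Second }

// touchJob tells GitLab the job is still running so it is not treated as stale.
func (coordinator) touchJob() {}

func watchTrace(c coordinator, finished <-chan struct{}) {
	interval := 3 * time.Second
	for {
		select {
		case <-finished:
			return
		case <-time.After(interval):
			interval = c.patchTrace(nil) // GitLab may answer 3s (watched) or 30s (not watched)
			c.touchJob()
			fmt.Println("patched trace, next update in", interval)
		}
	}
}

func main() {
	finished := make(chan struct{})
	go func() { time.Sleep(10 * time.Second); close(finished) }()
	watchTrace(coordinator{}, finished)
}
```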
We finished the job properly. We waited for the job to be finally marked as finished, and then we get back with the error value here. With the error value, we first use it in this place, which is set trace status, and here you can see that if we didn't have an error, we leave the "Job succeeded" line at the very end of the job output and we mark the trace as success. If we had an error and it was a build error — this is used in only a few places — then we mark the job as failed and we use trace fail to propagate this to GitLab properly. If we had an error but it was not a build error but something else, this is what we consider to be a runner internal failure or a system failure. For example, using the shell executor, someone deleted bash from the machine where the runner is running; we tried to execute the script, we couldn't because there is no shell, it failed — so it would most probably be marked as a job failed by a system failure.

Going back for a moment to the job trace file: success internally calls fail with a nil error and an empty failure reason. What fail does is set the failure reason, and we have only three of them on the runner side: the script failure, which is something that is out of our scope because it happened in the script provided by the user; the job execution timeout, when the job script was being executed for too long; and the runner system failure that I was mentioning a moment ago. And with the finish call, the runner tries to do two things. First, send the final trace patch requests: as long as we have something to send — if we are still receiving something from the job output — we will try to send it, until we get an error response from GitLab or until it is finished. And then the final status update, where we try to send the final update. Here we can see that this is the place where we set the state of the job and pass the failure reason that was detected on the runner side, and we loop through this until we get one of these states; if there is some internal failure, then we will try to repeat.
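For reference, a small sketch of those three runner-side failure reasons and how a result might be classified — the real constants live in the runner's common package, so treat the names and values here as illustrative:

```go
package main

import "fmt"

type FailureReason string

const (
	ScriptFailure       FailureReason = "script_failure"        // the user's script exited with an error
	JobExecutionTimeout FailureReason = "job_execution_timeout" // the script ran for too long
	RunnerSystemFailure FailureReason = "runner_system_failure" // something broke on the runner side
)

// classify picks a failure reason the way the text above describes it:
// timeouts and system problems are detected by the runner, everything else
// is considered a failure of the user-provided script.
func classify(timedOut, systemErr bool) FailureReason {
	switch {
	case timedOut:
		return JobExecutionTimeout
	case systemErr:
		return RunnerSystemFailure
	default:
		return ScriptFailure
	}
}

func main() {
	fmt.Println(classify(false, false)) // script_failure
}
```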
So back, back, back — and we are in the run command. We are exiting from here, and we get back to the defer defined in the run command's processRunner, which again tries to fail the job with the trace in case it was some strange case that was not handled before. And after calling this defer function, the job is released by the runner: we call, with another defer, release build, which for example updates the Prometheus metrics, and we finally call the provider release, which in the case of some providers — some executors — restores some of the capacity. At this moment the job no longer exists in GitLab Runner; what happens next happens in GitLab.

Thanks for sharing that, Tomasz. There's actually a YouTube video I'll link to — I'll put it in the chat right now, but I'll also put it in the description of this once we upload it — that shows how this all looks at a bit higher level from the GitLab side, the GitLab Rails application. So kind of where Tomasz's walkthrough starts and ends is what that conversation covers. It's pretty cool, pretty interesting, some good diagrams; I think it's more looking at diagrams at that level than looking at the code at a lower level like what we just went through. Cool. Thanks so much.

Yeah, I also pasted an MR from Sashi, who's online, and I think, Steve, you've already started looking at that. So in the next five to ten minutes or so, I was wondering if it's worth quickly going through it live while we have both of you folks on Zoom.

Yeah, it's fairly small. Let me show my screen. The goal behind this merge request, if we look at the issue, is that the runner — well, GitLab itself — started supporting more different types of artifacts. Before, it used to just support zip files: it would zip them up and upload them to GitLab. But then it started supporting reports, like JUnit reports, test reports, licensing reports, and things like that, and those are handled differently by the runner and by GitLab as well. So this was a suggestion from Tomasz, actually, in one of the reviews we were having, that we need to be more verbose and explicit about what kind of artifact we're uploading — whether we're uploading a reports:junit, for example, or a normal artifact. And it's Sashi, right? That's how you pronounce your name? Yep, it's Sashi. Perfect, thank you. And then Sashi contacted me, like, can we achieve this in the current code base? I took a look at it. From the initial discussion, we wanted to just say "uploading artifacts for reporting", right? But the runner does not have the information that it's a reports:junit, and this goes very well with what Tomasz showed earlier — the job response.
So the communication between GitLab Runner and GitLab is all through JSON, and GitLab responds with the type — the type is basically whether it's JUnit, whether it's license management, and things like that — but it does not specify that it's a report. So I said, okay, the least change we can make, to make it more clear to the user with the data GitLab already sends to the runner, is to just print the type. So for example here, there's just a quick example: uploading artifact as code quality, uploading artifact as license management, or uploading artifact as archive. So it's a fairly simple change.

I already looked at it, and I was going to submit two more comments. The first comment is: I see that we added the quotes here — I imagine that's because of the example I gave here — but Go actually provides this automatically using the %q verb, which quotes it automatically; it's like %s but with quotes. So that's one comment. As for the other comment — first of all, I really like that you moved all of this into a single variable; that was one thing I was going to suggest if you hadn't done it, so I really appreciate that. The next thing is the if/else condition. Me personally, I'm not a big fan of if/elses; they're just two branches that you have to take care of. So one way I was going to suggest doing this is having a default value. Let me open up the code — I like to give examples. So let me open the editor, check out the branch, and here we can see it. To get rid of this else condition and make it a bit more clear, I would suggest doing something like this: for the message prefix, the default value is that we're going to upload an artifact, right, but if the type is not an empty string, we just overwrite it. This is a few lines shorter and a bit more clear. And then let me just add the comment — I was already writing this comment. Perfect, added to the review.

And then, once I've suggested all the comments — I'm pretty sure we do have tests, but we don't have tests that actually check each string, and that doesn't make sense anyway because it would end up being a pretty brittle test. But I always enjoy running through manual QA — let me run this manually to see it as a user. Sometimes it really helps, because sometimes — not in this case — but sometimes, for example, I wish there was a log line saying that we created container X or created volume X. So I always enjoy running things manually. So I'm just going to run the runner pointing to a local GDK instance. If I go to my GDK instance, I have this project, and I think I should have an example CI file with all the reports — well, some of the reports that we send... that's not the one... so these are the pods, right, and now if we open... I don't remember where it is, actually. I'll do this on my own time; I don't want you to see me struggle writing the CI file.

Let's take a look at the code to see if we can spot anything else. So we fixed this, we fixed the if condition, the message prefixes are all down here, and we can see that the pipeline is passing, so we didn't mess up any tests. So I think the only thing that is left, really, is doing the manual QA. I'll submit this review for you to take on, and I'll do the manual QA afterwards. So, because we're a bit close on time: do you have any other questions — do you think things could be simpler, or is there anything else I can tell you?
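To make the review suggestion concrete, here is a sketch of the default-value pattern with the %q verb — the variable names here are illustrative, not necessarily the ones in the MR:

```go
package main

import "fmt"

// uploadMessagePrefix starts from a default message and overwrites it only
// when an artifact type is present; %q adds the quotes around the type
// automatically, so no manual quoting is needed.
func uploadMessagePrefix(artifactType string) string {
	messagePrefix := "Uploading artifacts..."
	if artifactType != "" {
		messagePrefix = fmt.Sprintf("Uploading artifacts as %q to coordinator...", artifactType)
	}
	return messagePrefix
}

func main() {
	fmt.Println(uploadMessagePrefix(""))      // Uploading artifacts...
	fmt.Println(uploadMessagePrefix("junit")) // Uploading artifacts as "junit" to coordinator...
}
```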
I'm just getting started with the runner code base, and I'm new to Go as well, so... Yeah, that's perfectly fine. It's really nice that you started by asking in the issues instead of just opening a merge request, because that helps us guide you in the right direction and makes things a lot quicker for you and for us as well. Tomasz, do you have any objections to this change? Perfect. So yeah, as soon as you address this we can merge it. I'll even assign a milestone before I forget, which I do way too often and they always correct it for me. So yes, thanks a lot for this contribution. Yeah, thanks again for your contribution; it's nice to do a synchronous review.

Anything else? I guess we're sort of running up against time. No, we're pretty much at time; I don't think I have anything else. Thanks for the kind of unusual one this month, Tomasz. What I love about doing this is that we create this artifact, and everyone who ever needs to look at it can go back and watch this video later, so I think that will be really helpful. But yeah, good luck with the hackathon. Thanks. Cool, all right — we'll get the recording posted after the call as well. So thanks, have a good day. Yep, all right, thanks everyone.