Thanks. My name is David Nikolayev. I work on the Department of Energy's Exascale Computing Project, and I'm going to talk to you today about federation and zero-trust continuous integration in GitLab. They asked me to give a little spiel on who I am. I don't have any social media of any kind, so I figured I'd just show something random from my recent trip to Germany. This is a panda. I like to fancy myself a panda, rolling from project to project, and so I somehow found myself working on federation in GitLab. My background is in quantum foundations and mathematics, so I'm very interested in measurement theory and the context in which you run physics experiments, and I find federation fascinating because of that: I feel like it's setting up the measurement apparatus to study programs that could build themselves. If you're interested in knowing more about my background, we can talk after the talk. If you'd like to follow up after this, I'm on GitLab.com, and we have a big issue where we're trying to merge all these federation enhancements into GitLab; that's issue 33665. I'll present it right at the end so you can check it out later. So very quickly, I'm going to run through what high-performance computing is, what the Exascale Computing Project is, how we're trying to use GitLab CI in the ECP, talk at a high level about what the enhancements are, and then do a live demonstration. So fingers crossed everything flows as expected. Very quickly, by the numbers: the Exascale Computing Project is a big research project that was established by executive order under President Obama. It's about $2 billion over seven years, includes all the national laboratories and staff from across the country, and there are three technical focus areas. Today we're going to focus on the hardware and integration side. So, visually, to give you a sense of the complexity of the project.
So when you run CI for something like a web application, you can often write very small unit tests and run them in parallel by themselves. With computational physics, you have the challenge of multiple timescales, multiple size scales, and the coupling between things like a fluid dynamics flow and radiation transport. It becomes very complex, and it's just not sufficient to run simple unit tests. You have to couple everything, and you have to write very large integration tests. So to achieve the level of testing we want for something as intricate as compressible flow, we need to actually test our codes in the environments they're going to run in. That involves not just running on the big supercomputers that will run the simulations, but also the diverse ecosystem of software that exists there. So different compiler versions, different libraries, tools: all of these need to be exactly what the real simulation is going to use. Keeping that in mind, now we'll look at the challenges. Really, when you think about CI, you want pure automation, no people involved whatsoever, but there are challenges to that. Namely, CI jobs run arbitrary code, right? You often don't know who checked in the code. You're pulling in code from a third party, and you want to build in real time. In typical multi-cloud environments, we have Docker containers that can isolate your environment, but unfortunately those don't get bare-metal performance, and even still, most Docker containers are just run as root. In the HPC world, we need that bare-metal performance. I mentioned that we want to run on heterogeneous hardware architectures: we have supercomputers that use POWER9 systems, ARM systems, x86. And then we have this rich software ecosystem where modules are built at very specific versions, and there's the combinatorial nightmare of getting them all to work together.
You really have to test exactly how the big simulation is going to run, so trying to enclave it away in a container doesn't really work. On top of that, to tap into the HPC resources we use batch schedulers, and that in itself introduces new complexity because of the time limits: you have to have an allocation to run for that amount of time. Then of course there's security, and really, that to me is the fascinating part about federation: it solves a lot of the security challenges. To simplify it down: you cannot have jobs run by user A interact whatsoever with jobs run by user B. So you have this idea of an enclave. How do we ensure that happens? Well, we need really fine-grained user access control, and I'm going to talk about that in the federation enhancements. Ultimately, what this leads to is the idea of zero trust. It's not enough to just be on the system, to get through the firewall; every single action you take needs to fundamentally be authorized. And we're going to show how our federation enhancements are working towards achieving that. I'm not going to talk about why we need automation; hopefully that's self-evident. At a high level, this is the model that we're working with here. Projects host their code in GitLab or their own self-hosted GitLab. That's then mirrored to our central GitLab in the cloud at OSTI. And then those jobs are picked up by federated runners at the different facilities, Argonne and Oak Ridge. And you can see here, I'm going to talk a fair amount about the identity providers; this is the key source here, and I'll get into it in the next few slides. But we're really tapping into the OmniAuth functionality of GitLab to trigger the jobs and know who's running them. So I'll get right into the technical enhancements we need to make, and they can be summarized by analyzing what the challenges are.
So really, the three key challenges were: How do we refine the user access control? Normally, when a CI job is run, it's run as root, and we can't have that at the national laboratories. So how do we refine that control? How do we directly leverage the HPC resources? I won't talk much about that; it gets into some of the setuid and batch runners that we've been working on. And then what I like to call the three-body administration problem: we have server admins, runner admins, and, of course, admins at the identity providers. Thankfully, we all communicate, but there's still a coordination challenge there, which, well, is federation. So the key insight in how we enabled this workflow was realizing we needed to make OmniAuth a first-class citizen in GitLab. OmniAuth is a middleware layer in Ruby that standardizes multi-provider authentication, and in particular, there is a schema for the auth hash. Inside GitLab currently, only the required fields are actually leveraged: you have a provider, a user ID, and some basic info like a name. But as part of that schema, there are additional fields inside the credentials hash, inside the auth hash. There's a token (that's an access token), a secret (a deprecated value from OAuth 1), a Boolean for expires, and then an expires-at time. So we take advantage of that credentials information as provided by an OAuth 2-type identity provider. We also actually support SAML and other providers. Of course, it required a fair amount of tinkering around, which, again, if you're interested in the details, you can come talk to me about after. But fundamentally, we're taking advantage of the authorization code flow from OAuth 2. Steps 1 through 9 here are all standard OAuth 2 code flow, so I won't dive into them in detail, and in the demo you'll see me walking through these steps. Really, the key enhancements start at step 9. Once the provider sends over some information, you get a payload of auth information.
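As a rough illustration of that payload, the auth hash schema described above looks something like this. This is shown as a Python dict purely for illustration (GitLab's actual OmniAuth auth hash is a Ruby object, and the values here are made up); the field names follow the OmniAuth auth-hash schema.

```python
# Illustrative shape of an OmniAuth auth hash; all values are hypothetical.
auth_hash = {
    "provider": "google_oauth2",          # which identity provider
    "uid": "116487",                      # provider-specific user ID
    "info": {"name": "David Nikolayev"},  # basic info stock GitLab already uses
    "credentials": {                      # the part our enhancements start storing
        "token": "ya29.example-access-token",  # OAuth 2 access token
        "secret": None,                        # deprecated OAuth 1 field
        "expires": True,                       # whether the token expires
        "expires_at": 1700000000,              # Unix expiry time
    },
}

def extract_credentials(auth):
    """Pull out the credentials sub-hash that stock GitLab ignores."""
    creds = auth.get("credentials", {})
    return {k: creds.get(k) for k in ("token", "expires", "expires_at")}
```

Stock GitLab reads only `provider`, `uid`, and `info`; the enhancement is persisting the `credentials` fields so they can later travel to the runner.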
As I said before, GitLab just grabbed the provider and external username, and that's it. Really, it was merely meant to handle social sign-in. But we want to do more with that: we want to take advantage of the credentials information that's provided. So in step 9, we extract that credentials information and put it into the database. And now, when you want to trigger CI, you can actually pass that user's identity down from the server, all the way down to the runner. You can see there in step 10, all that data is passed down to the runner. And whereas before the runner would just execute the CI job and send the response in step 12, now you can actually do step 11: the runner can communicate directly with the identity provider. So now you can see where that three-body administrator problem comes into play. Before, you'd have to get the token registered, and then you just had this implicit trust all the time. But now we give power out: the server admin still has the same power they had, but the runner admin has more power to say, you know what, maybe I don't want to run that job; I want to communicate with the identity provider I trust first. So they can take that access token and validate it against the identity provider. I won't go into the digital identity guidelines too much; I'll just briefly mention them because you'll see some numbers in the demo. We try to be NIST-compliant, so there are things called identity assurance level and authenticator assurance level. They basically ask: did I show you my driver's license when I got my form of authentication, and do I use a crypto card or a YubiKey, something at a higher level than just a username and password? Another critical aspect that, unfortunately, I'm not going to have time to dive into in this version of the talk is how we audit the federation events, but I will briefly show what we can see.
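Step 11 above, where the runner checks the access token with the identity provider before running anything, could be sketched like this. This is a hedged sketch, not our actual runner code: the endpoint URL and client credentials are assumptions, and the request and response shape follows OAuth 2.0 Token Introspection (RFC 7662) rather than any particular provider.

```python
import base64
import json
import urllib.parse
import urllib.request

# Hypothetical IdP introspection endpoint; a runner admin would configure the real one.
INTROSPECT_URL = "https://idp.example.gov/oauth2/introspect"

def introspect(token, client_id, client_secret):
    """POST the access token to the IdP's introspection endpoint (RFC 7662 shape)."""
    data = urllib.parse.urlencode({"token": token}).encode()
    req = urllib.request.Request(INTROSPECT_URL, data=data)
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    req.add_header("Authorization", "Basic " + basic)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def token_is_valid(response):
    # RFC 7662 guarantees only the "active" field; anything else means refuse the job.
    return response.get("active") is True
```

The runner-side check amounts to calling `token_is_valid(introspect(...))` and refusing to execute the CI job unless it returns True.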
Suffice it to say, our enhancements directly take advantage of an Enterprise Edition feature called audit events: every federation event literally looks like an audit event. All right, so I'm going to do a live demo now. The progression is: I'm going to show the transition from what a normal CI job looks like when you use just a normal shell runner, then we progress to a setuid enhancement, then full federation; time permitting, I'll show some auditing stuff, and then of course lessons learned and where we're at now. Okay. So this first part is going to use a bunch of virtual machines that I have. Let me zoom in here. Make sure you guys can see; that's maybe a little bit too big. That's still good. So this first part is all on a bunch of virtual machines that I have running; you'll see the 1026 for all of them. Okay, so in my example repo here, I have a very simple CI YAML file. I have a tag that associates it with one of those VMs, which is running a very simple standard shell-executor runner, and I'm just going to run the simple command id; I want to know who's running the CI job. So if I come over here to my pipelines and look at the latest job, I see, as I expect, that root ran the job. That's because the runner's running as root, and so of course the CI job is going to run as root. So the next iteration here is to take advantage of setuid. That's where I'm going to actually try to change the user that's triggering the job to actually run it. Let's test. I'm going to cheat here for a second, actually, because I need to make sure I get the right tag. You can see here, as part of the demo, there are quite a few runners: gitlab-runner setuid shell, gitlab-runner... yes, okay. So, same thing, I'm going to run a very simple command here; I just want to know who's running the job. Okay. So now if I come over to my pipelines... oh, it's failing. Okay.
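The setuid idea being exercised here can be sketched as follows. This is a hedged illustration of the mechanism, not the actual GitLab Runner patch (the function names are mine): a runner started as root resolves the triggering user's name to a local UID and drops privileges before executing the job script, and the lookup raises an error when no matching local account exists, which is exactly the failure mode you see when the usernames don't line up.

```python
import os
import pwd
import subprocess

def local_uid_for(username):
    """Resolve a username to a local UID; raises KeyError if no such
    account exists on this machine (the 'unknown user' failure)."""
    return pwd.getpwnam(username).pw_uid

def run_job_as(username, argv):
    """Run a CI job script as the given local user. The runner process
    itself must be running as root for os.setuid() to succeed."""
    uid = local_uid_for(username)

    def drop_privileges():
        os.setuid(uid)  # executed in the child, just before the job script

    return subprocess.run(argv, preexec_fn=drop_privileges)
```

For example, `run_job_as("ci-user", ["id"])` would print the dropped-to UID, while an unknown username fails before anything runs at all.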
Well, it's telling me there's some unknown user, dnik, trying to run the job. That's because setuid, as a first pass moving towards federation, literally just takes my GitLab account username. That's my username in the server: it's dnik. But if you look at my username here (yes, it's in a VM, but I kept it the same as my laptop), I'm dnikolayev. So the usernames don't match. All right, so let's very quickly make this job pass. Let's change the CI file. Now I'm not going to use a VM; I'm actually going to use my raw laptop, if you will. Okay. So if we come down to pipelines now... ooh, maybe I got the tag wrong. Well, that's why I have this open: fuego setuid shell. Oh, I think maybe I hosed my config.toml file. Oh, no, that's not why. My favorite issue: I need to be running as root. Hopefully that's the only snag in this demo. Okay. Yes. Okay. So now you see the job passed as dnik. Well, for the purposes of this demo, you can see I created a local username dnik just so it would pass. So you can see there are some limitations with setuid: the fact that it's grabbing the username straight from the server. So now let's move to federation. Federation is going to tap into OmniAuth, like I mentioned. I come to my profile account page, and you'll see the social sign-in. Another way to get there: you come up to settings, and then you click account. So I have two identity providers here. I'm just going to show off this one: a very simple JSON-web-token-based IdP that I have a VM running. All right, so you can see I have an active session now with my identity provider. And now, when I edit my CI YAML file for federation, you can see, again, I've got this tag specific to that federated runner; one of those VMs I have is using a federated runner that needs to talk with that identity provider. And I'm going to do, again, id: who's running the job?
So now when we come down to the pipelines, you see it's running, and federation passed. And this time, we see it actually ran as dnikolayev, which is not my GitLab server username; it's the one that was provided by the identity provider. So all is well. I have delegated the control down to the runner to talk with the identity provider and ask, you know, who should be running this CI job? And so we've elevated the role of OmniAuth to manage user access control and fundamentally tap into the identities table that exists in GitLab. Okay, let me make sure I'm on track here. I showed you guys the shell runner and setuid, moving to the federated runners. Okay. So now I'm going to show you our pre-production testing environment. This is no longer on my VMs; this is at osti.gov. It's our development instance. Here you can see the social sign-ins; we have a few. OneID: this is a really cool authentication hub where we can tap into all the national labs, really; well, not all, but many of them. I'm not going to use that for this demo because I don't feel like grabbing my crypto card out. But for the case of multiple sign-ins, I will show you guys NERSC quickly, since it's just right across the bay. Pull out my one-time password for MFA. There's NERSC. Okay. But for the actual next step in the demo, I want to show you something that happens with Google. So I'm going to also activate my Google social sign-in. And let me just make sure that this is indeed broken purposefully. No. Okay. So I want to show you a failure first; I forgot to turn on the failure on purpose. All right. So now when I come to my project, same deal: I'm just going to run id. I've added some other stuff here; these are some variables that actually get passed down to the runner, but it's not really important to see, so I'll remove that so we can trigger the job easily. Okay. So now you can already see here on the runner, something's failing.
Well, it looks exactly like that failure we had with setuid: namely, it's trying to run the job as some user that doesn't exist on the system. It's trying to run as user 116487. Well, that's my Google user account, and I of course don't have a local account that matches my Google user account. So how do I fix this problem? Well, as I showed in the authorization code flow diagram, we really wanted to give a lot of power to the runner admins. What you can see here in the config.toml is that we actually allow you to specify a validate-auth script, which I'll show you in a second. Or actually, I might as well just show you this right now before I do the run. That shell script just calls a Python script, which is very simple. Okay. For this example, it's very basic: I just have this whitelist that says, okay, I'm going to remap this user ID to the local one that I care about. We also take the access token and hit the endpoint that we need to validate against. So in this Google example, you can see I'm going to actually communicate with Google and validate the access token that they gave me. So I'll restart the runner now with that validate script, retry, and you can see the job has succeeded on the runner. And if we refresh here, there you go. You can see that it not only remapped my Google user account, that long integer, to the actual local one that I cared about (again, an operation performed by the runner admin), but in addition, as I showed you in that script, we took the access token and hit Google's endpoint to in fact ensure, hey, this is a legitimate session that they have with Google. So I showed you the federated runner stuff at OSTI. As I said, we really focused on what the runner admins can do with the credential information, and I showed you that script. I still have a couple minutes, so I'll very quickly show you some auditing stuff. This is tapping back into my VMs.
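In spirit, that validate-auth script looks something like the following. This is a hedged reconstruction, not the script from the demo: the config.toml setting name, the whitelist contents, and the local username are all illustrative, while the tokeninfo endpoint is Google's real one. It does the two things described above: remaps the provider's user ID to a trusted local account, and checks the access token against the provider.

```python
# Illustrative validate-auth script, wired into the runner via a config.toml
# line along the lines of: validate_auth_script = "/etc/gitlab-runner/validate_auth.sh"
# (setting name and path are assumptions for this sketch).
import json
import sys
import urllib.error
import urllib.request

# Runner-admin-maintained whitelist: provider user ID -> local username.
# The local name here is illustrative.
WHITELIST = {"116487": "dnikolayev"}

def remap(provider_uid):
    """Return the local username for a whitelisted provider UID, else None."""
    return WHITELIST.get(provider_uid)

def validate_google_token(access_token):
    """Ask Google's tokeninfo endpoint whether the access token is live."""
    url = "https://oauth2.googleapis.com/tokeninfo?access_token=" + access_token
    try:
        with urllib.request.urlopen(url) as resp:
            info = json.load(resp)
        return "error" not in info
    except urllib.error.URLError:
        return False  # unreachable or rejected token: refuse the job

if __name__ == "__main__" and len(sys.argv) >= 3:
    local_user = remap(sys.argv[1])
    if local_user is None or not validate_google_token(sys.argv[2]):
        sys.exit(1)  # non-zero exit: the runner refuses to run the job
    print(local_user)  # the runner executes the job as this local user
```

The key design point is that this policy lives entirely with the runner admin: the server can hand down any identity it likes, but nothing runs unless the runner's own script both recognizes the user and confirms the token with the identity provider it trusts.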
So if you have Enterprise Edition, you can tap directly into the GUI: you go to the admin audit events page. But we also dump to standard logs here. If you've never played with this before, I highly recommend it; it's a very cool way to parse the JSON. You can see here we have full entries in the log. We give unique identifiers to each federation event, and you can see, oh, hey, a CI job passed. Perhaps we put a little too much information in it, but you can see here under the target details: okay, it's of type CI, here's the job ID, what the project was, the project ID, the commit hash. So if anything malicious does happen in your federated environment, you can always audit and trace back and see what to do about it. I believe that should cover everything I wanted to go over for the demo. All right. A few seconds left. High-level lessons learned; I can see my notes. Okay. This is perhaps more of a personal reflection on the project, but I feel like one of the biggest issues we had with trying to get to the state of federation we're at was not really having a full formal specification up front. Security is really challenging, right? You can't expose every part of your attack surface up front; an attacker will eventually find it, but you trying to find it first is very hard. So it's nice to have a high-level description of what you want, but I think it's worthwhile to take the months necessary to really expose as many parts as you can, so that when it comes time for development, you don't expect anything more than just a few iterative changes. We had a challenge where we went one way with setuid in the beginning, and it ultimately didn't mesh as well with federation as we wanted, and so it took a lot of catch-up to evolve it to where we needed to get to. This next lesson is going to be kind of preaching to the choir, hopefully, but really you should just test, test, test.
We didn't start making exponential advances in our development until we built that full virtual environment and had Docker containers testing everything. I mean, we started to run CI/CD on our CI/CD enhancements. I can't overstate how important it is to just test your own tests. And then the third biggest lesson, especially since we want to upstream this to the community, is getting involved early. Again, one of the failures, perhaps, early on was that we didn't engage the GitLab team early enough with the setuid enhancements. But with federation, we've reached out to our points of contact, met with everyone in the community, and we're here at GitLab Commit. So if you really want to make an impact on the open-source community, I think getting involved as early as possible is critical. In terms of current state, we have a lot of our ECP projects (I showed some of those pretty pictures of computational physics simulations) all starting to use our CI infrastructure, and we're nearly at a place where everyone's going to be running through our production OSTI instance. In terms of next steps, the user experience is something we want to work on. Obviously, going to /profile/account to connect your social sign-ins isn't the most convenient way to do this, so we're looking at commit assurance, where you can actually sign your commits with tokens in real time. Code scanning is a big part of where we want to go as well. And tapping into new schedulers is another big deal. As I said, bare-metal performance is critical to us, and some of the schedulers are getting a little bit older, so tapping into things like Flux is on the roadmap. So that's my talk. I hope you all enjoyed it. If you have any questions, please come up and talk to me after. Thank you.