So Kyle and I are here today to talk to you about some of the experiences we've had at Northwestern Mutual running GitLab in Kubernetes, at scale for the enterprise, and running GitLab CI through there. This slide is the thing we all click and just accept without reading, so we'll move on from that.

So like I said, we'll go through some of this. At Northwestern Mutual, about five or six years ago, we started going through a digital transformation where a lot of our leadership realized we needed to modernize our applications and our architecture, and we wanted to shift to being more of a software company that also sold insurance, financial planning, and services like that. To facilitate that, the company took a step back, and we took a group that really started piloting these things, using Kubernetes and Dockerized containers. At the time, we needed a new source control system, something that would actually work with where we were going. GitLab was the one that won out, and a big part of that was it being part of the open source community. That was a huge draw for us, allowing us to contribute back to help get the things we needed in and go faster.

So while everything else was getting modernized, we figured, why not do it for our SCM as well? We had the crazy idea of, let's throw GitLab in Kube and see how this goes.

Off the bat, obviously, we all know Kubernetes, know it, love it. It gives us a great ability to run applications in a highly available architecture without having to think a lot about different auto-scaling groups and all the nitty gritty of actually running that; you just deploy and you're set to go. However, back then, nobody was doing that for SCM. Multiple people, even from other companies, were looking at me like, you're doing what? Why? Generally my answer was, well, why not? But back then there was no documentation on how to run GitLab in Kube. You have all the little quirks of having to figure out the resource usage. There was only the Omnibus container back then; Jason Plum hadn't broken apart all the different parts of GitLab and created the Helm chart to make it even easier. So we had to deal with this container that had everything packaged in, and we had to try to pare it down to the certain parts we needed. Otherwise it just took forever to start up and was a resource hog.

We also had to figure out how we were going to handle our planned volume of nearly 3,000 users and, since this is an old slide, it only says 21K repos; I think it's pushing 40,000 now. And we wanted to figure out how to best leverage the AWS services so that if we ran into an issue, we could easily flip back and forth.

So up next, this is what we came up with in 2016. Don't take a picture, because you don't want to do exactly this. This was four years ago, when people were like, why not? Obviously it's super high level, but we have our Kubernetes cluster there, and we're running all the Omnibus containers in it. We had to break them apart: ones more dedicated to the API and web traffic, ones more dedicated to the Git functionality over SSH, ones for Sidekiq, and so on, plus the container registry.
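As a rough illustration of that role-splitting idea (not our actual manifests), here's a minimal sketch of what one trimmed-down Omnibus deployment might look like, assuming the public gitlab/gitlab-ce image and its GITLAB_OMNIBUS_CONFIG environment variable. The specific service toggles are illustrative:

```yaml
# Sketch: a Sidekiq-focused Omnibus pod, trimmed via GITLAB_OMNIBUS_CONFIG.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gitlab-sidekiq
spec:
  replicas: 2
  selector:
    matchLabels:
      app: gitlab-sidekiq
  template:
    metadata:
      labels:
        app: gitlab-sidekiq
    spec:
      containers:
      - name: gitlab
        image: gitlab/gitlab-ce:latest   # hypothetical tag; pin a real version
        env:
        - name: GITLAB_OMNIBUS_CONFIG
          value: |
            # run only background-job pieces in this pod;
            # web, Git/SSH, and registry live in their own deployments
            nginx['enable'] = false
            gitlab_workhorse['enable'] = false
            registry['enable'] = false
```

Each role gets its own deployment like this, each disabling the services it doesn't own, which is the pared-down-Omnibus approach described above.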
But because we were just figuring it out as we went, we also had an Omnibus installation on an EC2 instance as a fallback. To help facilitate that, we stood up our Redis clusters using ElastiCache. We leveraged S3 as much as we could; as new things came into GitLab, like LFS, artifacts, and uploads, we put those on S3 as well, so it would be easier to flip back and forth if we ran into a problem. We broke out Gitaly as soon as we could. Initially that was just NFS, and then we started running Gitaly on there as well in order to, again, be able to flip back and forth, leveraging AWS Elasticsearch and so forth.

Now, some of the things we ran across that I want to make sure you can avoid. We've had the experience, we know it didn't work quite right, so you don't have to hit these issues too.

So, EFS. Obviously it would make life so much easier: you've got a highly available, managed NFS service, and if you've ever tried to manage an NFS cluster yourself, you know the pain of never wanting to do that again. Unfortunately, EFS has a hard IOPS limit of, I believe, 8,900 IOPS, and that's shared across the whole file system. So if you've got something provisioned across four AZs, it's all shared there. And Gitaly works so fast that it can try to re-access a file while it's locked, while it's still being copied across the AZs, which, if you've ever experienced that, causes everything to go tumbling over. So unfortunately, that's not an option.

kiam, if you've ever used it, is basically Kubernetes IAM. This isn't so much an issue with kiam itself; it was more a lesson learned on our part, because this was long ago and we forgot to put resource requests and limits on things. Because the Omnibus container was so resource hungry, it would starve out kiam on a node, and kiam would go into a crash loop, which would then cause GitLab to throw these sporadic 404 errors. We were really sitting there crawling through Kibana, searching logs, going, what in the world is going on, before we finally realized, OK, we shot ourselves in the foot a little bit on that one. We went back and added the requests and limits so the Omnibus containers were properly scheduled by Kubernetes, which stopped the constant pain of those little quirks.

This next one is probably a bit of a head-scratcher for most, because you're wondering, why would NFS be an issue? Really, who here has actually managed a large-scale NFS? OK, exactly, almost nobody; we've got one guy in the crowd. If you've never done it, NFS is actually a pretty memory-intensive app when there's a lot of activity going on. I didn't know that. So I tried to be a little overly clever and run the Gitaly container on the NFS server itself. And if you've ever had NFS go into swap, you know it doesn't like that. What would happen is the NFS server would fall over, GitLab would lose the connection to it, and you'd have to restart both. So GitLab would crash. It took us a while to realize it was going into swap. Again, it was really just us trying to be overly clever and save some money by running both on the same instance, so the Gitaly container could directly mount the EBS volume attached to the EC2 while the same data was served out over NFS. So really, NFS itself isn't the problem.
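Circling back to the kiam incident for a second: the fix was plain Kubernetes resource requests and limits on the GitLab containers. A minimal sketch, with sizing numbers that are hypothetical rather than our production values:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gitlab-web
spec:
  containers:
  - name: gitlab
    image: gitlab/gitlab-ce:latest   # hypothetical tag
    resources:
      requests:            # what the scheduler reserves on a node
        cpu: "2"
        memory: 8Gi
      limits:              # hard ceiling, so one pod can't starve kiam
        cpu: "4"
        memory: 12Gi
```

With requests in place, the scheduler won't pack an Omnibus pod onto a node that can't actually feed it, which is what was starving kiam out.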
I would recommend, obviously, just using Gitaly and completely ignoring NFS; the only thing you need NFS for anymore is Pages anyway, and there are other clever ways you can get around that now too. So just to recap: EFS, just say no. NFS, if you are going to use it, just make sure not to run Gitaly in a container on the same box and overrun your memory. And obviously, in Kubernetes it's best practice to always set your resource limits on everything.

Going forward, some of the stuff we're going to be using: the Helm charts. If you haven't kept up with what Jason Plum has been doing with the Helm charts, take a look. If I were starting all over again, this is exactly what I'd do; it's so much easier. They just added Geo to it now as well. So if you're like us and you deploy GitLab with GitLab, which creates that wonderful chicken-and-egg issue, you can now leverage Geo, and you can have everything in the same cluster; you just throw them in two different namespaces. You can redeploy the primary while the Geo secondary serves reads to keep things up. So while the other containers are spinning back up, your users never lose access, because GitLab is still up, and then it can cycle through. Really, at this point, you just change a value to bump up to the new Helm chart and you're set to go (there's a minimal sketch of this at the end of this section). It's not a mess of trying to keep your configuration straight across all these different repos in order to redeploy. Super simple. That is the recommendation: if you want to do a GitLab-in-Kubernetes deployment, this is the way to go.

For us, EKS is another thing we're going to look at, because honestly, managing a Kubernetes cluster gets a little tiring after a while, especially when you've got 30 other things you're trying to manage too. If you can cross one off, and AWS does a great job of it, why not let them handle it. All the other cloud providers obviously have their own flavor of this as well: Azure, Google, DigitalOcean, and so forth. Just make life a little easier for yourself.

Now, the one thing that's always been the trick, and still is: Gitaly is still kind of a single point of failure. It's not highly available yet. There is some stuff coming; I believe in 12.7, 12.8 they're going to start introducing more of that. But being able to scale your data across multiple availability zones has always been the trick. How do you do that? How do you make your data highly available? So we've been watching this CNCF project, Rook; I believe it's now a graduated project. And it kind of does that: it leverages Ceph to span storage across your entire cluster. It makes life a little easier knowing that, technically, if you've got a deployment across three AZs, you could lose two AZs and still be up. Personally, that's exactly what I've been trying to do; it's kind of a personal goal. I want to hit the five-nines club without spending the money you'd normally spend to get an app to five nines. So that's one we've been looking at.
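Before moving on: to make that Helm-chart recommendation concrete, here's a minimal sketch of a values file for the official gitlab/gitlab chart. The hostnames are hypothetical, and a real install involves more decisions (ingress, TLS, external Postgres, object storage, and so on):

```yaml
# values.yaml — minimal sketch for the official GitLab Helm chart.
# Install/upgrade (bumping --version is essentially the whole upgrade story):
#   helm repo add gitlab https://charts.gitlab.io
#   helm upgrade --install gitlab gitlab/gitlab -f values.yaml --version <chart-version>
global:
  hosts:
    domain: example.com        # hypothetical; GitLab serves at gitlab.example.com
certmanager-issuer:
  email: admin@example.com     # used for Let's Encrypt registration
```

That's what "change a value and you're set to go" looks like in practice: the chart version and a small values file, instead of hand-tended configuration spread across repos.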
The only other thing that would be really cool, and I don't have a slide for it yet because I was just talking with a guy about it not too long ago, is a project called Vitess. They don't support Postgres just yet (it's MySQL-based today), but it's another CNCF project, and it's actually the type of database YouTube uses; it originally came out of them. After Google purchased YouTube, they were immediately put onto Borg, and because of the way they wrote it to work on Borg, when Kubernetes came along they went, oh, this already works. So it was actually labeled as ready for Kube before Kube even hit 1.0.

But with that, those are some of the things we would definitely take a look at. Rook, like I said, we're still POC-ing; I'm not endorsing it. Take a look, play around with it, see if it works for you. Otherwise, go ahead with the Helm charts. And with that, I'd like to hand it off to my co-presenter here, Kyle.

Hey, thank you, Sean. All right, so I'm going to pivot just a little bit. Sean gave us a great background on how to deploy the GitLab application in Kubernetes, and I'm going to talk a little bit about how to deploy GitLab CI in Kubernetes, and compare and contrast it with some of the other offerings in GitLab CI.

So just to set the context: why would you want to run your own runners? If you're on gitlab.com, you get some free minutes, and that's cool and all. But maybe you want runners that are a little more performant; that might be one use case. Maybe you want runners that have some extra privileges; I'll get into that more later. Or, obviously, if you're self-hosted, you're going to need to provide the workhorse to actually run your CI jobs. You get the command-and-control plane out of the box with GitLab, but you need to bring your own farm to actually run your CI.

So if you decide to make that jump, you're first faced with the choice of: which executor do I want? There's a bunch out there; there's actually even a newer one now, the custom executor. But immediately off the bat, we can rule out the ones designed for local use if we're talking enterprise scale. We want something that autoscales, and that brings us to focus on really the Kubernetes executor and the Docker Machine executor. The others are more there for experimentation or local dev.

So here's what those two look like from an architectural perspective. On the left, we have the Docker Machine executor, and the key here is that every build runs in its own instance. So when you spin up your runner, it's not actually doing any work; it's really just delegating that out to other machines, whether they're EC2s or Droplets or whatever your cloud provider offers. A whole bunch of different ones are supported. But that has a cost too: those machines take a little while to spin up, so they're kind of laggy. But you get great isolation, so you don't have to worry about noisy neighbors, things like that.

Let's flip over and talk about Kube for a second. In Kube land, every build, every job, runs as a pod. So now we can get a little better density; we can get multiple jobs running on a single node. And that's going to get us better performance per cost, if you can tolerate the fact that you might have some noisy-neighbor problems.

So, recapping some of these trade-offs. Immutability, immediately out of the box: we're definitely not going to want to use the shell executor at enterprise scale, because you're going to have different teams stepping on each other as the underlying OS gets mutated. So immediately you're going to want to go to the Docker executor. I think that's like step two in the CNCF trail map.
So get on Docker right away, and then make your decision as to whether you can make the trade-off of running in Kubernetes, where you're OK sharing capacity to get that density, or whether you really need isolated workloads, in which case you can stick with Docker Machine.

If you're on Kubernetes, one thing you might want to do, just as a pro tip: consider separating out your workloads so that all of your CI runs in its own node pool. You can do this to isolate things, so that if your CI goes off the rails or is over-provisioned, the rest of your apps aren't impacted. We can do this through some classic Kubernetes primitives, mainly node labels, taints, tolerations, and selectors (there's a sketch of this at the end of this section).

So the first thing we do in the config here is label our nodes as dedicated to GitLab CI. Then, when a build pod spins up with a matching node selector, it'll be attracted to those nodes. So that's one way to gravitate all the GitLab CI pods towards the nodes we allocated for GitLab CI. That's only half the problem, right? We also want to repel the regular workloads. Let's say you have your web app running on your cluster. So we taint those same nodes to say, don't let the pods running our web app actually land on the CI nodes. And then finally, we follow that up with a toleration to say, yes, the CI workloads are allowed to flow onto those nodes. So with those four Kubernetes concepts working in concert, we can shift the CI workload onto its own dedicated nodes. And that opens up all kinds of cool possibilities; for example, if you want to change your autoscaling policy and apply some fine-tuning to your autoscaler for just the CI pool, that opens the door for it.

OK, so let's say you've got your Kubernetes executor set up. How can we use this to boost our security posture? That's the story I want to tell in the last part of this talk. So here's my straw-man use case, if you will; it's a pretty classic one. We've got a pipeline, and at the end of it we want to talk to a managed service. That could be S3; let's say we're deploying a static website, or publishing a customer download. In one of the keynotes this morning, we saw Kubeflow and how it wants to interact with a bucket. So, a fairly common use case. But the point is, we're going to need to get some credentials into that pipeline.

So the first thing we can do, probably the easiest, is to set up some protected variables. Everyone's probably done this before: you generate an IAM user, or whatever your cloud's basic security entity is that gives you some kind of API token, and you store those in protected variables and use them right in the pipeline. It's really easy for developers, and it's great because it's very portable; you can run this on any shared runner and you're good to go. The downside: it's not really that secure. They are protected, so you get some masking. But let's be honest, they don't get rotated, and they're not time-bound. So this is kind of risky; say someone saves the creds off in their LastPass. How many people here have generated an IAM user before and used those credentials in your pipeline? How many people have, let's say, rotated that credential, ever? Lost a few hands. How about in the last year? Month? Hour? So you get the idea, right? Once credentials are out there, if they're not rotated by some sort of automation, it's probably just not going to happen.
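Here's the node-pool isolation sketched out, as promised. The label and taint names are hypothetical, and the runner config is shown as gitlab-runner Helm chart values (the `runners.config` TOML passthrough exists in newer chart versions); the mechanism itself is stock Kubernetes:

```yaml
# 1) Mark the CI nodes (one-time, per node or via your node-group tooling):
#      kubectl label node <node-name> dedicated=gitlab-ci
#      kubectl taint node <node-name> dedicated=gitlab-ci:NoSchedule
#
# 2) gitlab-runner Helm chart values: steer build pods onto those nodes.
runners:
  config: |
    [[runners]]
      executor = "kubernetes"
      [runners.kubernetes]
        # attract build pods to the labeled nodes...
        [runners.kubernetes.node_selector]
          dedicated = "gitlab-ci"
        # ...and tolerate the taint that repels everything else
        [runners.kubernetes.node_tolerations]
          "dedicated=gitlab-ci" = "NoSchedule"
```

The taint keeps regular app pods off the CI nodes, the toleration lets CI pods on, and the selector makes sure CI pods land nowhere else.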
So the next thing you could do to boost your security posture is move towards roles. With role-based access, we can take that same policy and apply it to an instance, in this case the runner. And now, through the metadata service, we get time-bound credentials. That's a step up: if someone happens to walk off with them, or cats them out in the pipeline, or echoes them, there's still the potential for a leak, but it's now constrained. We've reduced the risk profile by making sure someone can't walk off with credentials that stay valid for a long time. And the cool part is that it's still the same interface, the same command-line call; we didn't have to change what the developer does.

But the disadvantage is that the runner has now become a privileged entity. We can no longer have it be a shared runner, or you'd lose your security. So instead of a shared runner, we now have a runner tied to a particular use case, and this gets out of control quickly. Say you need to spin up a runner for every single environment, every single project, every single security domain; every single policy has its own credentials. That's kind of crazy. So this works well for select, specific use cases, but like Sean was saying, we have thousands of projects and hundreds of groups. It's not going to work at enterprise scale.

So how do we find the happy medium, the baby bear's warm porridge in this scenario? This is where kiam comes in with some advantages for us. Sean talked a little bit about kiam; basically, you can think of it as IAM roles for pods. Remember, in GitLab CI with the Kubernetes executor, a pod is the unit a build operates at. The other key thing to take away about kiam is that you can control the roles through annotations. So if we put these concepts together, we can go into the GitLab runner configuration and say: establish a convention around the name of these roles, and bring in CI variables to define them. For example, you could pick the project ID, you could pick the group ID, you could pick the DNS-compatible project path; just establish some convention, and lock the user out from being able to override it. If we set that up once as an admin, then from then on, every time someone runs a pipeline, they can define their IAM policy against the GitLab project. Instead of associating a policy with, say, a role or an IAM user directly, we're really just indirectly associating the IAM permissions with the actual GitLab project itself. We're treating that project like a first-class citizen in IAM.

So now they can run their pipeline and make the same API call, the same command-line call, and it goes through the metadata service; kiam intercepts that for us. There's really no change in how the developer interacts with the service, which is great. What happened here is we cut out that bottom layer in the chaotic matrix: rather than having a runner per security domain, we can still use shared runners, but because of the rigid context kiam is providing, we're still constraining the security profile to a specific project. So this is good because we get that balance: we can use shared resources but still have tight security.
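A minimal sketch of that setup. The role-naming convention (`gitlab-ci-<project id>`) is our hypothetical example, and it assumes kiam is already running and an IAM role exists per project; the runner docs note that CI variables are expanded inside pod annotations, and leaving the annotation-overwrite setting empty is what locks users out of overriding the role:

```yaml
# gitlab-runner Helm chart values (sketch). The build namespace also needs
# kiam's whitelist annotation on the Namespace object, e.g.:
#   iam.amazonaws.com/permitted: "gitlab-ci-.*"
runners:
  config: |
    [[runners]]
      executor = "kubernetes"
      [runners.kubernetes]
        namespace = "gitlab-ci"
        # empty regex = jobs cannot overwrite pod annotations from .gitlab-ci.yml
        pod_annotations_overwrite_allowed = ""
        [runners.kubernetes.pod_annotations]
          # every build pod assumes the role named after its GitLab project
          "iam.amazonaws.com/role" = "gitlab-ci-$CI_PROJECT_ID"
```

Because the annotation is derived from `CI_PROJECT_ID` by the runner itself, the project, not the runner, becomes the unit that IAM permissions attach to.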
The downside to this approach: if you go the project-number route, or project-ID route I should say, those are kind of magic numbers. And if you go the DNS route for defining your convention, you can run into some character limitations. But at the end of the day, this is just an idea I'm throwing out there, and hopefully we can get some feedback on it to see where we can improve. So just to recap: the idea is to take a GitLab project, federate it, and give it an identity in IAM. That way we can boost our security posture without disrupting the developer, who keeps using the same API. Usually security can be kind of a zero-sum game; in this case, we were actually able to boost the security posture without disrupting how the developer interacts.

And with that, I think we've got some time for questions. If anyone wants to ask about the CI aspect, or, I'll open it up to Sean too, the app in Kubernetes, feel free to shout them out.

Not really a question, just a couple of things, because we did something pretty similar to this. Regarding the NFS: it's a bit more pricey, but we actually got much better results from using an i3 instance and using the local NVMe for storing the data. So we'd have two EBS volumes in RAID 1, and we'd mount them inside the i3 instance. Of course, you need more storage and it's more expensive, but it's still giving better results right now than anything else.

And that's actually how Rook itself works. If you look at some of their benchmarking, the way they recommend doing it was using an i3 instance with the NVMe as a dedicated node in your cluster, or three of them to spread across AZs, and it would use just those. And in head-to-head testing against provisioned-IOPS EBS, I believe the i3 pretty much blew away a 5,000 IOPS EBS volume, and a bigger i3 blew away a 10,000 IOPS EBS.

Right now we are running GitLab, using the Helm charts to deploy it, and we're using the locally attached disk with it as well, and it's like ten times slower than anything we got in Amazon. And one question on the CI side: have you considered using Vault for provisioning temporary credentials? That's, for example, what we do for some of our deployments.

Yeah, we've been looking at that as well; it's on our roadmap. Fingers crossed, we're hoping for that first-class Vault support. I know there's an issue out there; we just haven't been able to get the traction to get it integrated in the app yet. Yeah?

Actually, we got really good results with the Docker Machine executor in GCE. I saw in the slides that you were taking five to ten minutes to spin up an instance; we actually get instances in GCE in like 10 to 15 seconds.

That's pretty good. That's really encouraging. Nice job. Other questions before we wrap up? All right, thanks, everyone. Feel free to hit us up at the happy hour or offline if you have questions. Thank you. Thanks, everyone.