Hello everybody. Just a heads up: if I sound a little scratchy, it's because last week I was battling a cold and a sinus mess, and don't worry, I was COVID testing the whole way through. I finally got through it all just to get on an airplane and sleep with masks on and everything for 15 hours, and all the crud came right back. So I'm still working through that a little bit, and you'll have to bear with me. As was just mentioned, I work at Docker now. I just started there about two months ago, so I'm still getting my feet wet, but prior to that I was a Docker Captain for about five years, so I've been involved in the Docker community space for a long time now and done a lot of conference talks. In fact, the DockerCon two weeks ago was the first DockerCon I haven't spoken at in probably seven or eight years. But what I'm going to be talking about today is the work I did while at Virginia Tech. While there I did a lot of software development, cloud, and containers, helping modernize a lot of the ways we were doing things, and in my last role there I proposed, architected, and led the creation of a common application platform, which is what we're going to be talking about today. You can find me pretty much anywhere online at mikesir87, on Twitter and GitHub and all that kind of stuff.

So what are we going to talk about today? I'll talk a little bit about this common application platform, because it sets the stage for what we're going to be working on. Then we're going to talk about multi-tenancy, because it's easier said than done, for sure. Then we're going to talk about creating our actual landlord, and we'll wrap up with Q&A. The source code for this can be found here; it's already there, and we're going to be adding to it as we go throughout the session. I'm a big believer in live demos, so we're just going to build stuff and try stuff live, and hopefully the conference Wi-Fi supports me. If not, we can fall back and that'll work too.

So first off, the goals of the effort. This common application platform was really designed around a recognition: hey, there are many teams deploying applications in many different ways at the university. Can we start to pseudo-standardize this, now that containers have given us a higher-level abstraction point? Can we have a platform that builds on top of that idea? We wanted to build this platform and make it available to all of our application development teams around the university, and at the end of the day what we wanted to satisfy was really this: you build a container image, you bring it to us, and we'll run it for you. One of the first things we had to do was identify the separation of concerns: what does the platform team own, and what do the application development teams own? We pretty much decided that the platform team owns everything below the applications, and what I mean by that is what we've got here on the slide. Let me grab the laser pointer. Okay, so the platform team would own any of the cloud infrastructure, networking, etc.
The cluster infrastructure itself (we built on top of Kubernetes), and any core cluster services, so obviously our ingress controller, the Prometheus operators, and all the cluster-wide services that are available. We would also manage all the node and compute resources, so our development teams shouldn't have to think about machine patching or anything that comes along with that. And then from there we wanted an abstraction point where the application teams can just say, here's my app, or here's my Deployment, my Service, my Ingress, or whatever, and they don't have to worry about all the other glue that holds it together. With this separation of concerns we could basically say: hey, if there are new CVEs or new security updates, whether in the kubelet, Kubernetes itself, or a machine node, okay, that's on us. If there's a problem with the application, in which a REST endpoint isn't validating users correctly, well, that's an application concern, so that's in this blue box and it's up to the application teams. So we tried to make it really clear how this was going to work out.

Then the question is, how do we actually build this abstraction point? What does that look like? We'll get into that in a second. Actually, before I do, I'd like a little bit of audience participation. Who's ever tried to do multi-tenancy in Kubernetes before? Okay, so maybe not quite a third. Who found it easy to do? That's the reaction I expected. Okay, so the first thing I want to answer is: what do I mean by tenant? Because everybody's got different answers there, and we kept the definition really, really loose. We basically just said, hey application teams, you tell us how you want us to carve things up, and as we go through this you'll see how we made various design decisions around the sandboxing of each tenant, but really it's up to the individual teams to figure out. We had some teams that said, all right, I want just dev, staging, and prod, and we'll throw all of our development in one, all of our staging in another, and all of our prod in another. Okay, whatever. We had other teams that said, I want a different tenant per application, and again, we could support any of these. To us, tenants don't cost anything; it's just extra metadata, extra policy, extra procedure, that kind of stuff. So again, we wanted to support our development teams however they wanted to do development.

Going back to what makes multi-tenancy hard: think about what could possibly go wrong. There are a lot of things that can go wrong. If you just start off by saying we'll give everybody admin access and they can deploy stuff anywhere, well then, how do you make sure they don't step on each other's toes? We don't want one team to make changes that affect another team, whether that's changing deployments or changing ingress or whatever. You don't want one team accessing the secrets and config of another team. So there are just a lot of things you've got to start thinking about to properly isolate things, and there are various capabilities built into Kubernetes, RBAC and role bindings and so on, that can help with that. But there are some things it doesn't cover. For example, if I allow teams to define an Ingress object, how do I make it so that team A doesn't get their traffic intercepted by team B, if team B defines an ingress that stomps on the hostname of team A?
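As a purely illustrative example (all names here are hypothetical), vanilla RBAC has no objection to team B creating an Ingress like this, even though the hostname belongs to team A:

```yaml
# team-b's manifest: RBAC happily allows this, but it claims team-a's hostname
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: sneaky
  namespace: team-b
spec:
  rules:
    - host: app.team-a.example.com   # a hostname that belongs to team-a
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: team-b-svc
                port:
                  number: 80
```

Nothing in the Ingress API itself prevents the collision; what happens to the traffic depends on the ingress controller's conflict handling, which is exactly why this needs policy on top of RBAC.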
So how do you make sure? According to RBAC, sure, everybody can create Ingress objects, but how do you get a little more granular than that? We'll talk about some of that as well. As a platform team, we also didn't want to allow teams to create their own NodePort and LoadBalancer services, but they should be able to create Services. So how do we get more granular than that? We'll talk about that too. And this is just a quick list; there are so many more things you've got to start thinking about once you get into multi-tenancy.

Now, this isn't a fully comprehensive spectrum here; it's something I kind of just threw together, and you'll see various opinions on this. When you start diving into multi-tenancy, you'll hear about the difference between soft and hard multi-tenancy, and really the difference is how much you trust the people who are going to be deploying things onto your cluster. As you go further to the soft side, there's more trust: we trust everybody to behave well, and for the most part that's probably true. But again, we're in a university setting where we might start working with students, and I don't know if I trust all those students out there. Or those students develop something, they've got it deployed, and now they've graduated and moved on, and some other professor is having to maintain it with no idea what's going on. So we had to think about where on the spectrum we wanted to land. As you move further over to the hard side, you start sandboxing and basically removing a lot of the trust; you're putting more validation and more constraints into place. It certainly makes things more secure, but it also makes things much more costly. For example, if I'm doing a micro-VM per pod, well, now I'm running a lot more micro-VMs and there are a lot more resources I'm having to maintain. But man, if anybody busts out of a pod, they can't jump into somebody else's pod, because they're in their own little micro-VM at that point. And there's some pretty cool stuff going on with virtual clusters; I won't get into that. Again, there's a spectrum here, and for you and your organization you have to figure out where you want to land: what's an appropriate level of constraints, controls, and permissions to grant. For us, we landed at about this point, where we didn't go to a full micro-VM per pod, but we also wanted to support the ability to group various applications, various tenants, together, maybe in their own node pool: one node pool for this team and another node pool for another team. We'll dive into that in just a few minutes.
Okay, so as we started thinking about all these different things and all the different configuration points, we decided we were going to build something called a landlord. Why did we call it a landlord? Well, because we're in a college town, and it's all apartments everywhere, so it just fit the model pretty well. When you look at an apartment and you think about the landlord having to manage it, there are a lot of things they have to keep in mind: how many buildings do they have, how many floors are in each building, how big are the units? You've got some studios, some three or four bedroom units, lots of different sizes. You've got shared infrastructure in each of these buildings, so do you trust everybody in the same building to utilize that shared infrastructure well? Or do you say no, I'm going to create another building and put these tenants over in that other one? So it matched the analogy really, really well for us. And just FYI, for those that are still standing, there are plenty of seats over here too, so feel free to come over at any point.

As we were thinking through the goals for the landlord, we wanted it to do quite a few different things. The landlord should allow us to define all of our tenants, but also do so in a way that's configurable: okay, this tenant needs slightly different rules than another tenant, so how do we set that up? We also wanted to be able to do everything declaratively and idempotently, so if I redefine all my tenant configuration, or if I add a new tenant and reapply everything, it shouldn't mess things up while I'm making updates. We should also support version control and history. It's actually kind of cool, because many of these principles are exactly what we heard earlier today with the OpenGitOps principles that are coming out, and no, I didn't have a sneak peek beforehand; it's just neat to see how they line up. The last point here was pretty big for us, because for the most part our core platform team didn't have a lot of developers, so the last thing we wanted to do was say, hey, we're going to create some CRDs, we're going to create a controller, and we're going to have to manage all this stuff ourselves. We wanted to be able to define a landlord in a way that utilized tools we were already familiar with, again without having to do custom programming and maintenance. As we did some research, and after actually deploying tenants a couple of times, we realized Helm just solves all four of those, and we'll see this in just a minute, but it checks all the boxes for us. We could create a chart that is basically our landlord Helm chart, give it a custom values file that defines the tenants we need, and then customize each tenant based on its specific needs. And with that, again, I'm a big believer in programming on the spot, so we're going to build a landlord Helm chart together and walk through the process.

Okay, first things first: every tenant needs a namespace. Hopefully everybody at least knows that at this point; you should give every tenant their own namespace. So we have a sample values YAML file here.
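A minimal sketch of that values file and the namespace template, assuming hyphenated tenant names like the ones in the demo (the exact keys in the real chart may differ):

```yaml
# values.yaml (sketch): one entry per tenant; per-tenant config goes in the map
tenants:
  team-awesome: {}
  team-cats: {}
  team-dogs: {}
```

```yaml
# templates/namespace.yaml (sketch): one Namespace per tenant
{{- range $name, $config := .Values.tenants }}
---
apiVersion: v1
kind: Namespace
metadata:
  name: {{ $name }}
{{- end }}
```

Running `helm template` over this renders one Namespace object per tenant, which is exactly what the demo shows next.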
In that values file I have a top-level key of tenants, and then team-awesome, team-cats, and team-dogs underneath. Starting from there, I can have a Helm template that does a range to iterate through all the tenants, pulling out the key-value pairs, and basically creates this Namespace object once for every tenant. Pretty simple. So let's actually do that. First thing I need to do is create a new window here and create a new chart; we're literally going to start this from scratch. Create a landlord chart, and I'm just going to remove most of the stuff that's in here. I did that backwards; let's get rid of all that, and clean that up too. Okay, so I'm going to create a namespace.yaml, paste that in, and rename this to values. Sample tenants: team-awesome, team-cats, team-dogs. Now, if I template this out, we'll see that as expected I've got a namespace for team-dogs, one for team-cats, one for team-awesome. So at this point I've got a really basic Helm chart. I'm super on the soft side, and I would just give everybody credentials to this and hooray, I've got a landlord, or multi-tenancy, hooray. Obviously it's not a good setup yet, but let's go ahead and install this. Oops, I do that all the time. All right, so that's been installed, and if I look at my namespaces, I see my three team namespaces there. Cool.

So let's keep moving through our spectrum. The next thing we have to ask is, how are our tenants actually going to deploy things into our cluster? It'd be pretty bad if I'm at GitOpsCon and I didn't say GitOps, right? Otherwise it'd be like, wait, why am I speaking here? So yes, we did decide to use Flux for our tooling. We actually had an internal bake-off, back in the Flux v1 days, between Flux and Argo, and we had one team even say, well, we're just going to hand out credentials, which was a terrible idea. We went through the migration from Flux v1 to v2, and it's been just awesome. What we did as a platform team is, any time a new tenant needed to be created, we would create a manifest repository for that tenant. We would create the repository, hand it to them, and basically say, hey, you've got maintainer rights to this, and whatever you put in there is what we're deploying. So we owned those manifest repos, and then we gave them back to the application teams to do whatever they wanted with. Some app teams went through full CI pipelines; we were a GitLab shop, so they would take GitLab CI, clone that manifest repository, push new manifests into it, commit, push, all that kind of stuff, and have it as part of a fully automated flow. And then we had other teams that were like, we don't trust any CI/CD, so we're just going to make all of our changes manually. Okay, cool. For us, one of the big advantages of using GitOps is that, again, it allowed us to support all the different ways teams wanted to deploy things. We had other teams that wanted a bit more change management process, so okay, cool: here's your manifest repository, and they put a whole pull request review process around it, internal to their team. Again, we as a platform team don't care how your manifests get updated, as long as whatever's in the main branch is what you want. That's what we treat as the source of truth; however it gets there is up to you.
That was one of the really gratifying things for us as we started seeing more of our customers using this: GitOps was the right choice here. But in order to do this as a platform team, there are quite a few things we need to do. For each tenant, we need to create a service account within that tenant's namespace that has access to create resources within that namespace. This is really important, so that you don't accidentally say, well, hey, in my manifest repository I've got an object that's actually going to deploy something into somebody else's namespace. By creating a service account within the tenant's namespace, it sandboxes them; it keeps them within their namespace, so even Flux can't apply things outside of it. Then we create a GitRepository that fetches the source material, and a Kustomization that applies the manifests. Here are some quick examples for team awesome, and again, these are some of the things we can adjust over time: some teams just dropped their manifests at the root of their repo, while others had a whole directory structure, so we wanted to be able to customize the paths for each of the different groups. So let's go ahead and plug that in (there's a sketch of these templates below). What I'm going to do first is get all the RBAC plugged in. What this does is, again, loop through all the tenants, creating a service account within each tenant's namespace, plus a role binding that says, for that service account we just created, give it the admin ClusterRole. But since it's a RoleBinding, it's giving them admin access just inside that namespace. One of the reasons we did this is that there are a lot of third-party tools, cert-manager being one example, that when they create cluster roles use the aggregation feature of Kubernetes, so that by having admin I can create those other objects as well. So this was a good thing to do here. The other thing I need to grab is our actual Flux config objects. In this case we're going to create a GitRepository and a Kustomization. Again, our platform team created a separate repository for each tenant, but for demo purposes I don't want to create tons of different repositories, and I also want each of you to be able to take the source code and run with it without creating tons of repositories. So there's an adjustment here in which I'm saying, okay, the URL is this GitHub repository, and I'm changing the paths that each of our tenants are going to use: team-manifests/awesome, cats, dogs. What this is going to do, once we apply it, is use the directory in this code repository called team-manifests, which is basically a simulation of those individual team repositories. In team-cats, for example, if I open it up, I'll see a certificate, a deployment that's going to deploy a silly little cats app, an ingress definition, etc. Same thing for the other teams. So again, this is just a single-repo simulation of what would normally be spread across several different repositories.
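A rough sketch of the two templates described here. The service account name is hypothetical, the Flux apiVersions depend on your Flux release, and per-tenant paths would be set in the values file (e.g. `path: ./team-manifests/cats`):

```yaml
# templates/rbac.yaml (sketch): per-tenant ServiceAccount + namespaced admin
{{- range $name, $config := .Values.tenants }}
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: flux-deployer              # hypothetical name
  namespace: {{ $name }}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: flux-deployer-admin
  namespace: {{ $name }}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: admin                      # the built-in, aggregated admin ClusterRole
subjects:
  - kind: ServiceAccount
    name: flux-deployer
    namespace: {{ $name }}
{{- end }}
```

```yaml
# templates/flux.yaml (sketch): per-tenant GitRepository + Kustomization
{{- range $name, $config := .Values.tenants }}
---
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: {{ $name }}
  namespace: {{ $name }}
spec:
  interval: 1m
  url: https://github.com/example/landlord-demo   # one repo per tenant in production
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: {{ $name }}
  namespace: {{ $name }}
spec:
  interval: 1m
  path: {{ $config.path | default "./" | quote }}  # per-tenant path override
  prune: true
  serviceAccountName: flux-deployer   # confines applies to this namespace
  sourceRef:
    kind: GitRepository
    name: {{ $name }}
{{- end }}
```

The important detail is `serviceAccountName` on the Kustomization: Flux applies the tenant's manifests as that namespaced service account, which is what keeps a tenant from deploying into anyone else's namespace.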
Now if I, sorry, I apparently make a lot of sound effects when I'm coding, if I do the helm template now, I'll see the Kustomization and all of that being properly filled in. So let's do an upgrade. This is the only time I'm actually dependent on the internet, so hopefully it's working. Okay, the URL, was there one up here? Okay, it looks like it's working out; I'll just hold that here just in case. All right, so we've at least been able to fetch the revisions, and if I look at the Kustomizations, I see they were applied. So if this worked, let's open a new tab here: cats, let me drop out of full screen, cats.local. Okay, and there's my app. It just displays a random GIF, hooray. So the demo is working, and it keeps giving me a random GIF every time I refresh. And if I go to dogs.local, it's basically the same app, but instead I get dogs now, and it's working. So again, hopefully you see that just by figuring out how to template out the tenant configuration, I'm able to spit these out very quickly.

Okay, so let's talk about locking things down a little bit. Quick show of hands: how many people have used Gatekeeper before? Okay, quite a few hands, I'd say about half. So, Gatekeeper. The first thing I'd say is: don't try to write your own policy engine. Just don't. Gatekeeper, for those that aren't familiar, is basically a Kubernetes wrapper around the Open Policy Agent engine. With OPA, the Open Policy Agent, you can write policies using a language called Rego, and what Gatekeeper does is wrap that so it's basically an admission controller. All requests coming through the Kubernetes API to change state, whether creating a new object or deleting an object or whatever: Gatekeeper can be notified, and you can add your additional policy. Remember earlier how I said I want to allow teams to create Services, but I don't want them to be able to create NodePort services? Well, Gatekeeper lets me do that. I can write policies, and as Services are being created, it will evaluate them against the policies and say, hey, should this thing actually be accepted or not? So one of the things we do as a platform team is create Gatekeeper policies that satisfy all the pod security standards, at least at the baseline level. We didn't allow any of our tenants to run privileged pods, or mount the host filesystem into the container, or use host namespaces: basically all those standard pod security policies. What we want to say is: all right, landlord, you're in charge of making sure all these policies are actually defined. And again, I'm not going to try to make this a full Gatekeeper talk, but once I've defined the policy, I can create an object, in this case a K8sPssBaselinePrivilegedContainer object, and apply it to all the namespaces being used by tenants (sketched below). Within those namespaces, it's not going to allow any privileged container to run. And after that, we can have some parameterized ones. Going back to the idea from earlier that we don't want to allow teams to take ingress names they're not authorized to use: well, that's just another Gatekeeper policy at this point.
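As a sketch, the landlord template for that constraint might look like this. The constraint kind follows the naming used in the talk; the closest equivalent in the upstream Gatekeeper policy library is K8sPSPPrivilegedContainer, so treat the kind name here as an assumption:

```yaml
# templates/gatekeeper-pss.yaml (sketch): one constraint covering all tenant namespaces
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sPssBaselinePrivilegedContainer
metadata:
  name: deny-privileged-containers
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    namespaces:
      {{- range $name, $config := .Values.tenants }}
      - {{ $name }}
      {{- end }}
```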
Gatekeeper policies allow us to provide additional parameters, to say: this tenant is allowed to use domains A, B, and C, while another tenant may be allowed to use X, Y, and Z. When the policy runs, it uses those parameters as part of the enforcement. So for example, I've got team cats over here that's able to use cats.local.mikesir87.training. Now if team dogs tries to use that name (it's obviously being cut off a little bit here), they wouldn't be allowed to; when the policy runs it would prevent anybody else from using that domain name, and Gatekeeper, being an admission controller, would totally block that from happening.

So let's actually do this. In my Git repo here, and again, you're welcome to try this out, let me close some of these out, I have a gatekeeper-policies chart. One of the things I'll say is that this is just a sample. Don't take this Gatekeeper policy chart and say, hey, Michael said use this, and therefore we're now secure because we're using his policies. Heavens, no. It's just a sample. But we can look at an example here: the host filesystem one, where here's some of the Rego in which we're going to extract all the volumes out of the object and look to see if any of them has a field called hostPath. If it does, then hey, that's going to be denied, and we prevent that from happening. So this chart just defines all those policies and all the Rego. Most of these I inlined directly into the object, but one of the things we learned as a platform team was that it actually made a lot more sense to extract the Rego out into separate files in separate directories. For example, the authorized-domains policy I was just talking about, we have extracted into a separate file here. The big reason for that was to support CI/CD, so that any time we make changes to any of our policies, our CI pipelines can run all of our tests for it, and then as part of the Helm build it would just read the file and inline it into our Helm chart. There's an example of that happening here as well, if anybody wants to dive into that.

Okay, so let's go back to our landlord chart. I'm going to create a template called gatekeeper-pss, for pod security standards, which is just going to apply those pod security standard objects. Again, this is only three of them; there are more pod security standards. And then the last thing I'm going to do is one more file, and this will be the authorized domains. We'll go over here and put domains: cats.local.mikesir87.training. We'll just do one tenant here. Okay, we'll do a helm upgrade, and it looks like I've got something wrong. Oh, whoops, I had to make one more change here: tenants. One of the things we did from the start is we would authorize domains of the form tenant-namespace dot some identifier. So in this case, I'm basically saying, hey, team awesome is automatically authorized to use team-awesome.tenants.local.mikesir87.training. So here's a default domain you can use, and you don't have to worry about anything else. We would wire this up within our own cluster DNS and all that kind of stuff, so if teams just want to prototype things quickly, they can do that.
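For reference, the hostPath check he describes would look roughly like this as a ConstraintTemplate with the Rego inlined, the way the talk says most of the policies were written. This is a sketch, not the actual policy from his chart; the template and kind names are mine:

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8spsshostfilesystem
spec:
  crd:
    spec:
      names:
        kind: K8sPssHostFilesystem
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8spsshostfilesystem

        # Extract every volume from the incoming pod; deny any that sets hostPath
        violation[{"msg": msg}] {
          volume := input.review.object.spec.volumes[_]
          volume.hostPath
          msg := sprintf("hostPath volume %q is not allowed", [volume.name])
        }
```

And the parameterized authorized-domains constraint, templated per tenant, might be sketched like this, with each tenant's list coming from its values plus the automatic default domain (the kind and parameter names are hypothetical):

```yaml
# templates/gatekeeper-domains.yaml (sketch)
{{- range $name, $config := .Values.tenants }}
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAuthorizedDomains
metadata:
  name: {{ $name }}-authorized-domains
spec:
  match:
    kinds:
      - apiGroups: ["networking.k8s.io"]
        kinds: ["Ingress"]
    namespaces: [{{ $name }}]
  parameters:
    domains:
      {{- range $config.domains }}
      - {{ . }}
      {{- end }}
      - {{ $name }}.tenants.local.mikesir87.training   # automatic default domain
{{- end }}
```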
Now let's do our helm upgrade, and what we should be able to do now: I'm going to go back over here to team cats and try to break this a little bit by running a privileged pod. Normally I would commit this and push it up and let it go through the GitOps flow, but just due to time I'll do it manually here. So, team cats, and, oh whoops, that shouldn't have worked. Hold on, say that one more time? Yeah, that shouldn't matter. Wait, oh, gotcha, I see what you're saying: I put that file in the wrong spot. For those that didn't catch it, I made the gatekeeper pod security file, but it wasn't in the templates directory; I accidentally had it in the wrong spot, so it didn't create that policy. Let me delete that first. Okay, it's still being deleted, but anyway, now it will actually block it. So again, you just want to think about what policies you want and how you can plug them into your landlord to template them out.

Now, one thing I want to mention real quick, and I know we're short on time, is breaking up workloads, and one of the things we did here, again, was use node pools. Kubernetes doesn't actually have a built-in node pool concept, but you can use taints and tolerations, node affinities, and node selectors to build out node pools yourself and basically get that capability. What we did was basically say: all right, we're going to create node pool A over here, with taints and labels that reflect team A, and another one over here for team B, and then we're going to mutate pods with the tolerations and node selector to put them where they need to be. So team A's pods go to team A's nodes, team B's pods go to team B's nodes, and so on. There are a variety of tools that can do this, but we quickly adopted Karpenter, and we're big fans, because Karpenter allows us to define all these node pools and provisioners using config. As an example, here's a provisioner in which I can put the various taints and labels that need to be on those specific nodes. Then, to force the pods into the node pools: Gatekeeper can also be a mutating controller. So we would say, hey, this particular tenant is going to be on the node pool named demos, and we'd create a Gatekeeper mutation so that all pods spinning up in that namespace get the node selectors and tolerations that put them into that pool. What that means is the application developer teams didn't have to know or care that this was happening behind the scenes, and that was really powerful. Again, the teams didn't have to add these taints and tolerations and node selectors and all that kind of stuff; we did it automatically for them, just by them running on the platform. And again, this was something we could script out and put into our landlord.

Just wrapping up, a couple of other things we did in our landlord: we ran Filebeat in our cluster, and we would gather up all the logs and send them off to Splunk, where every team had different Splunk indexes they wanted to use. So cool, we'd just say, tell us the annotations you want, and we'd mutate those on as well, so all pods that spun up in their particular tenant namespaces would forward logs off to the right place.
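Roughly, that pairing could look like the sketch below. It assumes Karpenter's older v1alpha5 Provisioner API (newer Karpenter releases use a NodePool resource instead) and Gatekeeper's Assign mutation; the names are illustrative, and a second Assign at spec.tolerations would add the matching toleration:

```yaml
# A Karpenter provisioner carrying the team's taint and label (v1alpha5 API)
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: team-cats
spec:
  labels:
    tenant: team-cats
  taints:
    - key: tenant
      value: team-cats
      effect: NoSchedule
---
# A Gatekeeper mutation that stamps the node selector onto every pod
# created in the tenant's namespace
apiVersion: mutations.gatekeeper.sh/v1
kind: Assign
metadata:
  name: team-cats-node-pool
spec:
  applyTo:
    - groups: [""]
      versions: ["v1"]
      kinds: ["Pod"]
  match:
    scope: Namespaced
    namespaces: ["team-cats"]
  location: "spec.nodeSelector"
  parameters:
    assign:
      value:
        tenant: team-cats
```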
We also integrated RBAC with our IDP on campus, and this is a pretty common pattern: we used kube-oidc-proxy and went from there. One last thing: obviously in this demo I was doing everything with just helm install and helm upgrade, but you can use a HelmRelease and go that way, and that's definitely how we did it. One thing I'd recommend is to take that values sample and deploy it as a ConfigMap, and then have the HelmRelease reference that ConfigMap with valuesFrom and deploy from there. The big advantage is that we could then take that ConfigMap, the raw values, and do a helm template to render everything out, just like we were doing here. (There's a sketch of that pattern at the end of this transcript.) So with that: GitOps is awesome, and the landlord has helped us scale up and down as tenants need, and to spin up workshops: great, we're going to do a workshop with 50 tenants, and we can just spin that up and go. And with that, I thank you.

[Moderator] Hey, if the mic folks are here, can you get mic'd up? Thank you. Any questions for Michael? We have about two minutes.

[Audience] You said you use Gatekeeper, which is basically admission webhooks. Thinking in the GitOps pattern, that would ultimately block the release for tenants that use an invalid configuration, right? So what's your preference: actually blocking the full sync, or allowing it in and then afterwards having a controller that says, I can't just take this and do something with it, or I need to ignore it?

[Michael] Yeah, that's a good point. One of the things we did is we actually built a dashboard, a web-based UI, where all the tenants can log in. It was set up for a GitOps flow, so the dashboard is fully read-only: here are your deployments and all that kind of stuff. But in the top left corner was a status of the Kustomization, so it would tell you what commit had synced, and whether it was successful or not. So if you committed, say, an ingress object with a wrong domain, or a NodePort service, anything invalid, then yes, the Kustomization sync would fail because the policy prevented it from being applied. But then we'd have that right there on the screen, along with the error message you got from Gatekeeper. So yes, we did block that sync, because the admission controller blocked it, but we also gave them visibility into why it didn't happen, so they didn't have to drop into the CLI to figure it out.

[Moderator] All right, and I think we'll switch over, but, oh, yes.

[Audience] Hi. What is the real advantage of having a separate node pool per tenant, as compared to the entire cluster?

[Michael] Yeah, that's a great question. We had a couple of different reasons. Some teams didn't trust other teams to be running on the same nodes with them, so that was part of it: how much cross-team trust was there. But we also used separate node pools as a cost accounting measure. We would spin up each of the different node pools and put cost allocation tags on them, since we were in AWS, and then we could know, hey, this tenant costs us this much to run workloads on our infrastructure.

[Audience] Why not multiple clusters?

[Michael] Because we didn't have a big enough team that wanted to run multiple clusters.

[Moderator] Awesome, all right. Thank you.

[Michael] Thank you, everybody.
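Here is a sketch of the HelmRelease-plus-ConfigMap pattern mentioned in the wrap-up, assuming Flux's helm-controller; the resource names, namespace, and chart path are illustrative, and the apiVersion depends on your Flux release:

```yaml
# ConfigMap holding the raw tenant definitions (the landlord values file)
apiVersion: v1
kind: ConfigMap
metadata:
  name: landlord-values
  namespace: platform            # hypothetical namespace
data:
  values.yaml: |
    tenants:
      team-awesome: {}
      team-cats: {}
      team-dogs: {}
---
# HelmRelease that pulls its values from the ConfigMap above
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: landlord
  namespace: platform
spec:
  interval: 10m
  chart:
    spec:
      chart: ./charts/landlord   # chart path within the Git source
      sourceRef:
        kind: GitRepository
        name: landlord
  valuesFrom:
    - kind: ConfigMap
      name: landlord-values
      valuesKey: values.yaml
```

Because the values live in a plain ConfigMap, you can pull that same data down and run helm template against it locally, previewing exactly what the landlord will render, which is the advantage described in the talk.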