Okay, we'll go ahead and get started now since it's 4:30. Hello, I'm Karen, and this is Sujith. We're part of the orchestration team at Robinhood, where we manage our Kubernetes clusters. Today we'll be walking through the journey of how Robinhood's strategy for Kubernetes authentication and authorization has evolved.

To give you a little context, Robinhood as a company decided to start using Kubernetes at the end of 2018. The design decision was to run our own self-managed Kubernetes clusters on top of AWS, and that's still what we use today. We went with a multi-tenant model: each application gets its own namespace, and every application has a corresponding team that owns it. A team might own multiple namespaces if they have multiple applications.

In case you can't see it, the top of the diagram is the Kubernetes resources, and the two boxes inside are the namespaces. At the bottom I have Google groups, because at that point in time those already existed in the company. Each team had a corresponding Google group, and that group was used for things like email and Google Calendar. So we decided to leverage the existing Google groups to grant access to the Kubernetes clusters and the associated namespaces.

Because we wanted each team to have access to the namespaces they own, we created a cluster role for this, namespace-admin, which included the things a regular application team might need, like managing their pods. We also created a role binding in every single namespace that bound the team's Google group to that cluster role. That covers application teams, but you might be wondering about infrastructure teams that need to access resources across several namespaces or manage cluster-wide resources. For that, we used a cluster role called admin-role, which is very similar to the built-in cluster-admin cluster role: it grants permission to do literally anything in the cluster via wildcard verbs and resources. The Google group on the bottom right is the infrastructure team, and they're the ones with access to this.

So this was our initial design, and it worked fine when Robinhood was a small startup with very few engineers. But as the company grew, this design was no longer going to work. One of the primary reasons is that our infrastructure organization grew by 4x or more, which meant that if we kept the initial design, far too many engineers would have full cluster-admin access. We needed to change our security posture.

What I have here is a table comparing our original access "policy", and I say that in quotes because it was never formalized, mainly because it was so straightforward. We defined admin access as full permission to do anything cluster-wide, and the people allowed to have it were the infrastructure teams, well, I put it as plural, but really there was one infrastructure team in the beginning. Non-admin access, for application teams, was just their own namespace-scoped permissions. We created a new access policy because we didn't want to give so many engineers full cluster-wide access, and because it was more complicated than the original policy, we needed to put it in a formal document. One of the first things we did was modify our definition of what admin access means.
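To make that initial model concrete, here is a minimal sketch of the two pieces described above. The role names, namespaces, group addresses, and rule lists are illustrative assumptions, not Robinhood's actual manifests.

```yaml
# Sketch only: a namespace-scoped admin role for application teams,
# bound per namespace to the owning team's Google group.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: namespace-admin
rules:
  # Typical things an application team needs inside its own namespace.
  - apiGroups: ["", "apps", "batch"]
    resources: ["pods", "pods/log", "deployments", "jobs", "configmaps", "services"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-a-namespace-admin
  namespace: team-a-app        # one such binding exists in every namespace
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: namespace-admin
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: Group
    name: team-a@example.com   # the team's Google group, as resolved by the authenticator
---
# The original wildcard admin role held by the infrastructure team.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: admin-role
rules:
  - apiGroups: ["*"]
    resources: ["*"]
    verbs: ["*"]
```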
We separated this into two categories. The first is first-order admin access: someone who has access via a direct cluster role binding reference has first-order access. But that on its own isn't enough to define admin access, so we added a second category, second-order access. As an example, someone who can manage the resources in a highly privileged namespace like kube-system can be considered to have second-order admin access, because they can touch things like the API server.

After we redefined what admin access means, we also had to decide who actually gets to have it, and we wanted that to be something more deliberate than just granting it to infrastructure teams, so we wrote finer-grained guidelines for that, which I'll talk about a little more later. We kept the policy for application teams the same, because those were just namespace-scoped permissions and not a security concern.

The way we approached getting from where we stood to the new access policy was to enumerate the things we had to do and rank them in order of priority, so we could tackle the quick wins first: things that were low effort and high impact. The first step was to enforce fine-grained admin access. What that means is, instead of granting full wildcard access to infrastructure teams, we create an allow list of permissions, following the principle of least privilege. To do that, we first had to survey the existing state: which Google groups existed for the various infrastructure teams, and where they were referenced in the RBAC resources.

To start with, we looked at a few open source tools. We used KubiScan, which was a great starting point for identifying which permissions we might want to look at more closely. But we ran into a few limitations, one of which was that the results were noisy: the custom cluster role we defined for application teams, the namespace-admin permissions, would show up in the results, and since it's referenced by a role binding in every single namespace, the output was very noisy. There were also some feature gaps where it didn't handle certain edge cases well. We also looked at rbac-tool, but that had limitations too. So we ended up writing our own, which was also useful for adding custom logic like aggregating information across our different clusters.

Once we had a picture of the existing state, we could start turning fine-grained permissions into actual RBAC resources. We started out by asking teams which permissions they actually needed, but I have this little question mark here because teams had no idea what they needed. That's the problem with granting people full wildcard permissions: they never have to think about what they need access to, because they can just do anything. So we relied on audit logs from the API server to see what these teams had historically been accessing. With that, we were able to define cluster roles for each team and create the corresponding cluster role bindings. As part of this exercise of surveying the existing cluster roles, we also found several that were quite privileged and no longer used, so we had to clean those up as well.
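As a rough picture of where that exercise lands, here is what one of those per-team allow-list roles might look like once it's derived from the audit-log data. The team name, group address, and rule set are hypothetical, meant only to contrast with the wildcard role shown earlier.

```yaml
# Illustrative sketch: a fine-grained cluster role for a hypothetical infra
# team, limited to the verbs and resources audit logs showed the team using.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: compute-infra-admin
rules:
  - apiGroups: [""]
    resources: ["nodes", "namespaces", "events"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["daemonsets", "deployments"]
    verbs: ["get", "list", "watch", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: compute-infra-admin
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: compute-infra-admin
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: Group
    name: compute-infra@example.com   # the team's group as seen by the authenticator
```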
Next I'll hand it off to Sujith to talk about how we changed our access control mechanism for Kubernetes. Thank you. So far we've looked at how our RBAC is structured and how we've leveraged RBAC to grant permissions to people. Now I'm going to talk about how we designed a better access control mechanism. But before we get to that, I want to talk about where we were and what we had, to understand how we went on to design something new.

Initially, when we bootstrapped our clusters, we used Guard, a third-party webhook authenticator. The data flow looks like this: users make kubectl requests and pass their ID token as a bearer token to the API server. The API server uses the webhook configuration, which is passed as a set of flags in the Kubernetes API server configuration, to reach out to the webhook authenticator and forward the ID token. The webhook authenticator, in this case Guard, is responsible for verifying the token and extracting the relevant username from it, which for us is the email address. Guard uses Google groups, as we've already established: it first checks that the domain in the username field is a valid domain, and if it is, it makes a follow-up API call to the Google Admin Directory API and lists all the groups that this email belongs to. It injects these groups into the TokenReview object and passes it back to the API server, which uses the list of groups plus the RBAC configuration on the cluster to decide whether the request should be authorized.

So the question is: how do users get this ID token to present in their kubectl requests? Conveniently, Guard ships a binary, which you invoke by running its get-token command for the Google organization. Once you run this command, it fetches the ID token and refresh token, which are redacted here for good reason. The ID token is subject to expiry, but the refresh token lets kubectl fetch a new one whenever the issued ID token expires, without requiring any user intervention.

You may be wondering why the client ID and client secret are not redacted. Isn't it a secret, as the field name says? Turns out, no, it isn't. The reason is that Guard's OAuth client is essentially a desktop application, so the client ID and client secret are hardcoded into the binary itself, and they're only used to obtain the ID token that users present when running kubectl commands. The nice thing about an OAuth desktop app is that it only runs against a localhost redirect URL, so even if the client ID and secret are exposed, all anyone can do with them is invoke a localhost request and update their own kubeconfig file.

As you can see in the screenshot, the client ID and secret are hardcoded. What does that mean? It means this OAuth client is a single point of failure, and with any single point of failure, the risk is that if it goes down, it takes everything down with it. And that's exactly what happened to us. All of a sudden, all of our users started reporting that they couldn't fetch tokens or run any kubectl commands, and the error, very descriptively, was "the OAuth client was deleted." Luckily we have a break-glass system, so the admins were able to use it to troubleshoot what was happening in the clusters.
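For reference, the Guard side of that flow is Kubernetes' standard token-webhook mechanism: the API server is pointed at a kubeconfig-format file via --authentication-token-webhook-config-file, and the authenticator answers with a TokenReview. This is a hedged sketch with placeholder addresses, certificate paths, and emails, not our actual configuration.

```yaml
# Sketch of a webhook authentication config file handed to the API server.
apiVersion: v1
kind: Config
clusters:
  - name: guard-server
    cluster:
      certificate-authority: /etc/guard/ca.crt
      server: https://10.96.10.96:443/tokenreviews   # wherever the Guard service listens
users:
  - name: kube-apiserver
    user:
      client-certificate: /etc/guard/apiserver-client.crt
      client-key: /etc/guard/apiserver-client.key
contexts:
  - name: webhook
    context:
      cluster: guard-server
      user: kube-apiserver
current-context: webhook
---
# What the authenticator hands back on success: a TokenReview whose status
# carries the email as the username plus the user's Google groups.
apiVersion: authentication.k8s.io/v1
kind: TokenReview
status:
  authenticated: true
  user:
    username: alice@example.com
    groups:
      - team-a@example.com
      - infra@example.com
```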
Unfortunately, while Guard runs as a pod on the control plane nodes, its logs, even at trace level, only gave very basic information, like "a token has been received for review." As you would generally do, we tried turning everything off and on again, deleting all the Guard server pods and spinning them back up, but the issue was not resolved. At that point we started talking to the Google Cloud admins in the company to understand what had happened to the OAuth client. We looked at the audit logs, but we were not able to identify what happened to it. A nugget of information I want to add here: this was set up in late 2018, as Karen mentioned, and none of us were around when it was established, so we had very limited idea of where this OAuth client came from, which made it even more difficult to figure out. We then engaged Google Cloud support, but they were not able to provide any information either. So what we ended up doing was patching Guard locally, creating our own OAuth client, building a binary out of it, and shipping it to every engineer's laptop so they could start authenticating again. But our hunt for what happened to the OAuth client didn't stop there. We reached out to the Guard community to understand what happened; it's been a little over a year since we filed the issue, it's still open, and we still have no idea what happened to that OAuth client.

As you can see through these slides, we had already reached a point where the OAuth client situation was not ideal. You can also see that the tokens auto-refresh, which means access is not really short-lived, and the tokens can be retrieved without any MFA. From an operational standpoint, it was very difficult to look at the logs or understand what was going on from the Guard server's perspective. Also, Guard runs as a control plane component, so it's an additional add-on for us to maintain, and you can imagine how cluster add-on lifecycle work goes: a lot of overhead to maintain and support. With all of this, we decided to stop using Guard.

That called for figuring out what the new system should look like. So all of us came together and drew up the requirements for the new authentication and authorization system we wanted. There were three key things. From a security perspective, we wanted groups, but we didn't want them used for anything other than Kubernetes cluster access. We wanted short-lived access with MFA attached, but token lifetime is a balancing act: we don't want it to be so "secure" that it expires every minute while you're actively working, and we don't want it to be so long that it becomes a security risk all over again. The third aspect is reliability: we wanted more auditability and observability, and we wanted the identity provider to be able to handle the load presented to it. We narrowed it down to either Okta or AWS for this use case, and we wanted to use OIDC, OpenID Connect. The best part about OIDC, as quoted here from the Kubernetes documentation, is that once the ID token is retrieved, Kubernetes does not need to call back to the identity provider, so there's no overhead for us in terms of managing another cluster add-on of any kind. After debating, experimenting, and evaluating, we decided to use Okta, which is consistent with other things we use.
The nice thing about the API server is that it lets us use multiple authenticators. So for the migration, we kept the webhook authenticator as is, so that we didn't disrupt people's workflows, but we added the new OIDC flags to the API server configuration, which include the issuer URL and the client ID. The client ID is per environment; that's a conscious choice. We could have been more granular, but we chose one per environment. And we used the email and groups claims as the means of validating the tokens as well as for authorization.

Let's look at the data flow now. The user logs into the IdP, and the IdP provides the user with an ID token, a refresh token, and an access token. When the user makes kubectl calls, like before, they present the ID token as a bearer token. Now the API server looks at the bearer token and is able to validate the JWT, the JSON Web Token, that is the ID token. The API server can do this because during bootstrapping we provided those OIDC flags, so when it first starts up, it does a handshake with the IdP and fetches the keys it needs to validate the JWT signature. Once it verifies that the signature is valid, it checks whether the JWT has expired. If it hasn't, the groups in the claims we mentioned before are used to authorize the user and let them perform their actions.

This is what a sample ID token looks like: aud is the OAuth client ID, email is the email of the user performing the action, and groups is the list of groups that email is associated with. But the question comes back to: how does the user get this ID token and present it to the API server? To support this, we used another open source project called kubelogin, which is essentially a kubectl plugin that performs the OIDC login. On the client side, it uses the same issuer URL and client ID that we configured on the API server, and we make sure the email and groups claims are available in the token it obtains.

As you can imagine, with any new system there will be challenges, and even with this new system we built and migrated to, we ran into some. Surprisingly, not from a migration perspective, but from the perspective of how we do authorization. We now have these groups, and they're exclusive to Kubernetes access, so we had to figure out how to fit them into the role bindings and cluster role bindings Karen just talked about. We took the same approach as before: we talked to people and tried to understand what they wanted. We also didn't want all cluster admins to share a single group, so we wanted to break it down further and make it more granular. But the challenge there is how group ownership works. Are users granted access as part of their onboarding into the company? And what is the lifetime of that access? Initially we made it easy for ourselves and said everybody gets persistent privileged access. But very soon we said, yes, this is a good step forward for the migration, but we want to make it better. So we took on the challenge of remediating persistent privileged access. Karen?
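To make the two halves of that setup concrete, here is a minimal sketch under assumed values. The server side is the standard --oidc-* API server flags; the client side is a kubeconfig user entry that invokes kubelogin (the kubectl oidc-login plugin) so kubectl can fetch and refresh the ID token transparently. The issuer URL, client ID, and scopes shown are placeholders, not our real values.

```yaml
# 1) Extra kube-apiserver flags (for example in the static pod manifest):
#    --oidc-issuer-url=https://example.okta.com/oauth2/default
#    --oidc-client-id=kubernetes-staging
#    --oidc-username-claim=email
#    --oidc-groups-claim=groups
#
# 2) Client side: kubeconfig exec plugin entry driving kubelogin.
apiVersion: v1
kind: Config
users:
  - name: oidc-user
    user:
      exec:
        apiVersion: client.authentication.k8s.io/v1beta1
        command: kubectl
        args:
          - oidc-login
          - get-token
          - --oidc-issuer-url=https://example.okta.com/oauth2/default
          - --oidc-client-id=kubernetes-staging
          - --oidc-extra-scope=email
          - --oidc-extra-scope=groups
```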
So now that we've switched over from Google groups to Okta groups, that enables us to provide something called just-in-time privileged access. The reason we wanted this goes back to when we restructured our access policy: part of the fine-grained policy for admin access is that there are some privileged cluster-wide permissions that infrastructure teams don't need all of the time, but that they have legitimate use cases for, if a system is down, if they need to help a team debug, or if they're setting up something new. For those, we need to give users a way to request that access temporarily, only while they need it, and have it expire automatically.

I'll do a walkthrough of what that flow looks like. Up here on the screen, a user makes an access request the way they normally would, to join the Okta user group. Once they are added to that group, they refresh their Okta ID token to pick up the new group membership, and then they can use kubectl commands as normal to get the permissions that the group carries. Where the just-in-time system comes in is the automatic access revocation: we use the automatic group expiration tooling to periodically check whether the user's access request has expired. This is something we're currently rolling out and onboarding teams onto.

The initial challenges we faced were actually not related to the technology at all; they were entirely cultural. We got a lot of pushback from the teams we talked to at first, and that's expected, because who likes having permissions taken away and friction added to their process? So, similar to when we designed the Okta groups themselves, we went and talked to the users, asked them what their pain points were, presented what we wanted the user flow to look like, and listened to their feedback. We also ran a POC where people actually experienced the user flow, and we modified it based on their feedback as well.

Now I'm going to talk about some of the overall takeaways from this whole journey, which is an ongoing process. Like I just mentioned, it's really important to talk to your users. I'm not sure about your teams, but I think it's easy to work in a silo and just work on the projects you think are right for users, but it's really important to take their feedback into consideration. A caveat, though: sometimes take what they say with a grain of salt. One thing we encountered when we surveyed users about which permissions they needed and how they wanted things structured was that what they told us didn't always match the data we were looking at. So incorporate user feedback, but also look at the data and fact-check.

Another takeaway is that it's really important to implement best practices, both for RBAC in general and for infrastructure. For RBAC, one thing we're moving towards is not letting individual usernames be referenced in role bindings. We want to rely on the Okta groups instead, because the Okta groups have automatic expiration and better governance over who is allowed to be in which group.
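To tie the just-in-time flow back to the RBAC we've been showing, here is a minimal sketch under assumed names: the binding itself stays static and checked into code, and all of the "just-in-time" behaviour lives in the Okta group, which is normally empty and whose membership is granted on request and expired automatically.

```yaml
# Illustrative only: a standing binding to a privileged role, referencing an
# Okta-managed group rather than individual users. Names are hypothetical.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: jit-cluster-admin
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin              # or a narrower privileged role
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: Group
    name: okta-k8s-jit-cluster-admin   # time-boxed membership, managed in Okta
```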
So outside of testing environments, we really are leaning on the Okta groups. Another infrastructure best practice, which you're probably aware of, is to use continuous deployment and have everything checked into code. We really felt the pain of how, back when Robinhood was first starting out, some of the manual processes came back to hurt us later. When we were doing the cleanup of the unused cluster roles, it was really, really hard to trace back what those roles had been used for; we had to hunt through logs to see which ones were still in use. And a funny, but not really that funny, story: while I was doing the cleanup, I accidentally removed a cluster role that was still in use, and because it was not on continuous deployment, it had to be recreated manually and then backported into code. So yes, it's really important to check in all of your resources as code and have them on CD.

Another learning we took away, and which we implemented when we moved to Okta groups, is to have proper governance on user groups. Besides the reliability issues and operational toil we saw using Google groups with Guard, the governance of Google groups was difficult because they weren't used just for Kubernetes cluster access. As we mentioned earlier, getting added to your team's Google group is just part of the regular onboarding process, because you need it for mailing lists and calendar events. Beyond that, people might get added to other teams' Google groups because they're interested in those teams' events and want to see them show up on their calendars too. So it was pretty messy, and there wasn't an exact one-to-one mapping between who was actually on a team and that team's Google group. By switching over to Okta groups, with a better policy for who can create the groups and how people get added and removed, we moved away from those problems.

And lastly, providing a way to request temporary access is very important for the privileged access that people might only need occasionally. It's not good security practice to grant it all the time, but you still want to give users a way to get those permissions, because they do have legitimate reasons to use them and you don't want to add too much toil to their process. So those are the takeaways we had, and it's still an ongoing process, so I'm sure we will learn more. And, oh yeah, the obligatory blurb: we are hiring, so check out our careers page. And now we're opening up the floor if anyone has questions.

Thanks for the presentation. I wonder if you can elaborate on the timeline of adoption of this new authorization approach and how you scheduled it? Are you referring to the migration to the Okta groups? Yes. How did we roll it out? As mentioned before, the API server allows you to have more than one authentication mechanism, so it was very transparent from a user perspective. We added the new OIDC-based authentication and authorization, but we continued to support the old way of doing things, and we did a lot of dogfooding. We adopted it within our infra teams first and eventually onboarded the application teams, so that we would hit all the pains of doing the wrong things, fix them, and make it better for everyone.
All in, I would say it was around six months of migration from when it began to when we ripped Guard out of our ecosystem entirely. Thank you.

Forgive me if this is a stupid question, but I heard earlier that you were based in GCP. Did you have any problems with GCP project roles granting additional access to users that you weren't intending, on a project-by-project basis or anything like that? We were using GCP only for Guard; our Kubernetes clusters themselves are not managed on Google. Giving Guard a service account that can make those Google API calls is the only use case where we use GCP projects. Okay, cool. Thank you.

Are you worried about the number of groups you're adding to individual users and blowing up those OIDC tokens? Some type of principal store versus, I guess, this kind of RBAC approach? Yeah, that was definitely something we looked into when we were designing the Okta groups, and it's part of the data collection we were talking about. If we had gone with every single cluster role getting its own Okta group, then some teams that want all of those permissions would be in, I don't know, 40 Okta groups or something like that. But those are the edge cases. For the majority of users, they're in their team-based Okta group, and that's all they need. For the edge cases, we had to find a balance to avoid being in too many groups, so we decided that for certain cluster roles we would just reference the group associated with that team directly, if that answers your question. Okay.

And then, I guess, what is the process for not creating too many groups, and for onboarding a new group? Essentially, application teams don't have permission to manage role bindings, so that's something we have to do, which gives us more control there. And Okta group creation isn't something everybody in the company can do either, so there are some protections and guardrails from that perspective. If you're talking about the payload in the HTTP request, that's controlled through the regex mechanisms and the prefixes you can set up, so that the payload doesn't get too large to handle. Okay, thank you.

Thank you for a nice presentation. My question is regarding fine-grained access. Have you thought through, or do you have any plan for, say, having your cluster with different namespaces, different teams having access to those namespaces, and within that, fine-grained access on individual resources: this team has read-only access to this namespace, and within that some resources are very restricted, and so on? That's something we would love to do in the future, but for now we're more focused on the cluster-wide permissions, because those are a lot more powerful than the namespace-scoped ones. But certainly we're aware that some applications may have stricter requirements than others, so that's something we have planned but haven't worked on yet. Thank you.

Hi, thanks for the presentation. Actually, that previous question was basically identical to my own.
I guess the follow-up I had is, I'm curious how much of this was manual, as far as your team having to provision access and define the scope of access, versus how much of it could be self-service. I imagine you have a lot of teams now, and having to manually audit each request could be pretty daunting. That's a great question. When we first started, as Karen was mentioning, everybody had star access, so it wasn't a problem. Then we looked at the data and gave teams more granular access, and since then we essentially haven't received more requests, because we were very data-driven, looking at the past 12 to 18 months of data. But yes, that's subject to change as more custom resources and things like that are added. Because we have continuous deployment, it would be as simple as submitting a diff, and after that everything is handled for them. And we have a fast-paced rollout, especially for role bindings, so they can get their access quickly and don't have to wait hours or days. So they would make a PR, you would approve it, and then it would get integrated; is that what you're saying? Yeah, so far we haven't had to do it, but that's how we envision it being done. Okay, thank you.

Just some context: I'm asking from the perspective of someone who works for a cloud service provider that's going to be building out an authorization management system on top of a managed Kubernetes offering, so I'm picking everybody's brains about how some of this works. I guess my last question would be, how differently would you have designed things if you were designing this management system not for internal users, employees of your company, but for B2B, for people whose data you might not have access to? That's a very interesting question. I would probably say not a lot would change, except you would probably start with very basic read-only access, and from there people would be able to request access and add more permissions. But we probably wouldn't let them add permissions like "I want permission to create my own role bindings," for example; we would be a little more cautious there. If they want more access, they can get it, but we'd be more data-driven and probably revoke permissions that have gone unused for, say, the last 30 days, so that we maintain a clean slate over and over again. Okay, thank you very much.

Hey, thanks for the talk. I had a question, and feel free to correct me if my understanding is wrong, but you previously had Google groups, with people managing access to teams via those Google groups. Now you have Okta groups, and the Okta groups are what the role bindings reference. How are you protecting against the same people modifying those Okta groups? And you have some duplication of information there too, right? Is that something you're trying to deprecate off Google? What's the situation there? So, part of the reason we wanted to move from Google groups to Okta groups: we're still using Google groups for all of the Google things. That's amazing. But the company was centralizing on Okta groups for access to other things outside of Kubernetes, so it made sense for us to use them as well. And as for the drift question, yeah, that's definitely something that was in discussion when we were choosing the solution.
As for the management of Google groups, it wasn't a free-for-all where anyone could add or remove members. There wasn't much control over who could create Google groups, but once a team-based Google group was created, typically the manager would be the one allowed to add or remove people. And as for drift of group membership, we don't have a solution for that. Something we had talked about was syncing team membership directly from Workday, but the reason we didn't want to go that route is that it can sometimes be out of sync, and then you'd have to make manual changes there anyway, which is why we wanted Okta groups dedicated solely to Kubernetes access. And who's managing those Okta groups? That's managed by the security team. Yeah, okay, so it's well audited. Yeah, and also, even if you create an Okta group, membership is not granted by default; we still expect users to submit a request to get access to the Okta group, and there are designated owners associated with the group who accept requests to join it. For some of the critical groups and role bindings, we made sure there is more than one layer of approval, so that it's handled by some of the admins or infra people and done more carefully. Got it, makes sense, thanks, Sujith. Thanks, everybody.

So that's going to be all the questions for this session. If you have other questions, you can catch the speakers when they step off stage and we get ready for the next presentation. Thank you. All right.