OK, we'll go ahead and get started. Thanks, everyone, for joining today. Today we'll be talking about role-based access control, or RBAC. We'll assume that because you're here, you already understand why RBAC enforcement is useful and why it's necessary for your cloud. So we won't spend time on its benefits; we'll focus on how to safely deploy RBAC in a production environment without disrupting your users. Please hold your questions till the end. We'll have about five or ten minutes for questions.

So this is us. We all work for Symantec. I'm Brad, and I work primarily on Horizon. This is Karthik, who works on our SDN solution, and Tim, who works on Keystone.

This is our agenda for today. First, we'll talk about why it's so hard to implement RBAC. Then we'll go through modifying the Neutron policy, then the Designate policy, then the Keystone policy, then the Horizon changes we had to make to make all this work better, and then the lessons we learned along the way. Before I go further: some changes have merged into Mitaka, and going forward, policy enforcement will become easier. A lot of people in the OpenStack community are working on making this simpler than what we had to deal with and what you have to deal with today. But assuming you're like us and you have to secure your cloud now, this is what we went through.

So when you first think about implementing RBAC, it may seem very straightforward. You just make changes to your policy files, and then you roll them out. But when you get into the details, you realize there are a lot of things to think through. For one thing, just the number of APIs. This isn't even our full set of spreadsheets, but here, down the left side of the spreadsheet, each row is an API that needs to be secured. And across the top row, we've got the roles.
So each of these squares represents a role and API pair that we had to think about: which APIs that role should have access to. The number of APIs to think about is enormous.

The next thing is the dependencies between components. This isn't our real dependency graph, but any OpenStack production environment will look something like this. Nova depends on Neutron, Glance, and Cinder. Neutron depends on Designate. Glance depends on Swift. Horizon depends on everything and needs copies of those policy files. And everything depends on Keystone. So if you make a change and a user suddenly can't access one of these pieces, you can have a long chain of breakage that is very difficult to debug.

The next thing is that this is our production cloud we're doing this in. To a user, if we break them, "I can't access Keystone" is the same as "the cloud is down." Your customers don't want to be QA for you. So if you have a test environment, make your RBAC changes there, and want customers to test them out, that's not going to work well. They have other priorities, and testing your RBAC changes is not high on their list. So we need ways to secure the cloud without complete testing from our users and without breaking what they're currently doing. Next I'll turn it over to Karthik to get into some details.

Thanks, Brad. Before we look into Neutron policies, a brief overview of how we handle roles at Symantec. If you've done any sort of OpenStack deployment, you're probably aware that OpenStack comes by default with the admin and the member role. That doesn't really work for us. So we effectively have three roles: admin, operator, and member. And we have different service scopes: for Nova you have the compute scope, for Neutron the network scope, for Glance the image scope, and so on. You combine these and you get compute admin, compute operator, and compute member for the compute scope, and likewise for network and image.
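As a small illustration of the scope-and-level scheme just described, the role names can be generated as a cross product. This is only a sketch: the scope and level names come from the talk, while the underscore-joined spelling (for example `network_admin`) and the function name `scoped_roles` are assumptions made for this example.

```python
from itertools import product

# Scopes and levels named in the talk; the underscore-joined
# spelling of the resulting role names is an assumption.
SCOPES = ["compute", "network", "image"]
LEVELS = ["admin", "operator", "member"]


def scoped_roles(scopes=SCOPES, levels=LEVELS):
    """Return every scope/level combination, e.g. 'network_admin'."""
    return [f"{scope}_{level}" for scope, level in product(scopes, levels)]


if __name__ == "__main__":
    for role in scoped_roles():
        print(role)
```

Three scopes and three levels give nine role names here; each additional service scope adds three more columns to a policy spreadsheet like the one shown.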
We also have a special role called the service role. It is intended only for when one service wants to talk to another service, and it's not intended to be given to users.

Right, so implementing the Neutron policy. We deal with the network_ scope of roles. The biggest reason we needed this sort of policy implementation is that if you give somebody the admin role on a project, they get these magical superpowers: they can do any sort of operation on any tenant, even outside their scope. This breaks your control plane isolation, and we didn't want that. So we have the network admin role to tie the scope down to a project. They can still do all the admin tasks, but only within their project.

Specific challenges with our implementation. We run a fairly old release of Neutron, so we only have a very small subset of the policy language to work with. Neutron also gives you per-attribute access controls. This is a double-edged sword: you can fine-tune your policies, but you also have a large set of APIs to work with. And of course, you have to deal with dependencies on Designate, Nova, and Horizon. So if you're trying to come up with a policy in Neutron, make sure it plays well with Designate, Nova, and so on. As for backward compatibility, we decided not to do a backward-compatible policy file. We did something else, which I'll talk about later. The policy file from our Neutron work is available at that URL, along with other tools and utilities.

So I know that's a very large spreadsheet, but bear with me. You have all the API calls on the left and all the roles on top. Anything in green means you can access the API. Anything in red or pink means you can't. This covers everything: all the attributes, LBaaS, security groups, the whole deal. If you zoom in a little more, you can see it better. So let's take a sample.
The first row there is the get network API call. Pretty much everybody can do that. But if you take something like, I think it's there, create floating IP, only admin or operator can do that.

So moving on to validation and testing. The first thing you want to do is validate your policy file. Make sure it's valid JSON. Make sure you don't have any typos, such as rule versus role; it's very easy to get that wrong. Make sure you have all your macros correct. The next step is to test your policy file. Make sure you account for every role, every API, and every attribute.

Once we got done with validation and testing, we figured users were going to ask us, hey, can I access this API in this project? So we came up with a tool called check access. You give it a policy file, a username, a target project ID, and optionally an API method, and it'll spit out whether that's allowed or not.

And finally, policy deployment. We deployed the policy file on all regions at the same time, and we coordinated with Horizon. This ensured a uniform user experience across all regions, whether the user accesses Neutron over the API or through Horizon. But before we deployed the Neutron policy, we had to go through a migration step. Migration here means that we took users with an older role and assigned them a newer role. For instance, anybody who had admin got network admin, anybody who had member got network operator, and so on. We have a replicated Keystone in our setup, so we only had to do this migration once, in one region, and it was replicated across all regions.

So that's it for Neutron; on to Designate. When we started off with Neutron, we really had no idea where to start, and it took a little while to get used to it.
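Going back to the validation step for the Neutron policy file: the checks mentioned there (valid JSON, rule versus role typos, macros that actually exist) could be automated along the following lines. This is a minimal sketch using only the standard library, not the speakers' tooling; the function name `validate_policy`, the list of suspicious prefixes, and the sample rule names are all assumptions for illustration.

```python
import json
import re

# References of the form "rule:some_macro" inside a rule string.
RULE_REF = re.compile(r"\brule:([\w\-]+)")
# Prefixes that are most likely typos of "rule:" or "role:" (assumed list).
SUSPECT = re.compile(r"\b(rules|roles|rol|rul):")


def validate_policy(text):
    """Return a list of problems found in a policy.json document."""
    try:
        policy = json.loads(text)
    except ValueError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(policy, dict):
        return ["top level of the policy file must be a JSON object"]

    problems = []
    for name, rule in policy.items():
        if not isinstance(rule, str):
            continue  # list-style rules are out of scope for this sketch
        # Every referenced macro must be defined somewhere in the file.
        for ref in RULE_REF.findall(rule):
            if ref not in policy:
                problems.append(f"{name}: undefined rule '{ref}'")
        # Flag prefixes that look like misspellings of rule:/role:.
        for bad in SUSPECT.findall(rule):
            problems.append(f"{name}: suspicious prefix '{bad}:'")
    return problems


if __name__ == "__main__":
    sample = '{"create_floatingip": "rule:admin_only or roles:network_operator"}'
    for problem in validate_policy(sample):
        print(problem)
```

A check like this only catches structural mistakes. Whether each role can reach each API still has to be tested against the matrix, as described in the talk.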
For Designate, we'd gained some valuable experience during the Neutron implementation, and that really helped. It made the implementation a little more straightforward. We again deal with the network_ scope of roles. The biggest challenge with Designate is the all tenants flag. It's used throughout the code in Designate. So if you're trying to design a policy for a certain domain or a certain record, this all tenants flag is going to get in your way; keep that in mind when you're implementing this. Also, we had to support both the V1 and V2 Designate APIs, which made the set of APIs to deal with quite large. As for backward compatibility, we decided not to do anything in the policy file. Instead, we did the migrate-existing-users trick. This time around, though, we didn't really need it, because we had already done that during the Neutron implementation. The policy file again is available at that URL.

That's what the policy matrix for Designate looks like. Same deal here. The first column lists the APIs. The first row gives you all the roles. V1 is on top, V2 is at the bottom. If you zoom in a little, the first line there is the all tenants flag. We only allow that for the cloud scope, so only cloud operator, cloud admin, and cloud member can use it. Everything else is as shown. We had to make some code changes to Designate to support this sort of RBAC implementation, which we hope to send upstream soon. That's it from me; over to Tim for Keystone.

Thank you, Karthik. To give you some context, the way the cloud works at Symantec is that we have the cloud team that maintains it, and all of our end users are the different product teams within Symantec who want to deploy onto our cloud. Since we're all in the same company, we actually have two competing goals.
One, we want to give our end users as much power as possible so that they can do whatever they need to do, whenever they need to do it, without having to file a ticket and wait for us to get to it. But the other side of that coin is that we owe it to every team never to give away so much power to an end user that someone on Alice's team can, for example, accidentally step on something in Bob's team.

To that end, we identified nine unique roles from Keystone's perspective. We have three levels. The cloud level is intended for the cloud team. The domain level is intended for the leaders of the product teams. The project level is intended for most people. On each of these three levels, we have the admin role, which is full CRUD: create, read, update, delete. We have the operator role, which is the same except without delete. And finally, we have the member role, which is read-only. At the bottom of the screen, you'll see a link to our Keystone policy file, available on GitHub. Please keep in mind that it is based on the Kilo release, as is this spreadsheet.

So you've seen this before. On the left are the APIs. At the top, you have the nine roles just discussed, plus the service role. Everything in green is something that we intended to give away. Unfortunately, reality wasn't quite so simple, so we had to come up with a way to track what we originally wanted, what we actually have, and why the two differ. So we have a simple color scheme. Red represents functionality that we intended to give away, but we couldn't see a way to do so while still maintaining strict security between domains. Blue represents functionality that we didn't originally intend to give away but that was required in order to enable functionality that we did. And gray represents functionality that we wanted to give away but could not, due to a bug that was open at the time.
So our very first plan involved backwards compatibility. That's the big difference you'll hear in the Keystone experience as compared with Neutron and Designate. We made a conscious choice right at the start: since Keystone is so central to everything, and since this is our real production cloud with real people really using it, we would be happy to spend additional time and effort so long as it made the process safer. And it seemed like backwards compatibility would be a good way to do that.

The plan was this. You have the original policy file that only respects the old roles. We add in all the new roles with their new rights. Then we test it very, very carefully, with a keen eye on ensuring that the functionality of the old roles doesn't change at all. Once we're sure of that, we deploy it into production. Then we approach the product teams individually, at a time that's convenient for them, and say, hey, we should switch you over to the new roles; you can verify that you still have everything you need and it still works. We would iterate on this process, going from team to team, until either we had a high enough degree of confidence to batch-convert everyone else, or we ran out of teams. At that point, we could take the final policy file that only respects the new roles, deploy that, and we'd be done.

Well, there are a couple of problems with this plan. The first is that you now need this weird interim hybrid policy file that knows about both the old roles and the new roles. It seemed at the time like there ought to be a way to write this policy file such that going from it to the final version, the one that only respects the new roles, would be quick and simple. This is our attempt.
The idea is that at the top of the policy file we have macros, such as "admin or cloud admin", which refers to both the old and the new role. So long as the body of the policy only refers to the macros and never to the roles directly, going from this interim version to the final version should just be a matter of removing these macro definitions and replacing all references to them with the new role only.

Why this didn't really work in reality: especially since we had the goal that the old functionality could not change, we found that we were carrying forward a lot of cruft. Things were kept a certain way only because they used to be that way and the simple plan required it. And since we were spending so much time rewriting the entire policy file anyway, it really seemed like we should end up with the policy file that we actually wanted, not one with a bunch of stuff that's there just because. Once you pick all those nits, in the end you're really doing two complete implementation cycles. And of course you have two complete testing cycles and two complete deployment cycles. Now, like I said, all of this would be fine; we made the choice to spend the extra time so long as it makes the process safer. And that's the fatal flaw of this first plan: all of these steps, when compared to the plan that we eventually succeeded with, don't gain any safety at all.

So here's what we ended up doing, and it worked very well. We have a migration script available on GitHub. It iterates through the existing role assignment list. It finds the four roles that we really cared about bringing forward (admin in domain, admin in project, member in domain, and member in project) and additionally assigns the equivalent new-style role. So if you previously had admin in domain X, you would run this script and it would also give you domain admin in domain X.
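The core logic of such a migration can be sketched as a pure function over a list of role assignments. This is not the speakers' script: the tuple layout, the function name `plan_migration`, and three of the four new role names are assumptions. Only "admin in a domain becomes domain admin" is stated in the talk; the other mappings are filled in by analogy. A real script would read and write assignments through the Keystone API rather than plain tuples.

```python
# (old role name, scope type) -> new-style role name.
# Only the first entry is stated in the talk; the rest are assumed by analogy.
ROLE_MAP = {
    ("admin", "domain"): "domain_admin",
    ("admin", "project"): "project_admin",
    ("member", "domain"): "domain_member",
    ("member", "project"): "project_member",
}


def plan_migration(assignments):
    """Return the new assignments to add. Nothing is ever removed.

    Each assignment is a (user, role, scope_type, scope_id) tuple.
    """
    existing = set(assignments)
    to_add = []
    for user, role, scope_type, scope_id in assignments:
        new_role = ROLE_MAP.get((role, scope_type))
        if new_role is None:
            continue  # not one of the four roles being carried forward
        candidate = (user, new_role, scope_type, scope_id)
        # Skip anything already granted so the plan is safe to re-run.
        if candidate not in existing and candidate not in to_add:
            to_add.append(candidate)
    return to_add


if __name__ == "__main__":
    current = [
        ("alice", "admin", "domain", "X"),
        ("bob", "member", "project", "p1"),
    ]
    for grant in plan_migration(current):
        print("add:", grant)
```

Because the function only ever returns additions and skips grants that already exist, running it twice against the same data produces no further changes.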
Now, a key point there is that it only adds the new roles. It does not remove the old roles. Once you have this set up, it turns out that you don't need that interim policy file at all. You can go directly to the final version that only respects the new roles.

Going forward, what happened is that our QA team, in our test environment, created four individual users, one for each of the four roles we wanted to migrate. Then, using the test framework called CloudRoast, they wrote one or more tests for each API listed in the Keystone policy. Then it's simply a matter of dropping the old policy file into the test environment, running the complete test suite across all four users, and getting the big list of passes and fails. Then you drop in the new policy file and run the exact same tests. Once those two lists of passes and fails match exactly, you can have a very high degree of confidence that, from the end user's perspective, backwards compatibility has been achieved.

The other advantage of this plan is that deployment is extremely simple. You run the script in production, and since, again, it only adds roles and never removes them, almost no one is going to notice. And deploying the policy file causes no disruption, since any modern version of Keystone does not even require a restart. So quite literally, other than your announcement to the users that at this date and time we're going to deploy Keystone RBAC, and your subsequent announcement that Keystone RBAC has been successfully deployed, this entire process should be completely invisible to your end users.
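The comparison of the two pass/fail lists described above, one run with the old policy file and one with the new, amounts to a simple diff. The sketch below assumes the results have already been collected into dictionaries keyed by user and test name; that layout and the name `diff_results` are hypothetical and are not taken from CloudRoast.

```python
def diff_results(old, new):
    """Compare two {(user, test_name): passed} mappings.

    Returns {(user, test_name): (old_outcome, new_outcome)} for every
    entry whose outcome differs or that is missing from one run.
    """
    mismatches = {}
    for key in sorted(set(old) | set(new)):
        before, after = old.get(key), new.get(key)
        if before != after:
            mismatches[key] = (before, after)
    return mismatches


if __name__ == "__main__":
    old_run = {("admin_in_domain", "create_project"): True,
               ("member_in_project", "create_project"): False}
    new_run = {("admin_in_domain", "create_project"): True,
               ("member_in_project", "create_project"): True}
    print(diff_results(old_run, new_run))
```

An empty result is the condition the talk describes: the two lists match exactly, so the migrated users see the same behavior under the new policy file.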
Now, the last thing I'd like to talk about is one of the goals for Keystone RBAC. We saw a situation in our cloud that perhaps you've seen as well: occasionally we would see end users with roles that we didn't really intend for them to have, and worse, sometimes we would see end users with roles that we never intended for any user to have, such as the service role that Karthik mentioned earlier. So one of the goals of Keystone RBAC was to lock down role assignment, and this is how we did it.

At the bottom here, you can see that both create grant and revoke grant have the same rule definition, can affect grants, which is right here. But ignore it for a second; it's made up of a couple of sub-rules. At the top, you have can affect domain level grants, which is very simple: do you have domain admin on this domain? Then we have can affect project level grants, also very simple: do you have domain admin on the domain in which the project resides, or do you have project admin on this project?

Now here, this probably represents the biggest unexpected challenge that we encountered: once a rule gets complex enough or long enough, it becomes significantly challenging just to read, let alone write or debug. I think if you take this exact text and wrap it at 80 characters, you end up with four or five full lines. And depending on how your rule is incorrect, perhaps you have mismatched parentheses or you misspelled role, often your only feedback is oslo.policy throwing an error that it can't parse your rule. That's a perfectly valid error, but from a rule implementation perspective, it doesn't give you a great clue as to how to go forward. The way we overcame this: we found that if you take that exact same text, copy it into a different file, and pepper it with a luxurious amount of white space, it becomes extremely simple to read. So can you affect grants? Are you a cloud admin?
Of course you can. Or, so long as you're not trying to affect cloud admin, and you're not trying to affect service, and you can affect domain level grants, go right ahead. Or if you can affect project level grants and you're not trying to affect domain level grants, go right ahead. Now, even being very familiar with these rules, I still have quite a tough time just reading this version, but I don't think there's anyone in this room, even if I hadn't just walked through it, who would have difficulty gaining a complete understanding of what this does in a very, very short amount of time. And with that said, I'd like to pass it back to Brad to speak about Horizon.

Thanks, Tim. Next I'll talk about the Horizon changes that we had to make just to make this work better for our users. Many of you may be familiar with how Horizon uses the policy files from the services to selectively hide and show content, like buttons and dashboards. So when you make policy changes to the services, you should really make those same changes to the corresponding policies in Horizon. That avoids issues like the user having access to an API but not seeing the button for it in Horizon, which can cause a lot of confusion. We've got a link to our Horizon policies up here. The Horizon policies are a little different from the service policies, just because there are certain things we specifically did not want to show in Horizon.

We also found that code changes were needed in Horizon, especially if you're changing rule names. This is one example: in the Keystone policy, we changed admin required to admin role. We did that in Keystone since we now had project admin, domain admin, and cloud admin, and we wanted a better way to refer to those things. Because that rule name is hard-coded in Horizon, we had to make a code change.
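Returning to the whitespace trick for long rules such as can affect grants: one way to keep both forms is to maintain the readable, heavily indented version in a separate file and collapse it to a single line for the policy file. The sketch below does that and also checks for mismatched parentheses, one of the errors mentioned in the talk. The function name and every sub-rule name in `READABLE` are hypothetical; they paraphrase the spoken walkthrough and are not copied from the speakers' policy file.

```python
import re


def collapse_rule(readable):
    """Collapse a whitespace-formatted rule into one line.

    Raises ValueError if parentheses are unbalanced. Parentheses inside
    quoted string literals are not handled by this sketch.
    """
    depth = 0
    for ch in readable:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unmatched ')'")
    if depth != 0:
        raise ValueError("unmatched '('")

    one_line = " ".join(readable.split())
    # Drop the padding just inside each parenthesis.
    one_line = re.sub(r"\(\s+", "(", one_line)
    one_line = re.sub(r"\s+\)", ")", one_line)
    return one_line


# Hypothetical sub-rule names, paraphrasing the walkthrough in the talk.
READABLE = """
rule:cloud_admin
or (
    not rule:target_is_cloud_admin
    and not rule:target_is_service
    and rule:can_affect_domain_level_grants
)
or (
    rule:can_affect_project_level_grants
    and not rule:target_is_domain_level_grant
)
"""

if __name__ == "__main__":
    print(collapse_rule(READABLE))
```

Keeping the readable file as the version people edit, and generating the single-line form from it, means a reviewer reads the easy version while the service loads the compact one.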
The next thing I wanted to go through is Horizon and using it with domain RBAC. We use Keystone V3 in our cloud, and we use it with domains. Keystone has for a long time supported the concept of a domain admin, where the domain admin can create projects and do anything they need to in their own domain, but can't do so in other domains. Until the Mitaka release of Horizon, Horizon didn't support this concept. So you couldn't restrict a domain admin and still allow them to do things to Keystone via Horizon. We've got a link here to the change, which some of us at Symantec worked on, that merged this support, and a link to a blog post about how we use it at Symantec. This is what allows the screen you see here, where a domain admin is logged in to their own domain. They can see info about their own domain only, not other domains. And they can affect projects in their own domain, but of course not in other domains. We found that the V3 sample policy file from stable Liberty Keystone was the best place to start. The stable Mitaka policy currently uses some things that aren't fully supported by oslo.policy. So start out with the stable Liberty policy.

Once we had made the changes to the services and to Horizon, we went through some growing pains with the users, who had to understand what they needed to do at this point. In our case, we decided to change one service policy at a time. That made it easiest for us to minimize issues with the dependencies between services. So we made our Keystone changes first. At that point, Keystone had roles like domain admin, domain operator, and so on. But we hadn't changed Nova yet, and Nova still needed the admin and member roles. So users needed the new roles if they wanted to do Keystone operations, but still needed the old ones for Nova.
So in many cases, they needed two or more different roles to do everything they needed to do.

Another thing we saw was issues that manifested in ways we didn't expect. After we made the Neutron policy changes, non-admin users could no longer get attributes on shared networks in other tenants. When they went to the network topology page after that, what they saw was a page that just loaded forever. We would have expected that if a user suddenly lost access to some API, they would see some error message show up in Horizon. But what we really saw was a page loading forever. So it took some figuring out that this was even related to RBAC, and then of course debugging and digging into what was actually going on.

Next I'll go through the lessons we learned going through this in production. One of the things we did very well was giving everyone the new roles before disabling the old ones. Both Tim and Karthik touched on this: we had scripts that looked at the capabilities users had with the old roles and gave them the new roles they would need to do the same things, before we switched the policy file. This is absolutely critical, because otherwise, if you make a change and break a lot of users, you'll be spending forever resolving these API access issues.

Also, communicating to the right people. In particular, for the users who had access to manage roles for other users, it was very important to document things so they know what to do, and to document the specific role needed for a specific capability. Then tell the users to read the docs, remind them to read the docs, and then remind them to read the docs again, because they will forget, and then they'll be blocked by an issue that's already documented. We found that changing one service at a time did lead to confusion, but it was worth it.
Again, this is how we avoided a lot of complexity with dependencies between services. You don't want to debug an authorization issue after changing multiple policy files in multiple services. Users could be down for a long time while you figure out exactly which APIs they no longer have access to.

We also found that focusing on Horizon as well as the CLI for validation was very important. When we were testing, we would make the changes to the service policies in a test environment, make the corresponding changes for Horizon in that test environment, and then test things that way, making sure everything works for Horizon as well as the CLI. Once you make your changes, many users are probably using Horizon, and they're going to come to you with questions like "why can't I do X via Horizon anymore?" as opposed to "why can't I access this API anymore?" They'll get some generic Horizon error, and someone is going to have to be available to dig into Horizon, dig into the logs, and see exactly what started failing for that user and what they need to get going again.

So we've gone through how we planned out these changes, what we went through in implementing them, and what we found when we deployed them. This is a good starting point, and you can go ahead and secure your clouds. Thank you.

We can take some questions at this time. There are microphones; please come up to the mics. I also wanted to mention that we are hiring at Symantec, specifically for OpenStack and cloud roles. So come to our booth, booth D8.

I think you said that when you were first doing this, you were doing it on Kilo, is that correct? When you were first doing the role-based changes?

When we started with Neutron, we were running Icehouse.

Icehouse, yeah.

So we were running Icehouse and Juno.
So when you made the changes in the actual Horizon code, how are you making sure they carry over as you upgrade? I assume you've done upgrades since Icehouse. Have you moved to Liberty or Mitaka? And are you porting those changes back into the code base when you do an upgrade?

When we did this, we were already on Kilo Horizon. Some of the services were still Icehouse, but we were already running Kilo Horizon. As for the code changes we needed, they ended up being very specific to our policies. There may be ways to make things more general. I think in general, in Horizon, we need to get away from hard-coded rule names. That's one reason we haven't done a lot of rule name changes: because of the hard-coded rules in the code base. So those haven't been contributed back, and it'll be a bit of an effort to do it in a way that generalizes to everyone's policies.

Thanks.

I saw you put some restrictions in, for example, the Keystone policy. Instead of just having the regular admin role, you also have project level and domain level. So in that case, I'm wondering, are you able to restrict a user of domain A? If you assign admin for domain A, this admin should not be able to do anything in domain B, for example, project list, right? Typically, project list requires admin. But if you're admin in domain A, you may also be able to list projects under domain B. So how do you isolate these two things?

That's a great question. The problem you're bringing up is exactly why we had to get away from the generic admin role. That's why we have the specific levels and the specific isolation. With the way that we have it set up, and please check out the policy file that's linked in our presentation, exactly what you described is solved. Someone who is a domain admin in A has no power at all over domain B.
They cannot touch or affect domain B in any way. That's everything that we wanted out of this.

So you mean, for example, with the Kilo version of the policy file, you are able to achieve that. You just manipulate the policy file, and if you assign a user as admin for domain A, he or she will not be able to touch domain B.

Yeah, exactly.

So you're able to achieve that.

Yes, please check out our policy file. Compared to the default V3 cloud sample, there's a lot of great functionality available in the policy language, and I'd love for you to check out what we actually did, because what you're asking about, we did that.

Okay. In the meantime, say at runtime you create a new domain. How are you going to handle that? You want to create a new domain, and you want to create a new admin for this new domain. Do you first need to go to the policy file and make a change?

No, the policy file is generic. There are no named domains at all in the policy file. The way you do it, it would have to be a cloud admin or a cloud operator who creates the new domain, and then a cloud admin or operator who assigns the first domain admin for that new domain. After that, that domain admin can set their own description and do everything you would expect, only in that domain.

Okay, so you have a generic role that we can assign in the new domain, and this new domain admin will automatically take on this generic role.

Yeah, none of the Keystone roles is tied to any specific domain. It's all generic.

Okay, got it, thanks.

Thank you. All right, if there are no more questions, thanks for coming. Please check out our booth.