My name is James Penwick, and I work at Oath. Oath is a house of brands, many of which I think you're all familiar with: Yahoo, Huffington Post, AOL, and quite a few more. We are passionate believers in building secure infrastructure. This is something that we're really trying to make part of our DNA.

I'm going to talk to you about zero trust, so let's start with what zero trust is. Zero trust is a security concept where you do not implicitly trust anything that is within the boundaries of your infrastructure. Instead, anything and everything must be explicitly understood, permitted, and trusted.

An example of implicit trust, something that many of us see pretty much every single day, is the SSH host key. You go to log into a machine for the first time and it says, hey, this machine has an identity. Do you trust it? What do you say? You say, sure. We all do it. It's okay. Well, it's not okay. But it's a problem that can be solved, and it's one that I think, working together, we can all solve. This is actually something that we have built a solution for, which I'll touch on a bit later. But this is more of a metaphor for how many companies do business. We all do it.

So what do most companies do when they need to control access to their resources? Security is important. I need to make sure that everything is attested and that it's trusted. And to make sure that I don't spew too much security-ese, I'm going to define those terms. Asserting something, and it's a term that we use a lot, is making a strong statement. For example: my name is Jonathan Bryce and I'm the executive director of the OpenStack Foundation. This is an assertion I've made. You can trust it. I wouldn't recommend it, but you could. Attesting is the act of validating an assertion. I am going to show you my ID. I can't do that. So why does this matter?
We want to protect all these different parts of our infrastructure, and without the right tools in place to do it, we do the best we can with the things we have. One of those things is network ACLs. I can't establish trusted application identity between two endpoints, so I'm just going to go to the firewall on my network and make sure that only a single port is open between these two servers. Boom. There. Solved it. Except now we have an implicit level of trust, because if I have network access to this host, I am assuming that the access is permitted.

Honestly, it kind of works on a small scale. If you have one host talking to one host, that's one thing. But when you get to a large-scale infrastructure, where hundreds of hosts need to talk to hundreds of machines over here and dozens of machines over there, you now have something much larger than an n-squared problem, which means that the mechanisms you use to control these network ACLs start to fall apart. You start to have what we sometimes call ghost ACLs: abandoned and orphaned terms in our firewalls that aren't supposed to be there, meaning someone could boot an instance and it now has a level of network access it was never meant to have. So we've created this process-heavy situation that gives us a false sense of security but isn't actually protecting us. There are other issues, too. For example, physical firewalls have a limit on the number of rules they can hold, and I assure you, you can hit it.

So you say, okay, I'm not just going to do that. I'm going to add another layer to this onion. I'm going to make sure that things talk to each other using a headless user. Boom, solved it. Well, headless users must have some way of gaining an identity, right?
So we're going to take something that was designed to represent a human, and we're now going to use it to represent a service. Okay, well, I need to have a password, which means I need to put the password somewhere. So now I've got 200 machines that all have access to the same password, so if that gets compromised, boom, done. And if you've ever hired an intern, you will learn that passwords go into Gists, they go into GitHub, they get committed into code. These are things that happen. As soon as that happens, what do you do? You change your password on every single OpenStack cluster around the world, right there on the spot, breaking the service. This is a personal scar.

Another way of doing it, which is actually related to the prior one, is a shared app secret. These can be a different form of application credential, they can be something like an X.509 certificate, there are a bunch of different ways, but it all comes down to the same problem: yet again you're taking this thing, putting it somewhere, and giving everything access to it. So that's not secure.

And even then, if I touch back on the network and you say, no, no, no, I've got this, my firewall rules are perfect, I have this all sorted out, nothing can go wrong. And then you learn about ARP spoofing, and something can definitely go wrong. Your IP address... in fact, I don't know if it's happening here, but anyone from the U.S. knows your phone number means nothing anymore, because I've gotten a call from my own phone number. What's that? Exactly. The phone number is just a string of text. I've been getting advertising calls from a number of numbers, and the owners of those numbers do not know that their phone number is being used. Okay. It's a metaphor. So that's good to know. We need to adopt that ourselves.
Okay, so you've got shared passwords, you've got shared application credentials, you've got unnecessarily complex firewall rules that are actually creating more problems than they're solving. You've got this false sense of security. Ironically, all of this inevitably comes with a heavy bureaucracy, which then tends to reinforce this false sense of security. It's like, well, we jump through a lot of hoops, so it must be secure.

So I'm now going to make an assertion. Zero trust security really is a less-is-more thing. If you can provide a secure, attestable identity, if you can provide policy, if you can provide all the things you need to allow your things to talk to each other, and you can provide those on a self-service and dynamic basis, then you can get rid of some of the things we've done to our physical network and some of the policy and process we've put in place. We can do business faster and smarter, and it's more secure.

The three things I'll touch on for zero trust are authentication, authorization, and microsegmentation.

Authentication for humans is kind of a solved problem for the most part. For web authentication, you've got different services: Okta, LDAP, Active Directory, Kerberos. There are a lot of different things out there that, one way or another, with 2FA or not, you can use to authenticate a person. The same does not exist for services, which is the point I've been building to here. So how do we move forward?

Athenz is something we've built to give an identity to principals, where a principal can be a service, an application, or a person, and to provide role-based access control and policy to grant access to resources.
It allows you to give something an identity, and it allows you to express what these things can do in relation to each other. We present this credential in a couple of forms, but the main form is an X.509 certificate. Every service in your cloud has an X.509 certificate. For those not familiar, an X.509 certificate is the same kind of SSL certificate that you use to run a web server. Every time you go to an SSL-encrypted website, which is pretty much everything these days, the site presents an SSL certificate. Effectively, they have a private key and a public key, and they're presenting that public key to you, signed by another trusted party, in this case a certificate authority whose public key is already on your laptop. A certificate is effectively a public key and some metadata, signed by that trusted authority.

Certificates are not just for servers; they're also for clients. When you combine the two, you have something called mutual TLS, in which you as the server present yourself as a server and I as a client present myself as a client. We can both attest each other's identities. As a metaphor, this is like two people walking up and showing their passports to each other before they do something. This is who I am, that is who you are.

These certificates are also short-lived, because it doesn't do me a lot of good if I'm giving everything a certificate and that certificate is good for five years. If someone steals that certificate, it becomes a bearer document, and they can go and impersonate you.

So Athenz is open source, and I'm going to keep talking about Athenz. We've built another system called Copper Argos, which is an attestable identity bootstrap mechanism for OpenStack. And again, these are all pieces that we've open sourced.
I think there are a couple more pieces still to open source, but we can talk about that more at the end. I have another talk on this system where I go into much greater depth. If you search for attestable instance identity from the previous summit, Vancouver 2018, you can see another talk that Mujib and I gave on that. So I'm just going to gloss over it here: this is a mechanism that allows OpenStack to interact with Athenz to procure and distribute an identity for your instances in a very secure way. The consequence of using this is that when your instance boots through OpenStack, it has its own SSL certificate, its own identity, which is automatically rotated every 24 hours, and which also, obviously, automatically expires.

So that was authentication: who are you? Authorization is: what are you permitted to do? To cover this, I want to talk about the Athenz data model a little bit. And by the way, I know that there's a lot of overlap with Keystone here. It's unfortunate, but that's just how it played out.

Athenz has a concept of domains. A domain is a named thing, like a namespace container, and a domain can have a number of services listed under it. You have a principal, which is either a user or a service. You have a role, which is a list of principals: a list of users, a list of services, or a mix of both. You then have policy, which asserts an action on a resource. The assertion is grant or deny. The action could be something like allowing an update to a website or allowing you to upload data, read-only versus admin. And the resource is an arbitrary string that you provide and that your application interprets.
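To make the data model just described a little more concrete, here is a minimal Python sketch of domains, roles, and policy assertions. It is not the project's real API: the class names, the `is_allowed` helper, the exact-match comparison, and the rule that an explicit deny always wins are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Assertion:
    effect: str    # "grant" or "deny"
    role: str      # role the assertion applies to
    action: str    # e.g. "admin", "read", "update"
    resource: str  # arbitrary string; only the application knows what it means


@dataclass
class Domain:
    name: str
    roles: dict = field(default_factory=dict)    # role name -> set of principals
    policy: list = field(default_factory=list)   # list of Assertion

    def is_allowed(self, principal: str, action: str, resource: str) -> bool:
        """Return True if some grant matches and no deny matches."""
        granted = False
        for a in self.policy:
            in_role = principal in self.roles.get(a.role, set())
            if not (in_role and a.action == action and a.resource == resource):
                continue
            if a.effect == "deny":
                return False  # assumption: an explicit deny always wins
            if a.effect == "grant":
                granted = True
        return granted


# Hypothetical domain, role, and principal names.
website = Domain("website")
website.roles["site-admins"] = {"user.ian"}
website.policy.append(Assertion("grant", "site-admins", "admin", "frontpage"))

print(website.is_allowed("user.ian", "admin", "frontpage"))   # True
print(website.is_allowed("user.jane", "admin", "frontpage"))  # False
```

The point of the sketch is only that the policy engine never interprets `resource`; the application decides what the string `frontpage` guards.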
So if I say that I'm going to grant Ian admin access to this element of my website, then when I build the website I make sure to check: okay, if the user has this policy, then I'm going to permit them to do this thing. I'm just trying to be clear that the resource is an arbitrary thing that you choose inside your application.

There are a few different ways you can leverage something like Athenz. This is an example of a centralized model. Let's say that I have a configuration management service, and I want to grant one of the entities in that service the right to change the max heap memory setting to 8 gigs, however I've chosen to implement this. As the admin, I communicate with Athenz and set that policy. Then, whatever this configuration service is, when this node comes in and attempts to make that change, the service calls Athenz and says, hey, is this allowed to happen? It gets a copy of the policy back and validates: yes, okay, this is acceptable. Move on with my life.

The interesting thing here is that when this client connects, it's doing so over SSL. Because the client has an SSL certificate that shows its identity and the server has an SSL certificate that shows its identity, there's mutual TLS, a mutual trust exchange. I totally blanked there for a second. What's interesting is that the service manager is able to extract any additional policy information it needs from the client's SSL certificate. So I don't just know who you are; I know what roles you've been added to for this domain. It's able to use that when it calls Athenz to do a lookup.

So that was the centralized model, where every time something comes in, you're making a call to Athenz. The advantage is that you don't have to wait on propagation delays.
The disadvantage, of course, is that you're making an extra call whenever something comes in.

Next is the decentralized model. This is where Athenz policy is distributed to your nodes. In this model, I've got some secret management system; let's say it holds keys or something. The domain admin says, you know what, I'm going to grant John Doe access to secret X. And I'm done. John Doe then makes a call to Athenz and says, hey, I'd like my identity, please. That could be in the form of a token, or again in the form of a client certificate. John Doe then calls this secret management system, whatever it is, which has since synchronized its copy of the policy. He passes his token. The system is able to look locally and validate: yes, I see that there's a policy granting him access to this thing. It's all good, and he's able to get that secret. The advantage here is that you're not making any off-box calls. The disadvantage, of course, is that there could be a slight propagation delay.

The third model, which is something we're leveraging in our newest OpenStack environments, is a federated use of Athenz. If any of you have been paying attention to the open edge MVP stuff, this is actually the model. We are working with the Keystone team to help bake the generalized concepts natively into Keystone. In this model, we use Athenz policy to create a role that is delegated to another tenant, to another domain. As that tenant, you're allowed to manage your own users within that role. When you get your token and present it to OpenStack, OpenStack is able to validate the token's signature, because it has a copy of the Athenz public key.
So we can validate the contents of that token. We can then say, okay, I see your username is Jane. In fact, I can walk through it from here. Jane calls Athenz and gets her token, which contains just her name, her domain, and the list of roles she's been granted; in this case we actually bake the Keystone role in as well. She then calls Keystone. You'll notice that nowhere else on this screen is there the Athenz service. She got her token, and she's done talking to Athenz. She calls Keystone and passes her token. Keystone is able to validate it: okay, I see that you have a signed token from an entity that I trust, so I trust that your username is Jane, and I trust your tenant name, because she's passed OS_PROJECT_NAME with the value foo. It says, okay, I can see that yes, you've been granted access to this tenant. It then looks locally and says: does the user Jane exist? If not, create her account. Does the project exist? If not, create it. Does the Keystone role association for Jane and that project exist? If not, create it. So the act of typing openstack server list means that Jane has automatically had her account created and her tenant created, and all of that is taken care of. There are no propagation delays, there's no waiting, there's no external tooling needed. You enter the line in policy, she gets her token, and she's off and running.

I'm going to take a little step to the side here to set up the next thing. You have all of your identity: your user has an identity, you can use it to talk to OpenStack, and you can boot your instances. Your instances come up and they all have their own unique identities. But how do you understand the grouping of all these instances? How do you say, I have a service like Yahoo Mail; what are all the servers that comprise Yahoo Mail?
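Stepping back to the federated flow for a moment, the "does it exist, if not create it" steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Keystone code: `LocalIdentityStore` and `provision` are hypothetical names, the token is assumed to have had its signature verified already, and the `<project>.<role>` naming of delegated roles is assumed for the example.

```python
from dataclasses import dataclass, field


@dataclass
class LocalIdentityStore:
    """Stand-in for the local user, project, and role tables (hypothetical)."""
    users: set = field(default_factory=set)
    projects: set = field(default_factory=set)
    assignments: set = field(default_factory=set)  # (user, project, role)


def provision(store, claims, project):
    """Create the user, project, and role associations on first use.

    `claims` is assumed to come from a token whose signature has ALREADY
    been verified; no cryptography happens in this sketch.
    """
    prefix = project + "."
    roles = [r[len(prefix):] for r in claims["roles"] if r.startswith(prefix)]
    if not roles:
        raise PermissionError(f"{claims['name']} has no role on {project}")
    store.users.add(claims["name"])   # adding to a set is a no-op if present
    store.projects.add(project)
    for role in roles:
        store.assignments.add((claims["name"], project, role))
    return roles


store = LocalIdentityStore()
jane = {"name": "jane", "domain": "example", "roles": ["foo.member"]}
print(provision(store, jane, "foo"))   # ['member']
print(sorted(store.assignments))       # [('jane', 'foo', 'member')]
```

Because every step is idempotent, running the same request twice leaves the store unchanged, which is what lets the first ordinary API call double as account creation.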
So we have something called service mapper. You run an agent on your instances, and that agent heartbeats using mutual TLS, so the act of connecting over mutual TLS causes it to identify itself. It sends its heartbeat, which includes a variety of metadata, like its IP address, its hostname, things like that, and then it's done. By the way, the IP address and the hostname are actually in the certificate, so this isn't something that we're just trusting you for, because we don't have implicit trust. It's baked into that certificate, so we know that something has validated that the hostname, the IP address, and the OpenStack UUID all belong together, and that's what it's passing up.

Another interesting thing is that when you delete an instance, we don't tell service mapper. If an instance goes away and a certain number of heartbeats are missed, we just remove it. That becomes more interesting later.

The final thing I'll call out is that you can either call service mapper and say, give me a list of all instances that have this Athenz service identity, or you can say, give me a list of everything in this Athenz service identity, and I'm also establishing what we call a watch, which is sort of like a message queue. Every time a new instance appears, you get a message. Every time an instance goes away, you get a message.

Now, how is this relevant to zero trust? Microsegmentation. I already defined it here, but I'm going to say it anyway: it's a security concept where you apply the principle of least privilege to all resources on your network. It's very common to say, for everything in this network backplane, I'm just going to open port 80 to the world because that's easier.
Or we say, for everything in this network backplane, I'm just going to open the network so they can all talk to each other on any port. As we covered earlier, that's intrinsically insecure. Microsegmentation is saying, no, we're not going to do that. Every single host is going to run its own host-based firewall and only open the ports that are needed to the other servers that need them.

This is where service mapper comes into play, because now we have a policy enforcement service where you can define policy for a service. You can say, I need to allow all the members of this service to talk to all the members of this other service. So my front-end servers need to talk to my database nodes on 3306, or whatever the SQL port is. Rather than having to do a push to a firewall device, the act of the instance coming up and heartbeating causes PES, our policy enforcement service, to see that a new thing has joined this service. It calculates the new rule set and pushes it out to all of the hosts in the front end and in the back end.

The firewalls are all implemented via iptables on the hosts themselves. In our newest OpenStack environment we're using Neutron security groups, but that's a VM-only thing. In our older VM clusters, which are all pretty old, we're not using security groups yet, and in our bare metal clusters we're not using security groups yet.

So what's next? Athenz is out there. It's open source. You can use it to provide an immutable instance identity to everything. You can use it to build trusted communication between all resources in your cloud. This is something that we are continually in the process of baking into our DNA, making our infrastructure better, faster, more resilient and, most importantly, more efficient. If you're interested in Athenz, it's open source. I've got the website up here.
There's a variety of resources. Pull requests are accepted. We'd be delighted if you wanted to use it and to contribute things back. If you find anything wrong, open an issue and we'll take care of it. There's an upstream Slack channel. Otherwise, if you have any other questions, please come up and ask them.

So, for example, let's say I want to give someone the member role on a tenant. Then the delegated role that I create inside Athenz is your tenant name, your project name, dot member. If I want to give you admin access, then I'm probably going to add you to the admin dot admin role. That one's a little bit of a weird case. In that case, what happens is that if the role doesn't exist, there's an error. That was a long walk for a short drink of water, sorry. We create your user. We create the tenant. We go to create the role association, and there is none. I think in that case you would inherit the role association; I think that works now in Keystone, at least in the version we're using. You don't get to be magically root.

Just a quick question. Is the service mapper running within a container, a VM, or on bare metal? Where does the service mapper agent have to run?

You can run it in the container or in a sidecar container.

What are the requirements for it to run in the container? You said it uses iptables to secure itself. Does that mean I have to give it a lot of privileges?

The service mapper agent just sends a heartbeat. iptables is actually managed by a separate agent. All the service mapper agent does is chirp out and say, hi, this is who I am. Rules get calculated and set down somewhere else. We have not solved the problem of iptables on our container bare metal hosts yet.
That's something the team is working on: being able to create a local table for each container. That work is in progress.

One thing I'm wondering is what that does to the message payloads. From one of the earlier slides, and I apologize if I didn't get the terms right, I seem to remember there was some authorization catalog or something that was being passed across. I also seem to remember that in some of the recent Keystone iterations, they were very focused on reducing the size of these tokens and on reducing the performance impact.

I'll try to unpack that a little bit. There's the Athenz token and there's a Keystone token, and you take your Athenz identity and use that to get a Keystone token. I call Athenz, and I call the getTokens API. It returns a Fernet token to me, just a normal Keystone token at that point. Future work for us does include investigating whether we should continue to return the Fernet token or return nothing. From there you would pass your Athenz token, and it would be validated locally using Keystone middleware.

So I think, if I followed right, you're reducing some of the security threats that the beginning of your slides mentioned, with shared secrets and passwords and things like that, by creating these short-lived, very scoped tokens, which I think you refer to as principal tokens. What is the bootstrapping process to get a principal token? I feel like there's a chicken-and-egg type issue.

Let me see if I have that in the appendix. Just one thing: a principal is an attested identity. It's a thing. Let's say that we want to boot a new instance. To boot this instance, I'm even going to cover something that's not quite on this page. I pass two metadata items: one is the Athenz domain and one is the Athenz service.
When I run the OpenStack client, and we've added this to the client, it will run out to Athenz and, using SSH CA integration, fetch your Athenz token for you, the user. Now, on the server create, it does its usual validation of your call. You call launch instance. What happens is OpenStack says, you have asked me to boot an instance and bootstrap it into this service. It makes a call to Athenz and says, is this person allowed to boot an instance into this service? Athenz says yes.

Next, OpenStack talks to an intermediary service we have called hostsignd. It says, hi, I'm going to create an instance. It's going to have this UUID, this Athenz service, and this Athenz domain. Please give me a bootstrap document. This bootstrap document contains basically only that information, which is just the OpenStack UUID, and it has a TTL of about 30 minutes as its expiry. That is then injected into the instance. Sorry, I'm doing this off the top of my head instead of following along with the numbers.

When the instance boots, it has something called the service identity agent; we call it SIA. It's running on the host. It takes its bootstrap document and says, hey, I'm a brand new instance, and I would like my own X.509 identity. It creates its own private key locally on the instance. It calls Athenz: please give me a certificate, here's my public key, here's a CSR I've just created. Athenz calls back to OpenStack and says, hey, this UUID just came to me and asked for an identity. It has a valid bootstrap document. I have to attest the assertions that were made in its request, because it says it has this hostname and these IP addresses. The reason we don't put that information into the bootstrap document is that OpenStack doesn't know it yet when it goes to boot the instance. This way the instance boots, figures out its own IPs, calls Athenz, and says, give me a certificate.
Here's my CSR; I've listed all my IPs in it. Athenz validates all that information and says, okay, this is all legitimate. It then mints the certificate and passes it back to the instance. So now the instance is up. Oh, yeah, I glossed over something: when the certificate is signed, it's signed by an HSM that's in, like, a Faraday cage, and all sorts of cool cloak-and-dagger stuff.

So it has its certificate. Probably the next question is, okay, how long is the certificate good for, and what do I do when it expires? The certificates right now are valid for longer than you'd like, because we're trying to build some trust in the system and get people familiar with it. But the nice thing is that this is a knob we can turn down, because instances refresh their certificates every single night. We can make that as frequent as we want. The certificates are good for many days, and the reason is that if anything happens while we're still expanding and turning the knob, we have time to address it, because once your certificate has expired, that's it, you're done. You need to re-bootstrap with another, manual process to get that engine started again.

But okay, what happens every night? It's pretty much the same thing. Your instance creates a CSR. It still has the same private key, although we are going to be adding a feature to change private keys, because why not? That's not happening right now. So every night it runs and creates a new CSR. You can also run this manually if, let's say, you've added an IP to your Neutron port and it's on your box and you want to make sure that takes effect. Not a problem: you just rerun SIA, it fetches a new certificate, and everything happens from there. So it creates a CSR and calls Athenz. This time it's using mutual TLS with its current certificate, but it requests a new one. Athenz signs a new certificate and passes it back.
It then also records the serial number of that certificate and keeps track of it for the validity period of the certificate, so that no one else can come in, grab an old certificate, and say, I'm going to go ahead and request a rotation. So that problem is also taken care of.

So the question was, am I effectively taking the user principal identity and, through this process and this chain of trust, granting an instance its own identity? I think from one perspective, yeah, that's succinct. I have the right to declare that this thing should be granted an identity in this domain, and at that point I'm out of the loop. It's the same way that someone at the U.S. passport office has the right to say, okay, I've looked at his documents, I'm going to stamp these, he now gets a passport. Now I have my own identity document.

This, at least at a high level, seems to overlap with what NovaJoin is supposed to do. Could this be implemented as sort of a NovaJoin plugin? Do you know? Have you even looked at that?

I honestly don't know. Someone said NovaJoin to me yesterday, and that was the first I'd heard of it. I haven't had a chance to look it up, so I have to admit I don't know.

So once you've done this bootstrap and the instance has an identity, do you typically limit the roles and permissions that that identity is granted? Let's say you allow those to live for a day, and the instance uses that initial identity token to then request further, more fine-grained permissions, and those finer-grained permissions could have shorter TTLs, like five minutes.

That's actually a really great question. So yeah, you can use your identity, your authentication principal, to request a role token or a role certificate, which is your authorization certificate. That has a different TTL, and you can also specify a shorter TTL.
Now, I will note that it is not possible at this time, that I'm aware of (maybe the team added it and I didn't know), to put a limit on anything requesting a token from this role. In my case, I have a domain that I'm delegating roles out of, and I'm not able to say that any token you grant from this role can only be valid for five minutes. But your question alone makes me think that would actually be a really good feature.

Microsegmentation seems like obviously a very good idea, but it's fiendishly complex to program at distributed scale. Are you addressing this problem?

Actually, yes, we are. We have something called the policy enforcement service, where we can create policy. It's somewhat of an ouroboros, because it uses Athenz policy, or a mechanism similar to Athenz policy. We have the ability to say, okay, I've got a workload group, and the workload group is effectively a list of servers. That workload group is populated by service mapper, so that's how you get a list of all the hostnames and IP addresses.
You then have a policy object which says, I'm going to allow this workload group to talk to this workload group on these ports and these protocols: TCP 22, TCP 80, UDP 53, whatever. When membership changes in the workload group, we take this policy, and we have a system that calculates the iptables rules and distributes those rules out. I'll say the words configuration management system, though that could be misleading: we have a client-server system specifically designed just to distribute and implement these rules. So the iptables rule sets get pushed to all the impacted hosts. They get a push, they pull down their new rules, and they apply them to iptables. As long as my memory isn't too rusty here, I think we write a new table and then flip that table in, so it's an atomic swap, and we keep the previous table so that if anything went wrong we can swap back.

I think I'm actually at time, so yeah, sorry folks, I wasn't paying attention. All right, thank you.
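
As a rough illustration of the rule calculation described in that last answer, the Python sketch below expands workload groups and a policy object into per-host, iptables-style rule strings, then recomputes after a membership change. The function name, the dictionary layout, and the IP addresses are all hypothetical, and a real deployment would also need a default-drop policy and the atomic table swap mentioned above, neither of which is shown.

```python
def compute_rules(groups, policies):
    """Map each destination IP to the sorted iptables-style rules it needs."""
    rules = {}
    for policy in policies:
        for dst in groups[policy["to"]]:
            host_rules = rules.setdefault(dst, set())
            for src in groups[policy["from"]]:
                for proto, port in policy["ports"]:
                    host_rules.add(
                        f"-A INPUT -s {src} -p {proto} --dport {port} -j ACCEPT"
                    )
    return {host: sorted(r) for host, r in rules.items()}


# Workload groups as they might be populated from heartbeats (made-up IPs).
groups = {
    "frontend": ["10.0.0.1", "10.0.0.2"],
    "database": ["10.0.1.1"],
}
policies = [{"from": "frontend", "to": "database", "ports": [("tcp", 3306)]}]

before = compute_rules(groups, policies)
groups["frontend"].append("10.0.0.3")   # a new instance starts heartbeating
after = compute_rules(groups, policies)

# Only hosts whose rule set changed need a push.
changed = [h for h in after if after[h] != before.get(h)]
print(changed)                  # ['10.0.1.1']
print(len(after["10.0.1.1"]))   # 3
```

Comparing the old and new rule sets per host is one simple way to limit pushes to the impacted hosts; whether the real system works this way is not stated in the talk.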