Good morning. I can't see all of you, but I hope you were at good parties last night. I heard they were great; I saw pictures on my team's WeChat group. My name is Subbu Allamaraju. I haven't had coffee this morning, so I'll try to speak loudly and stay energized looking at all of you. I know it's the last day of the summit, and Keystone is probably not the most exciting thing happening in cloud; you might consider it one of the most boring projects. But if it really were boring I wouldn't be talking about it, and I'm sure you wouldn't be here. So I'm going to spend some time on how we use Keystone at eBay, what we've been doing to make it really work for our cloud, and why we took the approaches we took. The hard work for this talk was done by two of my colleagues: Davi Ding, the lead for our IaaS layer, and another colleague, who leads the identity and access management project at eBay. They could not be here, so I took over the talk, and I've done my best to stay truthful to the topic.

So why is Keystone at the center of the universe? What has been happening at eBay, and I'm sure throughout the industry, is that the control plane is becoming more important than the data plane. In a traditional enterprise five years ago, it was all about filing tickets and spinning up VMs, and after that you forgot about the control plane because things were just running. Now we are getting to a time where the cloud is changing very often, traffic on the control plane is increasing, and the APIs are getting more and more important, because availability, security, and provisioning are all happening programmatically. To make this work at scale you need a rock-solid control plane, and that's the realization a lot of us have come to in the last few years.
That's why you see folks spending a lot of time on HA for the control plane, even more than for the data plane.

So what does our universe look like? Before I get to that, I want to make a couple of disclaimers. First, some of this work is somewhat dated: it is based on the Havana release of OpenStack, and we upgraded later on. Second, some of the enhancements we made to Keystone for identity and access management are still internal. We have not yet been able to submit the blueprints and push them upstream; we hope to start that process soon, but the team is still busy productizing those features. So that's my disclaimer.

So what does our universe look like? As I said, the control plane is extremely important for our business at eBay. For example, a developer can come to the cloud and say, "Deploy my new app with this CNAME and make it available everywhere," and about thirty minutes later the code is running everywhere. That means a lot of provisioning and access activities happen throughout the control plane, not just in the OpenStack compute layer but in the other platforms that exist above it. So I want to go over what our universe looks like and why Keystone is getting more and more central to what we do. We have this notion of availability zones.
This is pretty standard in every cloud deployment today, and most of us have more than one deployment of OpenStack. We took this very seriously: in our data centers we have a number of these availability zones, each totally decoupled from the others, and each AZ is a full OpenStack deployment. We have completely automated the cloud itself, so that once the metal is provisioned and the network is racked and cabled, I can use scripts to bring up a new AZ. The idea of an AZ is a coarse-grained fault domain: if something goes wrong in one AZ, it does not impact another AZ. That is a promise we try to maintain for the business. If an application wants to be available and resilient to failures, it is expected to provision across multiple availability zones. Earlier we used to think of a rack or a half rack as a fault domain; now we have moved to a much larger, coarse-grained fault domain, the availability zone. When you have multiple availability zones, "Where can I find the cloud?" becomes a common question. Which endpoint should I use? What is the protocol? How do I log in?
Do my credentials work everywhere? You now have many services in the service catalog, and you have to know the endpoints to use them. And a lot of provisioning activity happens across AZs: when there are multiple AZs you want to spread your code and your application around, which means you are orchestrating load balancers, DNS, and provisioning across the globe, across all the AZs. You want to keep that consistent, and you want to maintain users and roles consistently across all these AZs. Those are the problems we started with late last year.

The second complication is the notion of a VPC. Our fundamental principle in building the cloud at eBay is to treat infrastructure as a shared resource, not a dedicated one. That holds whether you have dev/test workloads or production workloads of different kinds; even within production there are different flavors with different policies and expectations. So we borrowed the term "virtual private cloud" from Amazon back in 2012, but we went beyond Amazon's definition, which is an isolated, network-segmented zone within Amazon's public cloud infrastructure. We also define policies of who can do what, where, and whether a person can perform those actions or not, all in the context of a VPC. We don't have many of these VPCs, about ten, but they all have the same services. When we launch a service or a feature, it's there for dev, it's there for prod.
It's there for everything. We don't launch one feature for dev only and another for prod only; we want to maintain that consistency, so that when somebody builds a platform running on our cloud infrastructure, they get the same behavior: if it works through some APIs in dev/test mode, it has to work in production too, provided you conform to the policy. That's how we approached the cloud.

The VPCs do have different authentication and access policies. For example, if I'm doing dev/test, I might use single-factor authentication, just my corp login and password. But if I'm working on a mission-critical application that matters to the business, that change may need two-factor authentication, so we have to support that. Moreover, there are cases where machines, not human beings, are doing the operations, and then corp credentials and two-factor authentication don't work at all. We need to support all those flavors of use cases in a multi-tenant cloud infrastructure like this.

What makes it even more interesting, and those of you who attended eBay talks before may know this, is that we have taken an approach where our APIs are open to all our developers. We want to make sure there is one way to do authentication for all cloud services, not five different ways. First of all, it's not secure if everyone is collecting credentials; you want one entry point for authentication, one way to generate tokens, one way to revoke them, and a consistent policy. So as we added more and more platforms on top of our basic cloud primitives, compute, storage, and network, we started migrating
So we have arrived at this model cause a tested control plane Which is the set of apis services You use on the site for provisioning software deployment monitoring and remediation So all the apis the complete life cycle of operations that you do for cloud Are part of this tested control plane and and they are apis for provisioning like the like the pass layers We have built homegrown accessor service for different use cases like you know elastic search and and and Caching and things like that Some of these may end up in open source. Some of these may not we are still going through developing development and and production productizing the those services these kubernetes we are investing in kubernetes as a layer as a cluster management layer on top of open stack primitives And that too needs credentials and and have a consistent policy of access control And there are a ton of other homegrown cloud services that we have in every way And there is headless access people that don't have their machines operating With the cloud infrastructure of the with the apis. So all these layers exist above Around our keystone infrastructure So our control plane started with the three four services three years ago. Now it's it has many many services It doesn't even fit in half a rack anymore It includes object block computer network and and all these other services And in addition we have services below the the cloud which are operational services They also need the same kind of things like who can create networks Who can onboard networks who can provision hypervisors who can decline her for hypervisors Who can evict hypervisors the whole bunch of tools and operators using it? 
So again, we want a consistent policy for identity and access management. That led, about a year and a half ago, to the realization that we have to make Keystone the center of our universe, and that it has to be global. If I create tokens in one place, I need to be able to use them everywhere; there should be one set of policies, governed and managed in one way; and there should be only one entry point for authentication and access control for the cloud, not five different ways of authenticating. It also has to be available: since every operation in the cloud is authenticated, except the act of getting a token itself, IAM itself must be available globally, and secure, for obvious reasons. And all of this without actually managing users, because we still don't manage users. Other systems in the company do that: corp LDAP and other LDAP systems, managed by teams with their own policies for how they store and rotate passwords. We don't want to get into the business of doing all that work; we want them to keep doing it. The same goes for two-factor authentication: we don't manage it, but there are systems that do. So, without managing users, we wanted IAM built in at eBay, serving untrusted cloud users and semi-trusted cloud services in the control plane. That's how we started, and that's how Keystone became so important in our stack. In fact, if Keystone is down, the cloud halts; every part of the cloud halts. The VMs keep working, but how good is that if you can't change anything? So we made three important changes to the way we deployed the cloud.
When we started our initial OpenStack deployment there was one AZ and we didn't care much, but as the number grew over time these activities became very important. The three things are: a global Keystone and the notion of a trusted control plane; second-factor authentication; and the notion of API keys. As I said before, these are extensions we made in-house; they are not public and not yet contributed upstream, and we have some work to do in that respect. So let me go over the how of these features, and then I can open it up for questions.

The design we used for Keystone is fairly straightforward. By the time we thought about this, we had already launched a couple of availability zones, each using its own Keystone instance, with users and projects already on them. That meant we had to work out a model where we could migrate the data without impacting our existing users, projects, and their resources; that in fact took the longest time in the whole process. The design itself is straightforward and repeatable. At the bottom of the picture you see the databases, which are MySQL instances: each availability zone has its own MySQL cluster, and we use Galera to replicate certain tables globally. Now, there are trade-offs in this process. The first issue was that we started with the assumption that we only needed to replicate a couple of tables, users and roles, and we would see what happened.
We started digging through the code and looking at all the dependencies. We wanted to set a design constraint that we would not touch tokens: leave the tokens where they are and see if it works. Then we went into the code and realized, oh crap, tokens are everywhere, and the coupling is so strong that you have to replicate tokens too. And tokens are a problem because not every client accessing the control plane is smart enough to cache tokens; many just ignore the token they already have and create a new one every time they try to do something. So there's a lot of traffic. A recent peak was about 8 million tokens in a single day, fresh tokens minted on that day. We were surprised; we were not expecting that much traffic on the control plane, and it has been growing month over month since we launched early this year.

Each MySQL cluster has a load balancer VIP, and on top of that we have DNS; we use a commercial product for the DNS entries. The DNS is configured so that when an AZ looks up MySQL it finds the nearest VIP, which is the local one, and does its reads and writes there. The model repeats above that: the Keystone server clusters in each AZ sit behind a load balancer VIP, with another DNS entry on top. So from the user's point of view there is one single entry point. We may add or remove AZs, we can fail over, and the user doesn't see any of it or get impacted; all the user sees is one global entry point for Keystone. That makes it easy to discover and to use the same token everywhere, in every AZ.

Now, the trade-offs. Galera is a multi-master replication system using certification-based replication, so writes take longer: there is a penalty because every write has to be propagated everywhere and checked for conflicts before being committed or rolled back. So write cost increases, but reads don't pay the same penalty because they are served locally from the local databases. That's a trade-off we built into the picture knowingly, early in the process.

Looking at recent traffic trends (I removed the graphs because they showed more data than I wanted to share), we see roughly ten new tokens per second on average, with peaks of up to 100 tokens per second. Initially we had some surprises, like Keystone going down, so we had to figure things out and put in rate limits and so on to ensure Keystone survives the traffic. The write latencies are high; that is not ideal and we have to think of ways to improve it. Part of it also depends on the underlying infrastructure, whether that's LDAP or the two-factor authentication system, so the overall cost is high, and the Galera costs add up on top. The read latency, if you go directly to Keystone, is about 200 milliseconds right now.

We started with PKI tokens, a very naive choice: we were upgrading from Folsom and didn't change the configuration to use PKIZ. Then we realized, oh crap, these tokens are getting longer and longer. We had use cases on Swift where clients exchange tokens, and if you're exchanging a 1 KB object you carry a ton of metadata with the request just because of the token size. So we went and switched to PKIZ.
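The PKIZ provider is essentially the PKI token body run through zlib compression and URL-safe base64, with a marker prefix. A rough sketch of the mechanism, using a made-up JSON payload rather than a real CMS-signed token:

```python
import base64
import zlib

def pkiz_pack(token_body):
    """Pack a token body PKIZ-style: zlib-deflate, then URL-safe base64,
    with a 'PKIZ_' marker so validators can recognize the format."""
    compressed = zlib.compress(token_body)
    return "PKIZ_" + base64.urlsafe_b64encode(compressed).decode("ascii")

# A fake token body: service-catalog JSON is highly repetitive text,
# which is exactly why deflate shrinks real tokens so dramatically.
fake_token = (b'{"endpoints": ['
              + b",".join(b'{"url": "https://az%d.cloud.example/v2"}' % i
                          for i in range(10))
              + b"]}")
packed = pkiz_pack(fake_token)
print(len(packed), "bytes packed vs", len(fake_token), "bytes raw")
```

In the real provider the input is a DER-encoded CMS blob rather than raw JSON, but the size win comes from the same repetitive catalog content.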
It was a very simple, straightforward config change, and it dropped the token size by 60 percent, which was awesome. But I think the ideal situation is to have ephemeral tokens that you don't need to replicate at all. Then write latencies improve, and Galera becomes a much happier system, because if you take tokens out, the writes that remain, for users and projects, are a tiny fraction of the traffic to Keystone. Somehow that blueprint never happened in the community. There was also a great talk the other day, I think by an engineer from IBM, on the types of tokens and the new kinds of token coming to Keystone; we might take a look and see if we can reduce the write latencies and make the system more resilient. Because when millions of tokens are created every day, you also have to worry about purging them without freezing the database. In our initial rollout we had incidents where the script that purges tokens kicked in and everything froze, because the entire table was locked. So we actually patched Keystone to move the previous day's tokens to a different table and start a fresh one, so that the cleaner operations don't impact Keystone. We couldn't get some of these changes to the community in time because they were incidents happening right then and there and we had to fix them; hopefully we'll submit some of those patches upstream.

The other interesting point, which a lot of you probably know if you have been using Keystone and OpenStack for a while, is that the more AZs you add and the more cloud services you add, the bigger the tokens get. Why is that?
Because with every AZ you add a set of URLs to your service catalog. If you have 10 services in each AZ and you have 10 AZs, you have 100 URLs in the catalog, so the catalog keeps growing, and we have a lot more services being added to the cloud. This is not just OpenStack: the entire provision-deploy-monitor-remediate set of services in the control plane uses Keystone for identity and access management. So this became a humongous problem. It's pretty dumb if you think about it: why would you want to put a catalog in the token? So we ended up patching Keystone to reduce the token size to somewhat manageable levels. We are now at around 2 KB, and we don't carry the catalog in the token anymore; we had to patch some parts of OpenStack to get there.

The next topic I want to touch on is two-factor authentication. I think it is pretty clear today that if you are doing anything important, you have to use two-factor authentication; single-factor is no longer the right way to do things. We came to this realization late last year, and Keystone unfortunately was not where it needed to be. There were blueprints from 2012 and so on, but no activity around them, which is somewhat surprising given how many enterprises are using OpenStack and Keystone. The fact that two-factor authentication doesn't exist there is, I think, a shame on all of us. So we started looking at options.

To simplify the policy, we went back to how we create VPCs and projects. In our model a VPC is nothing but a label: when you create a project in Keystone, we put a label on it that says you are eBay marketplace production, or you are dev, or something else. That label is a text string, and it determines all the policies. Once I get a request, based on the project I know which VPC you belong to and which policies apply to you throughout the OpenStack deployment. We define the policy, say, VPC X requires two-factor authentication and VPC Y does not, and based on that we can enforce it completely. It is entirely dynamic and configuration-driven, which means as operators we can change the policy overnight and roll it out fairly quickly through our config management system.

The challenge we faced with two-factor authentication was how to keep all the client libraries working with Keystone: if you add a new extension, suddenly python-swiftclient or some other client starts breaking because it doesn't know how to authenticate itself. So we decided to overload the syntax of the v2 tokens API for the second factor. As far as the user is concerned, they still submit a username and a password; the password, if you're using RSA, is typically your four- or six-digit PIN followed by an ever-changing identifier. You submit that, and in the background, because we know which project you are trying to authenticate to, we know whether you need two-factor authentication, and if so we check with the backend system for validation. It is fairly straightforward from the user's perspective; we just had to make sure the compatibility was there. Token validation is also fully compatible: there is no change you have to make to support two-factor authentication.

The yucky part of this exercise was resync. As you may know, some two-factor authentication systems like RSA have this weird resync protocol; it's not a standard. You call up an IT guy and say, "My token is not working, what should I do?" He says, "Read me your number," and then he resyncs it.
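The overloaded-password flow described above can be sketched roughly as follows. This is an illustration of the idea, not eBay's actual patch: the PIN length, the 6-digit code format, and the `validate_pin`/`validate_otp` back ends are all assumptions standing in for corp LDAP and the RSA system.

```python
def split_passcode(password, pin_len=4):
    """Split an RSA-style passcode of the form '<static PIN><rotating code>'."""
    pin, otp = password[:pin_len], password[pin_len:]
    if len(otp) != 6 or not otp.isdigit():
        raise ValueError("expected <PIN> followed by a 6-digit rotating code")
    return pin, otp

def authenticate(project_requires_2fa, password, validate_pin, validate_otp):
    """Same v2 tokens call either way; the project's VPC label decides
    whether the password field is treated as PIN + second factor."""
    if not project_requires_2fa:
        return validate_pin(password)  # ordinary single-factor path
    pin, otp = split_passcode(password)
    return validate_pin(pin) and validate_otp(otp)
```

Because the request shape never changes, unmodified clients such as python-swiftclient keep working; only the server-side interpretation of the password field differs per VPC.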
So we had to do some clever extensions on Keystone to support that, so that the portals at eBay that use Keystone for authentication can prompt the user: wait for the next 60 seconds, give me your new token, and then it gets resynced. We built those extensions, and it would be great if we could standardize this model, because two-factor authentication is very important for all of us. This has been in production for about six months. We have enabled it for certain VPCs, not all of them yet, because the backend is not hugely scalable and you want to make sure it holds up under the traffic.

The third extension we made to Keystone is what is called an API key. The idea is fairly straightforward and well known in older, more mature cloud platforms like AWS. The purpose of an API key is to support headless use cases, where a user is not interactively driving the APIs through a portal or CLI. There's probably some code running in a VM, or a pool of VMs, and those VMs need to do something with the control plane: getting data in and out of Swift, say, or creating VMs to flex up a cluster. We had to support those headless use cases, but putting your corp credentials into VMs and distributing them around is a dumb idea, because you're leaking credentials everywhere, and two-factor authentication doesn't work because the codes are time-sensitive. So we had to think of ways to solve this without increasing the cloud's security exposure. We went through several iterations, each making it more secure and reducing attack vectors, and the latest iteration is going to prod just now. The principle is very straightforward.
As a user, you would use an API to create an ID and a secret, and use that ID and secret in place of your actual credentials. You can create as many of these credentials as you like; they are, in effect, temporary throwaway credentials, and as a user you can revoke them at any time. Say I deployed a 500-VM app and pushed the credentials into that cluster of VMs. If something happens, I can revoke all the API keys I minted for my project, and everything is back to normal. That is the principle we used. Amazon has features like this, and I believe HP at one point had something similar in their Helion distribution, but I don't think these ever made it upstream, and that's something we need to work on.

The model, as I said, is very straightforward; the challenge was how to reduce the attack surface. (Apologies, the diagram is not well formatted for this screen size, because of the way the browsers behaved when I was doing the screen capture.) The idea is that if I have project P1 and I mint an API key, you can use that key pair, that ID and secret, only in that context, and not everywhere. Because if you let it be used everywhere, it subverts things like two-factor authentication. Just imagine: I'm lazy, I don't want to carry my RSA PIN, and I find this feature. I could mint temporary credentials, start using the key pair, and use it until it expires. I'm effectively bypassing the two-factor authentication policy.
So we want to make sure it's very restricted. We took an approach where you can only operate in a limited context, with certain limited roles, and not use the key broadly. You still have to go back to your two-factor authentication if you want to do something bigger; the key is limited to programmatic use cases where the source and destination are well known. The way it works is that you log into Keystone; you have certain roles and belong to certain groups; you have your auth token; and then you say, give me a new ID and secret. At that time you can choose to grant some of the roles you already have. If you have roles A and B, you can grant role A for one use case and role B for another, but if you don't have role C, you cannot grant role C to any of your keys. So I'm creating a subset of my own abilities when I create the API key. I can also set an expiry, and if I don't, a default kicks in that we as administrators set. And I can choose a scope: from where am I going to use this?
Again, this is to ensure the key can't be used broadly. The default policy is that if you have a dev project and you're operating on dev resources, you can use it there; you can't take the same key and use it in a different context, say for prod use cases. Optionally, you can restrict even further, limiting operations to a particular subnet or set of IPs, further narrowing the scope of the keys. The outcome is an API key, an ID and a secret with an expiry minted on it, which I can then use in place of a user ID and password with every OpenStack API.

The API itself is equally straightforward once the token is created. You specify a source project, defaulting to the current project if omitted; an expiry; an optional set of roles, which must be a subset of the roles you already have; similarly a set of groups; and optionally a set of IP addresses. If you don't specify these, they default to your current VPC. And in the implementation we have several checks and balances to make sure the authentication stays fairly bounded: you can't step outside what you are allowed to do. If the key was minted with a set of IP addresses and the caller is not coming from those addresses, we block it. If you use it from a different VPC, the key doesn't work. If you use it from a different project, it doesn't work.
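The two rules just described, delegate only a subset of the roles you hold, and reject the key outside its declared scope, can be sketched as below. This is an illustrative model, not the actual eBay extension; all function and field names are made up.

```python
import time

def mint_api_key(user_roles, granted_roles, source_project,
                 allowed_ips=None, expires_in=3600):
    """Mint a throwaway credential carrying a subset of the caller's roles.
    expires_in falls back to an administrator-set default (here, one hour)."""
    if not set(granted_roles) <= set(user_roles):
        raise PermissionError("cannot grant a role you do not hold")
    return {
        "roles": set(granted_roles),
        "project": source_project,
        "allowed_ips": set(allowed_ips or []),
        "expires_at": time.time() + expires_in,
    }

def check_api_key(key, project, caller_ip):
    """Enforce the key's bounds at request time."""
    if time.time() >= key["expires_at"]:
        return False  # expired key
    if project != key["project"]:
        return False  # wrong project/VPC context
    if key["allowed_ips"] and caller_ip not in key["allowed_ips"]:
        return False  # caller outside the declared IP scope
    return True
```

Revocation would simply delete the stored key record, which is what makes these credentials safe to push into a fleet of VMs.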
So it's fairly limited. This does not take away the need for key management systems for distributing these secrets; you still need to do that. This is just one extra knob we added to Keystone.

As I said, we have not committed any of these enhancements upstream yet; we are working on submitting the blueprints and the code. And the reason we want to do this is not just because open source is cool, but because we want OpenStack to become the standard set of APIs for anything cloud-related. For instance, our Kubernetes journey relies heavily on OpenStack, and we want the code we write to provision and manage Kubernetes clusters on OpenStack to work with standard APIs, not require proprietary extensions that are eBay-specific or specific to some other company. That's why we want to take this code out and get some consensus on getting these features in. I think that's my last slide; we have ample time for Q&A, about five minutes. Thank you. Yes, please go ahead.

"Early on you mentioned that you had a global Keystone. Is that logically global, in terms of how you manage the service catalog, or is it actually a distributed implementation? And the second part of the question: you mentioned that between the different availability zones the tokens would get very large, and I wasn't quite sure exactly what you did to reduce that, or how you managed the size of the service catalogs that were in the tokens, or access to the data in them. Did you use endpoint filtering to limit their size to what people actually needed?"

Sure. The first question was about what we mean by a global Keystone. To take a step back, we have the notion of AZ-scoped services and global services in our cloud, and Keystone is one of the global-scoped services; most services are AZ-scoped. When we say something is global, it is still distributed, deployed in multiple availability zones and independently managed, but since Keystone is stateful, we ensure there is global replication and a global address for the customer. So it is a distributed system.

"Okay, and you manage the replication?" We manage the replication through Galera at the bottom of the stack. So if I create a project, say it gets created in the AZ on the left side; the read might later happen somewhere else, and you would read it there. The same projects and roles exist everywhere: if I go to Horizon, the dashboard we use, I see all my projects from everywhere.

"Right, and do you expect the same roles everywhere, in every AZ?" Yes, we expect the same roles everywhere, because otherwise it becomes a mess. Coming to the second question, how did we deal with the catalog size in Keystone?
So we actually patched Keystone, as well as some of the Python client libraries. We patched Keystone to remove the catalog completely from the token, which is easy, and then ran through several tests to see what breaks, and patched the things that broke so they no longer depend on the catalog. "So you did sort of a lazy evaluation, a lazy lookup, of the service catalog?" Absolutely, which I think is the right thing to do: there's an API to get the catalog, so you can always go and fetch it and use it, instead of carrying it everywhere, which is pretty heavy. "Okay, cool. Thanks." Yes, please.

I don't have the breakdown of how much of that traffic is headless. Because the cloud is open for everybody, we have ended up spending some time, though not a whole lot, figuring out who these users are. What happens is that clients pick up some client library in Java or Python and start using it without thinking about caching tokens, even though tokens are valid for a certain duration, and that continues to be one of the main reasons so many tokens get generated. There are also use cases with short-lived sessions: you log in, do something, and log out, and then somebody else does the same thing, so that is another driver. No, these are eBay users. Everybody in our company, every developer, every PM, every employee, can log into Keystone and do things on the cloud, so it's potentially thousands of users. When I last checked we had six or seven thousand users in Keystone and hundreds of thousands of projects. Again, it depends on the policy; yes, for some cases, yes. Any other questions? Thank you.