Hey everybody, thanks for coming out to one of the last talks at DevConf on Friday. My name is Dustin Minnick, and I'm going to talk about how Red Hat IT runs our SSO servers. So if you've ever logged into access.redhat.com and filed a case, or logged into our websites and downloaded some of our software, or done any of that kind of stuff, that's what my team is responsible for.

So I'm going to start with a problem statement. The problem statement is: SSO is very, very important. If SSO is down, we can't take your money, tons of other applications won't work, and my team gets yelled at like crazy.

Of course, at Red Hat we're going to do everything we can to use open source software. Outside of the free aspect, the financial gains of that, and also the philosophical aspects of that, using open source software allows you to do flexible things. Flexible in this case can mean things like this: we have very specific business requirements from time to time. Our legal department says, you know, you can't sell specific software to specific countries. So those are things we actually implement in the login flow.

There are also cases where, since we have this up and running, we could potentially link it up with other data. So instead of going to a SaaS provider that does SSO, we could start to do data mining and say, okay, the last five times this person logged in, they looked at OpenShift. Well, maybe the sixth time they log in, we'll send them to a page that is about OpenShift and that also gives them a discount if they want to purchase it. So just that availability of data could let us do all kinds of things. We could potentially change the login flow, or what somebody is doing, on a per-user basis if we wanted to. And that's something you won't get with proprietary software or SaaS providers of single sign-on.

The other thing that Red Hat tries to do is focus on hybrid cloud, and that's
because it prevents lock-in. You know, people used to complain all the time that Microsoft does things this way and Microsoft is writing their own standard, and in a lot of ways Amazon AWS is starting to do that same thing, or has been doing that same thing. So it's kind of important to us that we run SSO both on-site in some of our data centers and that we're flexible enough to run it in other cloud vendors' solutions.

One of the reasons is that we have people in our company, people in our data centers, that make mistakes. That's what I call clumsy Kyle prevention. So we have a guy, you know, he'll touch something he shouldn't touch, or he'll make changes a little bit too cowboy-admin, and things will go down. Your companies probably have people like that. So do other cloud vendors. So even if you're paying exorbitant amounts of money to Azure or AWS or whatever, they're still going to make mistakes. So it's important to have your systems in multiple places.

It's also risk-averse, because as time goes on, depending on what industry you're in and what you're doing, the people that you're hosting your SSO servers with, or any servers with, cloud-wise, could start to become your competitors. So they could start to jack up your prices, or in worst-case scenarios they could start competing with exactly what you're trying to do. Red Hat has justifiable fears around that, so we try to spread our setups across different cloud providers where possible.

So this is a slide about our infrastructure. We have three sites, and each site is running multiple RHSSO servers. RHSSO is the downstream.
That's the productized version of Keycloak. At each one of those sites, those RHSSO servers then talk to JDG. JDG is JBoss Data Grid, which is downstream from Infinispan. They also talk to Galera. JDG shares information between the sites, and Galera shares information between the sites. So we have global load balancing that routes your request to the closest geo, we have a load balancer in each geo, and then we have four beefy RHSSO VMs that store and read data from MariaDB VMs and JDG VMs. Like I mentioned before, the MariaDB and JDG VMs replicate to each other across sites. We're currently sustaining about a million unique logins on a daily basis, and we were able to sustain a full data center outage with this setup, so we were pretty proud of that. A lot of the other functionality that Red Hat offers through some of our other services did not fare as well, so people could log in but not actually do that much. But that made us look really good.

How we manage the different sites and the different cloud providers is with Ansible. So we have Ansible roles and playbooks that will go out and stand up VMs in RHEV, or instances in OpenStack, or instances in AWS. We also do our releases using Ansible.

The RHSSO instances that we run are very heavily customized. That's one of the great things about Keycloak: you can write your own SPIs to do different things. I'll get to this in just a second, but basically, out of the box RHSSO or Keycloak supports user federation through Kerberos and LDAP, but our business isn't using those things to store user data for our external customers. So our team has been able to extend Keycloak internally to support where our customer data actually lives.

A little bit about the software. If anybody was in here for the talk that Mark gave, he already went over all of this in better detail than I will; this is a one-slide thing. But Keycloak, or RHSSO, is a stable,
flexible, multi-tenant-capable, federated SSO server. It supports your standard SSO protocols like SAML, OIDC, and OAuth. I mentioned that it does Kerberos and LDAP user federation. It can also do brokered and social logins, and it's manageable through a GUI or a REST API.

One of the neat things that we're starting to explore with brokered logins is actually letting different cloud providers or different vendors talk to our IDPs. So, for instance, customers that are running VMs in Azure can now fairly seamlessly come over to our support portal and file, you know, incidents or requests and tickets and stuff like that. Even though they didn't buy an entitlement through normal channels, they are using an entitlement hourly, or however Azure charges for that. They can come over just like a normal customer would, without having to go through the annoyance of lots of registration screens on both sides, and go ahead and submit tickets and get support.

I did mention that Keycloak is very flexible. We have all kinds of custom SPIs. In our case, we're pulling user data from Mongo. We also make REST API calls to other services that people have written internally, so we have APIs that we call to get additional information about users and groups, and we also have APIs that we call for legal reasons and other things of that nature. Here again, we could extend this to be as complex as we want it to be, since it's flexible and open source.

Infinispan, or JBoss Data Grid, is a distributed in-memory key-value store. If you're not familiar with Infinispan, in general you can think of it as sort of similar to Redis, if you happen to be familiar with that, or memcached. For our setup, we've configured it in a replicated manner.
That just means that whatever we're storing on one site, the other sites get the same information. The Infinispan cluster with JDG stores runtime information, so it stores things like user sessions and offline tokens. This allows people to do data center hopping without re-authentication, which is pretty sweet. So again, if our primary data center goes down, or you lose the sticky session that our global load balancer provided you with and you start going to a different data center, you're still going to be logged in.

We run MariaDB, and, you know, that's kind of just the standard database that tons of things use. It stores basic Keycloak config and integration settings. So whenever you configure a client in Keycloak, that is, like, a SAML integration or an OIDC integration, it stores those settings for each one of those clients, and it also caches some user information. We use Galera for the MariaDB replication, and we do that because it's synchronous in nature. We did that to prevent race conditions.
So originally, when we had multiple data centers: we have some really interesting customers and interesting people that resell some of our subscription models, and basically we would have data center one and data center two, and within the same second an account would be logging into both data centers at the same time. So, you know, that's not coming from one specific device, because it would have been tracked with a sticky session or something. But we have these weird situations where, within a second, people would be logging into two places at once, and those, I'm told, are valid use cases. So this allows us to get past that.

We also had an issue with OIDC authorization code flows. In the authorization code flow, your browser is going to get a code back, it's going to hand off that code to a back-end server, and then that back-end server is going to reach out to the SSO servers and swap that code for a token. What we were seeing happen is that this was faster than the standard MariaDB replication was. So the client would go and get a cookie and stick to DC one, and they would go ahead and get that code. They'd hand it to the back-end server. The back-end server now does a direct call out, so it's not stuck to the same data center that the actual end user went to. It would land on a different data center, and that data center would then say, I don't know anything about that code, because it happened so quickly that the MariaDB asynchronous replication had not copied that code to the other data center yet.

I've already mentioned some of this stuff. On top of those core products, we have our own custom special sauce. So we have user federation back ends. I mentioned that this allows us to query Mongo for users. It also allows us to do things like legal checks and things of that nature. We also have different login and registration flows.
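
To make the authorization code race from a moment ago concrete, here is a minimal sketch in Python. It is a toy simulation, not how Keycloak, MariaDB, or Galera are actually implemented: the `Site` and `Cluster` classes, the `run` helper, and the user name are all hypothetical, and single-use code handling is left out to keep it short. It only illustrates why a code issued at one data center can be unknown at another under asynchronous replication, and why synchronous replication avoids that.

```python
import secrets


class Site:
    """One data center with its own local copy of issued codes (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.codes = {}  # authorization code -> user


class Cluster:
    """Toy model of several sites that share authorization codes."""

    def __init__(self, sites, synchronous):
        self.sites = sites
        self.synchronous = synchronous
        self.pending = []  # writes not yet copied to the other sites

    def issue_code(self, site, user):
        """Issue a code at one site, replicating it now or later."""
        code = secrets.token_hex(8)
        site.codes[code] = user
        if self.synchronous:
            # Synchronous: every site has the code before we return.
            for other in self.sites:
                other.codes[code] = user
        else:
            # Asynchronous: the other sites only catch up on replicate().
            self.pending.append((code, user))
        return code

    def replicate(self):
        """Deliver queued writes, as an asynchronous replica eventually would."""
        for code, user in self.pending:
            for site in self.sites:
                site.codes[code] = user
        self.pending.clear()

    def exchange_code(self, site, code):
        """Swap a code for a token at the given site, or return None if unknown."""
        user = site.codes.get(code)
        return None if user is None else f"token-for-{user}"


def run(synchronous):
    dc1, dc2 = Site("dc1"), Site("dc2")
    cluster = Cluster([dc1, dc2], synchronous)
    code = cluster.issue_code(dc1, "alice")   # the browser is stuck to dc1
    return cluster.exchange_code(dc2, code)   # the back end lands on dc2


print(run(synchronous=False))  # asynchronous: dc2 has not seen the code yet
print(run(synchronous=True))   # synchronous: dc2 already knows the code
```

In the asynchronous run the exchange at the second site fails until `replicate()` has been called, which mirrors the "I don't know anything about that code" response. In the synchronous run the code is present everywhere before the browser ever receives it.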
So for different types of users, we have user classes, and whenever they log in they might get different screens. That's especially true when it comes to terms and legal agreements. We also have different registration flows. Some of the stuff that Keycloak provides out of the box is a great foundation and will actually meet most people's needs, but we have a big enough company, and people are particular enough about how they want things to work, that we had to take some of the things that Keycloak provides out of the box, like user registration, and customize them even further to do what the business actually wants.

Brokering I mentioned: we're setting up cross-IDP trust between different vendors and our SSO servers. Our team has also written new protocol support, so we now do Docker auth. If you download containers in an authenticated fashion from the Red Hat Container Catalog, whenever you do that docker login you're actually logging into the SSO servers.

A little bit about our future. Now that we have three sites up and running and it's working well, we're trying to contemplate what to do next. One thing we haven't really tested all that much yet is how well we could do horizontal scaling versus vertical scaling. Right now each data center has four Keycloak nodes, and each node has, I want to say, eight cores and eight gigs of memory, and we've actually seen those taxed at times. It's usually not core Keycloak's fault. It's usually that the custom code we've written on top of it is not optimized. Our developers have made that problem largely go away, but still, it can be hefty and it can use a lot of resources. So we have to figure out: does it make sense to move that to containers? Will it scale at, like, one core and one gig of memory?
With, you know, eight containers, or, I guess, it'd be more like 16 or 24 containers, instead of four large VMs?

Better auto-scaling would be great. Right now we kind of notice if things are being taxed too heavily, and then we kick off some manual processes to scale out some.

We're starting to ask about more sites, and we're starting to question whether cheaper and more diverse is better. Right now one of our sites is AWS, and the servers there are spanned across multiple availability zones and they're all highly available; all the high-availability stuff is done. But I think there are questions to be asked, like: instead of doing that, why don't we use one availability zone and maybe one server in each region? So, you know, instead of having big clusters in each region, let's have more regions and smaller clusters in each region, because then we would serve people closer to them.

We'd love to also look at blue-green or canary deployments.
So again, now that we have multiple data centers working, I would love to be able to say: okay, a developer on my team is going to release a new feature. Let's send 10% of our traffic to that new feature, which is only running in our AWS cluster, or AWS US East cluster, and then we can watch and see if there are any issues, instead of rolling it out to all three data centers at once.

I'd love to see us use some Chaos Monkey type fool-proofing. Once you start doing active-active-active across multiple data centers, you run into weird situations, and they're pretty hard to troubleshoot, because it's hard to track back the path that somebody took and where the logs are, or even to have reproducers. So it goes from, you know, well, I'm going to hop on a couple of boxes and look at logs, to, okay, we definitely have to use Splunk and look at all the logs across all the data centers. And then we have to do performance profiles of each one, because AWS's t2.large isn't the same as Azure's whatever-it's-called. And then you're also a victim of whatever hardware they're running and whatever oversell factor they have. The same is true of our own data centers. But you've got to be able to watch some of that stuff, and that's harder to do the more differentiated and spread out your stuff is.

Cloud data enrichment is something that Keycloak will allow us to do, because it's flexible and open source. So again, I was saying that we have data about our users in Mongo locally, but we also have SaaS vendors that store some data about our users. We can now start calling out to SaaS vendors, getting that information back about a user, and then including that in the SAML assertion or the access token in OpenID Connect. We could do that, again, on a per-user basis, or however we chose to tackle that problem. So again, just a shout-out: the flexibility is great.

The other thing that I'd like to see is us making site management easier. So our
global load balancer is, I mean, it's a vendor. It's a well-known vendor, but you have to manage it basically through a web page, and the web page is hard to navigate, and they like to have, like, a person involved, and, you know, professional services, and back and forth. I would love for us to at least have some scripts to do some easy stuff, like, okay, bring down data center two, and we just fire that off instead of having to get other people and other companies involved.

That is all I have. Does anybody have any questions?

Okay, the question was to talk about the three sites: basically, what geos are they in? Yeah, so right now we're doing very poorly as far as geo distribution goes. We have some stuff in, I believe it's either US East or US West in AWS. I don't remember. I think it's US East. And then the other two data centers are also in the US. So that's not good at all.

Where is our actual user data stored? Our actual user data, like user names, passwords, email addresses, stuff like that, is stored in a Mongo database. So we have written a Keycloak SPI that is similar to the LDAP federation that comes out of the box, but instead queries MongoDB.

The geo-replication is normal, is that the question? Oh, yes. Yes, MongoDB is geo-replicated. Yes. Oh, yeah, and in the US only.

Sure, so the question was: is the reason that we only host stuff in the US because of GDPR concerns? And I would say no, I think it's due more to poor planning.

We are running 7.2. I'm sorry, the question was what version of RHSSO are we running in production: 7.2.

Are there any major issues that we face?
So yes. When we first started going down the hybrid cloud, active-active-active path, we were doing that with Keycloak and with Galera and JDG, and all of that was us working with engineering and getting that support added. When we originally started on that path, there wasn't much support for that in the product yet. So we've been helping them grow, and they've been fantastic with growing support for that as well. But in that growing is, you know, hitting bugs and having to file things and working through issues.

Anything else?

Yeah, so the question was: what is the valid use case where customers would log in to various data centers at the same time? Some of that was basically automation tests by some of our developers, and the rest of it was a very specific vendor that resells something that's sort of similar to Satellite to their customers with on-site appliances. They do this really nasty thing where they scrape our login page and then enter credentials, and then that logs in. So every one of their devices that they've ever set up is scraping our login page and logging in with the same credentials. We've been trying to get them to do more sane things.

Be ready, and hire people. I mean, it's a lot, so it depends on what your company is trying to do as well. At Red Hat we have a lot of people developing apps, and it can be somebody that is literally running something on a computer underneath their desk and wants SSO support for that device, or it could be any manager that has a corporate card and wants to buy some SaaS service. So we have, I think, last I checked, 120 or 130 SSO integrations on our associate IDP. What I just talked about here was our customer IDP; we also run an associate IDP for actual Red Hat employees. And when you purchase something from a SaaS vendor and they offer SSO support, one would hope that they would actually know what they're doing, but that is not always the case. So we find ourselves spending, like,
six months going back and forth with some SaaS vendor who has decided to write their own SAML library instead of using something that is already out there and tried and true and tested. So outside of dealing with that kind of stuff, and knowing the protocols so that you can offer support, you also need to know, you know, Keycloak. And JDG, at least for me, adds a whole other realm of complexity, because to me that's like black magic. I don't know how that works, but luckily there's a guy on my team that does. And MariaDB, that's easy stuff; people have done that for a long time. Keycloak itself is easy to maintain, and MariaDB is easy to maintain.

All right, looks like we're out of time. If anybody has any more questions, I'll be around for a little bit. I'll also be here tomorrow, so feel free to come up.