Hello, everyone. Thank you for joining. Today I'm going to present the hard life of securing a particle accelerator. I know, it's a catchy name. Security at CERN is a hot topic and covers many, many aspects; today we are mostly covering the identity and access management part.

My name is Antonio Nappi. As you can understand from my accent, I'm Italian. I have worked at CERN since 2015. My role is basically to provide the infrastructure that hosts the Java applications used in the daily life of CERN, and I was in charge of moving this infrastructure from VMs to Kubernetes, so I started working with Kubernetes really early, in 2016. Previously I was an OpenStack and Python developer.

And my name is Sebastian Łopieński. As you can perhaps guess from the letters in my name, I'm Polish. I have worked at CERN since 2001. Now I'm the service manager of the single sign-on service, but for many, many years I was doing computer security at CERN, so that's my main profile, I would say, and my background is software engineering.

So perhaps it's a good moment to very briefly remind you what CERN is, although hopefully many of you do know. It's the European laboratory for particle physics. The slogan is "accelerating science", but what physicists really do is study the fundamental laws of nature by doing experiments with particles, and this will actually be relevant to the requirements on our systems, which is why I'm covering it here. We operate a number of particle accelerators, including the Large Hadron Collider. You can see the ring here in the countryside around Geneva, Switzerland, but the accelerator is actually 100 metres underground and 27 kilometres long. This is where those huge machines, the particle detectors you can see in the picture, observe particle collisions and perhaps new particles being created.
And this is where the Higgs boson was discovered in 2012, which resulted in the Nobel Prize in Physics in 2013. However, closer to our domain, CERN is also the place where the web was born in 1989. Tim Berners-Lee, whom you can see in the picture, designed the HTTP protocol and the HTML language, and implemented the first browser (well, it wasn't called a browser at the time) and certainly the first HTTP server. So there is also a computing side to CERN.

What is also relevant is that it's really an international organization, with over 15,000 scientists from all around the world working together in a logic of peaceful collaboration for science. It's fundamental science: we don't do applied science, we don't do research on nuclear energy or weapons. And people come from all possible nationalities; again, this will matter in a moment.

As you could see, Sebastian and I have different backgrounds: he is more focused on the security side and the Keycloak application itself, and I am more focused on the infrastructure and deployment of Keycloak. So we are going to cover this in two parts: Sebastian starts with the Keycloak-specific part, and then I will focus on the deployment and infrastructure.

All right. I don't think I need to convince anyone in 2024 why having a centralized single sign-on service in an organization makes sense, so I will not cover all the details, just mention that this is one of those unique cases where you can achieve usability, security and cost efficiency at the same time. Single sign-on is obviously the way to provide authentication, and perhaps authorization, to different resources in a big, distributed organization. What we use at CERN is Keycloak. Now, just to get an understanding, could you raise your hands if you are at least somehow familiar with Keycloak? All right, excellent.
So most of you, which is very good; it means you're in the right room, that's great. So I'll not cover too many details, but for those who are less familiar: Keycloak is an open-source identity and access management solution. It provides single sign-on with support for multi-factor authentication, where the second factor could be OTP (one-time passwords) or WebAuthn tokens, and with role-based authorization. It allows user federation with Active Directory, LDAP or Kerberos servers. It supports external identity providers, so that people can log in at different organizations and still have a session in the CERN SSO, in our case. It supports social login, so people can log in with their Google or LinkedIn accounts. And, very importantly, it is built on standard authentication protocols: the modern ones such as OAuth2 and OIDC (OpenID Connect), or the not-so-modern SAML. I'm not a great fan, but anyway. And since we're here, I think you know this: Keycloak has been in CNCF incubation since spring last year.

So maybe I should briefly mention why CERN has gone for an on-premises single sign-on service and not one in a cloud, why we go for open source, and why we have chosen Keycloak. I mentioned that we operate particle accelerators and experiments. All this technical infrastructure must not be interrupted, and the computing systems that support it must really work while the machine is running, which means that we need full control over the configuration of the system, but also over the release and patching cycle, including for the single sign-on service. And we need the service to be available from our internal industrial control systems network, which is a private network with non-routable IP addresses, so that network could not go to the cloud. Well, and we value openness.
Open source is really compatible with CERN initiatives such as open science and open access; I would say open source is really in our DNA. And we really wanted to avoid vendor lock-in and, equally importantly, to avoid being subject to sanctions or export laws. We have, for example, scientists from Iran, and we want them to be able to authenticate to our services regardless of what may be decided, for good or bad reasons, by some politicians somewhere. So this is important for us.

And then Keycloak really fits our needs. It has a lot of big adopters, which proves it works at scale; a nicely growing usage in academia and research; a very strong user base; and it is actively developed, with many frequent releases. It is extensible, and I will mention later how we use that possibility, so it can be adapted to our needs. We started with Keycloak in 2018, with Keycloak 4. You may know that these days the recent release of Keycloak is 24, so we really started very early on.

This is how the service looks right now. We have 200,000 users, including external people who connect to CERN for whatever reason. That's a lot, but perhaps not a huge number. We have 10,000 OIDC clients, which are mostly web applications, but in Keycloak and the OIDC standard they are called clients. And that is a lot: normally organizations don't have so many applications behind their single sign-on. This is because at CERN everyone and their dog can set up a website and put it behind single sign-on. And we have 10,000 logins per hour during office hours, as you can see on the right. For those of you, especially in the back rows, who cannot see the shape of the chart: it's basically logins in the morning and in the afternoon, with a break for lunch. All right.
So our service, again based on Keycloak and really using Keycloak features, offers two-factor authentication with time-based one-time passwords or WebAuthn tokens (which could be hardware tokens or biometric devices, fingerprint readers and so on), and Kerberos authentication. It offers integration with eduGAIN, the educational identity federation, which means that people can log in to CERN, or rather have a session in the CERN single sign-on, by logging in at their university with their university credentials. It doesn't necessarily mean they're authorized to do anything yet, but at least they can connect and perhaps be given the privileges they need. We also support social logins (Google, Facebook, GitHub and LinkedIn) and guest accounts, which means that if someone needs to log in to a CERN system and have a session without a CERN identity, they can just create a local identity with an external email address.

So, our single sign-on is tightly integrated with what we call the CERN authorization service. I will not go into detail; the diagram on the right is not readable on purpose, it's just to show that it's a complex beast. It's a service that manages identities and accounts, applications and their authorization (roles, levels of assurance) and some 80,000 groups. Now, the decision back in 2018 was to implement this outside of Keycloak, for various reasons, which I will mention here. We at CERN have some complex use cases, which sometimes necessitate that we develop our own solutions around them. At that time, in 2018, Keycloak was much less mature than now, and some of the capabilities were limited. And we also wanted to keep the possibility of switching from Keycloak to another solution. We never used that possibility, and I don't think we will use it now.
You will see that we're very happy with Keycloak; however, this was the decision taken at the time. Now, it's important to mention that Keycloak does provide support for pretty much most, if not all, of those features, maybe not exactly the way we would use them, but still. So perhaps if we had to take the decision now, we wouldn't have done it this way.

All right, so I mentioned the extensions that we put into Keycloak so that it does some specific things that we want. This uses the so-called service provider interfaces, SPIs, so we can supply our own providers that run within Keycloak. Part of this is the integration with the CERN authorization service that I mentioned before, which also creates identities for external accounts when they log in to CERN for the first time. We of course have our own CERN theme, so that the login pages have the CERN look and feel, which is kind of normal, obvious. But also, a tiny but nice thing that I like: we developed a provider for the admin console theme, so that depending on which environment you connect to, there is a different colour of the banner. And this is very simple; it's just to avoid changing something in production while we think we're connected to dev, because we have so many tabs open. It's stupid, but it works: when I see the red banner, I don't click anymore. It really works.

All right, and a few of the extensions that we developed are not CERN-specific and can be useful for other people. One of them is an OTP validation endpoint: it checks whether a given OTP, a one-time password, this six-digit code, is currently valid for a given user. Why would we expose this? Because it is used by our custom PAM module on SSH servers to enforce 2FA on SSH access to sensitive machines, bastion hosts and so on. So that's how it works, which also means that we have the same OTP for web access and SSH access.
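To give an idea of what such a validation endpoint has to do underneath: a TOTP code is an HMAC-SHA1 over the current 30-second time step of a shared secret, so checking a code means recomputing it (with a small window for clock drift) and comparing. The CERN SPI itself is not public, so the function names below are not theirs; this is only a minimal Python sketch of the underlying RFC 4226/6238 check, verified against the RFC test vectors:

```python
import hashlib
import hmac
import struct
import time


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """RFC 4226 HOTP: HMAC-SHA1 over the counter, dynamically truncated."""
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp_valid(secret: bytes, code: str, at=None, step=30, window=1) -> bool:
    """RFC 6238 TOTP check with +/- `window` time steps of clock-drift tolerance."""
    counter = int(time.time() if at is None else at) // step
    return any(
        hmac.compare_digest(hotp(secret, counter + d), code)
        for d in range(-window, window + 1)
        if counter + d >= 0
    )


# RFC test secret; T=59s with a 30s step corresponds to counter 1.
secret = b"12345678901234567890"
print(hotp(secret, 1))                       # "287082" (RFC vector, 6 digits)
print(totp_valid(secret, "287082", at=59))   # True
```

A real deployment would additionally have to enforce single use of each accepted code and rate-limit attempts, which Keycloak's OTP policy already handles on the web login side.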
We also have a provider to detect compromised passwords. How it works: of course we don't keep clear-text passwords anywhere, obviously, but when a user logs in, the password is hashed and then compared against a list of known compromised passwords. We have this huge database that comes from Have I Been Pwned (so, Troy Hunt) and many other security sources: passwords that were compromised somewhere and appeared on some list. And we have our own CERN CAPTCHA, which is used when guest accounts are being created. It replaces the default Google reCAPTCHA, which we did for privacy reasons and also for availability reasons, so that people can register their guest accounts from countries where Google is perhaps blocked or not available.

All right. So of course, if you start using Keycloak seriously, you hit some small limits, and there are some challenges. Let me mention some of them. There are obviously minor inconsistencies, limitations and bugs. One example: we discovered very recently that when we edit, in Keycloak, a user that is already temporarily blocked in Active Directory, editing the user blocks it in the Keycloak database too, which means that if the user is then unblocked in Active Directory, the user is still blocked in Keycloak and cannot log in. So I think there is something strange here; we haven't really reported it yet, it's a very fresh finding, but sometimes you find things in corner cases like this that hit us. There are also some small inconsistencies in logging: maybe the username doesn't appear in the log message but the user ID does, or maybe the username appears, but in the user ID field. So sometimes it's inconsistent and requires special parsing of the logs.
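Coming back to the compromised-password provider for a moment: the standard approach, the one the public Have I Been Pwned data is built for, is to SHA-1-hash the candidate password and look the hash up in the breach corpus. CERN's provider and database are not public, so this is only an illustrative sketch of that hash-and-compare step against a tiny local stand-in list:

```python
import hashlib

# Tiny stand-in for the real corpus of compromised password hashes
# (in reality: hundreds of millions of SHA-1 hashes from HIBP and other feeds).
COMPROMISED_SHA1 = {
    "5BAA61E4C9B93F3F0682250B6CF8331B7EE68FD8",  # "password"
    "7C4A8D09CA3762AF61E59520943DC26494F8941B",  # "123456"
}


def is_compromised(password: str) -> bool:
    """Hash the candidate password and check it against the breach list."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest in COMPROMISED_SHA1


print(is_compromised("password"))                     # True
print(is_compromised("correct horse battery staple")) # False (not in the stand-in list)
```

Against the real HIBP data you would typically use the k-anonymity range API: send only the first five hex characters of the hash and compare the returned suffixes locally, so the full hash never leaves your service.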
Another thing is the admin console, which actually shows different features and different information depending on which theme you choose for the admin console, which is kind of strange. OK, then major versions: when we upgrade, they obviously occasionally bring breaking changes, including sometimes unexpected breaking changes. When we migrated from Keycloak 19 to 20, the openid scope became mandatory in requests to the userinfo endpoint. This was to make it standards-compliant, which makes perfect sense, except that some clients were not standards-compliant: they were not sending the openid scope in their requests, and all of a sudden things started breaking, so there was some research to do.

Some features stay in preview mode for a long time. One of them is the token exchange support, which was mentioned yesterday; there was a very nice presentation in the morning, some of you may have attended it. It works very well, and we actually use it a lot. But should we trust it? Will it change? If you go to the Keycloak Discourse forum, people ask about this regularly; actually, each link here is a different Discourse thread where people ask, every year, will it stay in preview mode, what are the plans. So it's sometimes a little bit frustrating. However, I'm very happy to say that just two months ago Thomas, one of the Keycloak maintainers, who is perhaps even in this room, I'm not sure, published plans to move this feature out of preview, which is great news. So thanks, it's very much appreciated. This is also to show that things change and really get better from one release to another.

One last thing I want to mention: since we manage, and I think by default you manage, Keycloak with the admin console, the web UI, which is great, it comes with drawbacks. There is no versioning of the configuration, there is no change detection, and there is limited traceability of who changes what if you have several admins; well, you can find it in the logs eventually, but this is what it is. How we deal with it: we have our custom solution to regularly dump the configuration (realm exports, realm and other settings), manipulate the JSON a little bit so that objects are sorted and comparable, and then push it to Git. This means that whenever there is a change, there is a new commit, so we can track what has changed in our configuration. If you manage Keycloak, I really recommend this approach.

Thank you, Sebastian. As you can maybe understand, the CERN single sign-on is probably the most critical service in the IT department and the whole of CERN. The reason is that it is not only used for the daily life of CERN, where to have access to any application (administration, financial, engineering) you need to log in, but the experiments are also using it for data taking. This means that if the accelerator runs and some of the tools that monitor the accelerator, and the data taken from the collisions, don't have access to the single sign-on, that's a problem.

So back in 2022, I think November, we started to review the infrastructure of the single sign-on, because there were some issues with performance. I put here how it was deployed at that time. You can see that everything was managed on VMs with Puppet. There was a frontend layer with basically two HAProxy machines, one active and one passive, and the switch between the two was taking 15 minutes. This means that if the active machine went down, there were 15 minutes where Keycloak itself was basically not serving anything. Then behind this frontend there were the Keycloak backend servers: multiple VMs in multiple availability zones, where all the Keycloak processes were running together with Infinispan. This meant that all operations were much more difficult; when, for example, the SSO team had to change an SPI, they had to be really careful not to lose user sessions. And as I said, everything was maintained by Puppet, and actually
there was probably only one maintainer at that time, constantly updating the Puppet module for each new version of Keycloak. So we decided to look at this and propose an alternative architecture.

The choice was quite easy: we decided to move everything to Kubernetes. You are probably already aware of the reasons to move to Kubernetes, but this was not so obvious in our department and to some of the management, so we really had to demonstrate that moving to Kubernetes was a good choice. The first thing is that the Keycloak direction was clear: JBoss, the application server that was hosting the Keycloak application, was basically replaced by Quarkus, a Java framework designed for Kubernetes, and they were also providing a Kubernetes operator for deployment, which was making things much, much easier. Plus, you get all the advantages of Kubernetes. Keycloak itself becomes much more portable: now we can move across multiple clouds, on-premises or public cloud, without any problem, because we use Kubernetes as the deployment platform. And then, of course, it is reproducible and immutable. This speeds up operations a lot and reduces the team effort, because before, a lot of effort went into keeping this infrastructure up and running rather than focusing on the needs of the end users of Keycloak. And this of course makes the infrastructure much easier to maintain in the long term, because we have a vibrant community around Kubernetes and Keycloak, while in the Puppet world it was basically one guy dedicating his time to a Puppet module, who at any time could just decide to move to Kubernetes or something else as well.

So this is how it looks now. The big change is that Git becomes the source of truth: we have all the logging and monitoring parts, the Keycloak operator and the Keycloak configuration in CRDs in Git, and then Argo CD automatically synchronizes this to multiple clusters. I'm a huge fan of the Kubernetes "cattle" service model (which is maybe not nice for the cattle), and the idea is that all the Keycloak instances run in different Kubernetes clusters, each of them in a different availability zone. We also decided to split Keycloak from Infinispan; I will explain the reasons and the advantages a bit later, but basically this was a huge change for us. Infinispan actually runs on VMs, well, not fully on VMs: we still use Puppet, but only to spin up Podman. So Infinispan is basically a container running in Podman, and Puppet is only used to configure Podman, and that's it. So the last piece is here: there is a dedicated Infinispan cluster. Plus, we also replaced the load balancer: we now have a cluster of three machines and we use a floating IP, so every time the active machine goes down, the IP is moved to one of the passive nodes, and this is almost invisible to end users. The failover is almost instantaneous, so we basically have no downtime from that. And then, I won't go too much into detail here because it would take too much time, but we also started to update all the monitoring and logging: we replaced the Flume-based logging with Fluent Bit, rewriting all the parsing and so on, and we also started to adopt Prometheus, which was already partially used but is now fully containerized.

Now, as I said, we had to demonstrate that this move was actually worth it, because we had to spend some resources. The team at CERN is extremely small, and sometimes people don't see a reason to change something that works, even if there is a clear gain and it seems obvious. So, since we were adding a new virtualization layer, Kubernetes, we wanted to demonstrate that we were not losing any performance. Basically, what we did: we upgraded to version 20 of Keycloak and started to stress test. We used a closed workload model, where a fixed number of users, in
this case 50 concurrent users, keep executing the same scenario repeatedly. This means that the more requests your server is able to handle, the more requests will come to your server. This ran for 10 minutes, and as you can probably see on the screen, the new infrastructure based on Kubernetes, with the separation of Keycloak from Infinispan, was four times more efficient than the previous one: we were able to handle many more requests. That was a way to get the green light from management to go forward.

And now I want to focus on the split between Infinispan and Keycloak, because I think this was the real breakthrough of the infrastructure. Why did we decide to do that? First, because of experience with Java applications: as a team we have a lot of experience with Java applications and caching, and we always prefer the model of having the cache separate from the application, although in this case we still had to demonstrate that it was useful. Plus, Infinispan and Keycloak scale differently: Keycloak can be almost stateless, so it can scale from zero to whatever, while Infinispan, depending on how many times you replicate the cache, has some performance issues; beyond a certain size, it has some performance losses. Also, I remember there was an issue with Keycloak where the question was which component was actually failing, Infinispan or Keycloak. In the previous VM model, since they were sharing the same Java process, it was almost impossible to understand which one was using more CPU or more memory; having split them, we now have a much clearer view of what is going on and which part is memory-intensive. Of course, this simplifies operations a lot, because it makes Keycloak stateless. Now, if you need to upgrade an SPI, for example, or the theme or whatever, you just quickly restart the Keycloak pods, and this is almost invisible to end users: they don't even realize, and they don't lose their sessions, because those are held in Infinispan. Before, it was still possible, but you had to have much more coordination: you had to start the first node, wait for it to be up, wait for the cache to be replicated to another node, and so on, so the operation was taking much longer. Now you kill the pod, it's up in 40 seconds, and that's it: you are happy and no one sees anything. Actually, I can tell you: in the first week after we moved to Kubernetes, we had some issues with the Java settings (they were too low), so basically for three days all the pods were restarting every three hours, at different times, and there were no complaints. And this is a service that is fully utilized basically every day, because of course we have people at CERN who work from 8 a.m. to 6 p.m., but there are also people in the States and in Asia connecting through the CERN SSO at any time, even at night, so this is a service that has to be always up.

How to run Keycloak against an external Infinispan is now actually fully documented in the Keycloak documentation, but when I looked at it, I think at the beginning of 2023, I remember there was no clear way to do it. It was possible, but I had to dig a lot in GitHub issues to find examples from people trying the same thing; it was not well documented. I think the documentation update is recent, maybe with the multi-site setup. So basically, what we did: we created a ConfigMap out of the Infinispan configuration, where we specify the remote servers. We set up the Infinispan cluster behind a DNS alias, so the three IPs are behind the DNS name, and we basically tell Keycloak to go there. Then we mount this file into Keycloak through volumes and volume mounts, like any Kubernetes resource, and we specify, through the cache-config-file option, where Keycloak should look for this config file.

Now, we moved Keycloak to Kubernetes in September 2023, so it's been more than six months; we are going on the seventh month. And I want to highlight the good things of this move. First of
all, operations: as we probably all know, Kubernetes makes all operations much faster and easier. There is less time spent coordinating, because you don't need to be careful with Infinispan; you can restart Keycloak without any problem. Then we introduced the GitOps approach. If you remember what Sebastian just said, there is no real way to track changes; there are still pieces that live in the database and are not easy to track, but all the other changes, those in the CRDs and also the Docker images, are easily tracked by Git, because there is a merge request, someone has to review it, and so on, and if you see any problem you can just revert to the previous commit. We demonstrated that this infrastructure is much more reliable than in the past: in the last six months we never had any issue, while before it was happening a bit more often. And this is a redundant architecture as well, so it even uses more resources than in the past, which is kind of required, because everything is in different availability zones. And we also know that the time we now spend on operations and on the infrastructure is much less than it was before. Before, it was basically eating the time of the SSO team; now it's basically just a small fraction that they have to spend from time to time, for example to upgrade Keycloak or things like that.

Less good things: you know, when I went to management with this presentation and showed it to them, there is the "unsupported" field. It has been there for many years and is extremely useful, but it's a bit hard to justify: people get scared when they see "unsupported" in something that is running in production; it is somewhat hard to explain. Infinispan is still on VMs, and we want to move it to Kubernetes as well. The main issue is that multi-cluster approaches and stateful workloads are usually not best friends. We are looking at service meshes to see how we can achieve that. And then, and this is probably a bit provocative: is there an alternative cache to Infinispan? Keycloak is part of the CNCF, but Infinispan, which is basically required to run it, is not. So what is the future of that?

Our plans: I think we are almost there. We want to prepare a BC/DR (business continuity and disaster recovery) plan, because this is what we were asked for. As I said, this is an extremely critical service: if it goes down, the whole of CERN's activities will basically stop, and this is not reasonable. We want to investigate a service mesh for the Infinispan deployment, probably something like Cilium, to basically be able to run multiple instances of Infinispan in multiple clusters, or even to look at the multi-site deployment that was recently advertised on the Keycloak blog. So this is something we will see in the next year. I leave the floor to Sebastian for the Keycloak part. Thanks.

And the other part of our plans is to define how we upgrade, and keep upgrading, Keycloak: how far away do we want to be from the most recent mainstream release, do we want to apply every minor version that appears? You see that they tend to appear: in one very particular case, 24.0.1 appeared just one day after 24.0.0. I guess we would never go to production with the 0.0 of a new major release anyway, but we still have to get a feeling for how often and how we should upgrade. We certainly want to contribute back to Keycloak. We have, very slowly, started, but I think the community deserves more, so whatever we develop internally, if it's usable, we will want to contribute back. And perhaps we should actually reassess whether we should be using, or using more of, Keycloak's so-called authorization services, which I haven't really mentioned much, because this is what we currently have implemented outside of Keycloak, but perhaps we should just use what Keycloak provides.

All right, and this brings us to the last word and the last slide, and there are actually two very simple messages. We are very happy with Keycloak: it's great software with a strong community behind it,
so we absolutely have no plans to change: we will use it, happily grow with it, and see how Keycloak grows. And we are very happy with the move to Kubernetes hosting: it's obviously the mainstream, supported approach to hosting Keycloak, it gives us a much more reliable infrastructure, as Antonio mentioned before, and it makes it easy to test and deploy changes. It's just the way to do it. All right, and with this, we thank you very much for your attention.

Thank you. Any questions? I think we have one minute for questions, if you have any. There are microphones in the middle.

I already have one question: how do you deal with security events when it comes to Keycloak? So you get a failed user login, and it happens like 60 times in an hour. Do you actually monitor for that?

So the question is if and how we deal with security events, for example when users get compromised or when we see some suspicious activity. This is mostly done by the computer security team; for 15 years there have been various mechanisms in place, and it's a separate discussion, but basically they do analyze the login logs that come from Keycloak for various aspects, including, for example, connections from unknown or, let's say, suspicious locations. For example, when I connected from the train to Paris, or when I connected from a hotel, I got a message from CERN saying: you've connected from a location that you didn't use before. Is it you? If not, please let us know, please report it. And there are other mechanisms. If I may add as well: we have a system to detect, because sometimes it's not only compromised users, there are also some users who are testing their Keycloak integration and start hammering the service. It means people doing 10,000 logins in less than 3 minutes, which is not reasonable. We don't really have a policy; at the moment we contact the user and say, please, can you stop this, and we figure out what is going on. But we would like to have something more autonomous to react, and then block these kinds of users if they start hammering the service. So thanks for the question.

I have a question: since you decided to split Infinispan from Keycloak and run it as a separate service, why Infinispan and not Redis, for example?

We cannot use Redis, because Keycloak only supports Infinispan as its cache. So that's actually my provocative message: why can't we support multiple cache systems? Well, thank you.

Hi, thank you very much for the nice talk. Did you consider using infrastructure-as-code tooling like Terraform instead of exporting JSONs, and why did you not use it?

I think what Sebastian mentioned was mostly related to the realm export. When we started, the realm import was not part of the GitOps flow. We actually discussed this yesterday, and I think we need to align on that, because when we did this, there was the change from the old Keycloak operator to the new one, and the realm import was removed from the GitOps part. I see that in recent versions the realm can actually be part of the CRD, so probably we will look at that. I don't usually like Terraform, because there is no concept of reconciliation: basically, you push something, but then you are not sure what you are going to get. So that's why we usually prefer GitOps. Actually, I think it would be nice to have even more configuration in Git than in the database, but this is going to be a difficult task. Thanks for the questions.

Are there any other questions? In any case, we are going to stay in this room for certainly some minutes more, and we welcome you very much to approach us directly if you have more things to discuss, or if you have other experiences to share, because we also want to learn from other Keycloak users, and I guess there are many here. So, thanks.