OK. Sorry for the delay; my laptop didn't like the projector for some reason. I'm Jose Castro Leon, and I work in the CERN Cloud team. I'm responsible for identity and automation in the CERN Cloud, and this talk is about what we are doing in automation and how we are moving towards an automated CERN cloud. If you want to know what the Cloud team looks like, this is it: this is actually the lunch we organized to celebrate upgrading the whole cloud to Queens, back in July. And I think we are missing one celebration, because we are actually on Rocky now. Anyway, the plan is a brief introduction to CERN and what the cloud service offers; then a look at the automation we are doing at the moment; then the upcoming challenges and what we want to do next; and if we have time, we can also check the source code, so I can show you what we are running.

CERN is the European Organization for Nuclear Research, the biggest laboratory for particle physics. It was founded in 1954 and comprises 22 member states, but there are many more countries involved; here you have a map of all the states that collaborate with CERN at some level. We are mostly known for one of our accelerators, the LHC, but it is only the last stage of a whole complex of accelerators that we run. It's the one with the 27-kilometre circumference, and here you have a map of how it sits across the border between France and Switzerland, near Geneva. Starting from ionized gas, particles are injected into the accelerator complex, passing through the whole history of CERN: first the booster, then the Proton Synchrotron, then the Super Proton Synchrotron, and at that point they get into the LHC, where we collide the particles and take the data. There is also a small one that you maybe cannot see from there, the antiproton decelerator, just next to where I work. If you want to know what it does: it is basically the antimatter factory.

So the main mission of CERN is fundamental research in particle physics, and what we do in the IT department is provide resources to the physicists to do their job, basically. The cloud service offers self-service infrastructure to the whole laboratory. It has been in production since July 2013, and since then we have been doing in-place upgrades of the service, trying to keep the APIs as available as possible. The control plane and all the hypervisors we run are on CentOS 7, distributed across two data centers, one in Geneva and one in Budapest, about 20 milliseconds away from each other. The 9,000 servers we run are split into more than 70 cells in a highly scalable architecture, because we offer a single region to our end users to simplify their use cases. And as I mentioned just before, we are currently running the Rocky release; we upgraded just before coming to the Summit. These are the stats of the cloud as of the Monday before last. We have 300,000 cores available in the service, and we are using slightly more than that because we over-commit a bit. We have 36,000 VMs running on 9,000 hypervisors.
But as you can see over there, there are more statistics. We also have Kubernetes clusters, Magnum clusters, bare metal nodes, file shares and volumes; we offer all of these resources to our users. The other thing I would like to mention is the create/delete figure over there: the ratio of creation and deletion of machines at any moment in time. It is quite high, basically for two reasons. One is that our users are embracing the cloud model: the experiments create machines, do their job, and kill them afterwards, in a more cloud-native way. The other is that a fraction of it is us, probing the infrastructure to make sure that it works and that users can benefit from the cloud service.

OK, I split this one in two anyway. When we started the CERN cloud service, we started only with Nova, Glance, Keystone and Horizon. Since then we have been adding more building blocks to the service: we now have a networking component, a container orchestration component, and a secrets management component. And it is not only about adding services; we are also improving the offering in other areas. We now offer physical nodes as well as VMs, and in storage we offer block devices as well as file shares. This layer is the set of building blocks on which our users create their applications. But we didn't stop there; we went even further with what we call the IaaS-plus layer. Taking advantage of this underlying layer, we ourselves offer more complex scenarios, as we do with Magnum: with a single API entry point you create a cluster that, behind the scenes, is built from the underlying resources.

On the automation side, here I'm zooming out to the architecture we have at CERN. The automation we do is covered by two different components: Mistral and Rundeck. Here I'm only showing the integration with part of the business logic at CERN: resource lifecycle management, which is covered by Mistral, and on the other side, service and host monitoring, which is covered by Rundeck. I'm going to detail them a bit later.

If we go back in time, before we even had the cloud, the computing requirements for the LHC were increasing. Right now we are at the end of Run 2 and about to start Run 3. The data requirements grow exponentially as we get to the last phase, the High-Luminosity LHC, and the red bar is what we can afford with the budget we have, against the CPU resources we expect to need. At the same time, to make it even harder, we know that we have a fixed-size team to manage those servers. So we need to manage even more servers and be more efficient, to scale the infrastructure out to those levels. That is why, when we started with the cloud service, automation was an enabler for us to be able to scale and manage the infrastructure. It was considered early on, and we automate as much as possible, because otherwise we would not be able to manage the cloud. The situation hasn't changed much, except that now we have a 300,000-core cloud and we are still growing; you can see here the number of cores utilized over time. But we are not only scaling out the cloud.
We are continuously adding more services, improving the existing ones, offering more features, fixing things, and providing more added value to our users. One thing that stays the same is the number of people: the team stays the same size, and that doesn't change. So automation went from an enabler to being key; now we have to automate. I'm putting three examples there. First, if you have an issue with a specific component and you code the workaround and the fix for it, you do two things in one: you keep the knowledge in the team, so the issue is documented for the next person coming to do the job in the cloud team, and you prevent it from happening again. Second, there are tasks that are tedious or repetitive, or that can be handled by other teams. For example, if a disk has an error, you can automate the whole task and leave only the physical part: we have a dedicated team at CERN that replaces disks, CPUs, and motherboards, so the entire procedure of draining the node up to that moment can be offloaded, and that frees you up for other work. And the last point is that we want to empower our users to manage things themselves, so they can do as much as possible on their own, we can offload the work that would otherwise land on our side, and we can focus on the things that interest us first and benefit the service second.

So the status of automation at CERN right now covers these four areas. We have host and service monitoring, which basically takes care of the alarms: the hardware events and the service alarms that we have in the service. We have the integration with resource lifecycle management, because every resource we have at CERN needs to be tracked and properly handled. We focus on optimizing resource availability, keeping enough free resources for our users to deploy their applications. And we want to improve the availability of VMs: yes, everyone knows that we shouldn't put pets on hypervisors, but for some reason there are still users who want to do so, so we need to improve availability and performance there.

If we go to the service monitoring: we collect all the hardware events with collectd, and we collect the service logs through Flume. All of them get piped into what we call the CERN general notification infrastructure, which behind the scenes is running Kafka. That service then generates tickets when, for example, an alarm trips. We also have alarms for the services in Grafana. All these alarms and tickets are managed by the tool I mentioned before, Rundeck. This is quite important and crucial for us, because we can run different types of jobs there that do specific tasks: scheduled jobs, or event-based jobs that target specific events and fix the issue when it happens. We also offload work to other teams, as I will show later: we hand some of the more administrative tasks to teams that don't need to know the internals, so they can run the action they need themselves. And we use this tool to schedule interventions in the future.
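To make the event-based side concrete, here is a minimal sketch of how an alarm consumer could launch a fix-it job through Rundeck's REST API. The Rundeck URL, token, and job ID are placeholders, not our actual setup.

```python
# Hypothetical sketch: react to a hardware alarm by launching a Rundeck job.
import json
import urllib.request

RUNDECK_URL = "https://rundeck.example.org"   # placeholder endpoint
API_TOKEN = "XXXX"                            # placeholder auth token
DISK_DRAIN_JOB_ID = "aaaa-bbbb-cccc"          # placeholder job UUID

def trigger_drain_job(hostname: str, disk: str) -> dict:
    """Run the 'drain host and schedule disk replacement' job for one host."""
    req = urllib.request.Request(
        f"{RUNDECK_URL}/api/18/job/{DISK_DRAIN_JOB_ID}/run",
        data=json.dumps({"options": {"host": hostname, "disk": disk}}).encode(),
        headers={
            "X-Rundeck-Auth-Token": API_TOKEN,
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)   # execution descriptor (id, status, ...)

# e.g. called by a consumer of the alarm stream:
# trigger_drain_job("hv1234.example.org", "/dev/sdc")
```

An event-based job wired up this way is what lets the fix run within seconds of the alarm, instead of waiting for a human.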
So then, for example, if a disk needs to be replaced, we can schedule the intervention in advance and notify the user. All the machinery we need to put in place to send the notifications and do the interventions can be automated, and this is what we are doing. So Rundeck is quite crucial for us, and we use it heavily for delegating tasks. We rely on Rundeck for offloading tasks to these four teams. For example, the procurement team is the team responsible for adding servers to the cloud and removing them. That's the person who arrives and says: OK, I have 200 servers that I'm going to install, I need to hand them over to you. What we have in Rundeck is basically a task that they can run themselves, and it will put those servers into production, ready to be used by us. The repair team has more or less the same kind of access. It is used, for example, in the disk replacement example I put below. An alarm produced by collectd gets piped through, a ticket arrives in ServiceNow, it is picked up by the repair team, and they start the intervention to replace the disk on the server. But it's not only sending a notification to the user that something is going to happen in some period of time; it is also draining the host and ensuring the machine is stopped by the time the technician wants to replace the disk, as sketched after this part. That increases their efficiency in changing components.

Another key actor using Rundeck is the resource coordinator. This person handles all the resources we have in the data center, so every project request and every quota change passes through this person, who approves or denies it. After the approval, the process of applying the quotas is completely automated; basically, I haven't created any project by hand for a while now. We also use the service for common tasks: if we have a known error and a workaround for it, we keep it there in the list of tasks that we can run in case it recurs.
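Going back to the repair-team example, this is roughly what the automated drain step could look like. A minimal sketch assuming openstacksdk; the cloud name and hostname are placeholders, and the calls are illustrative rather than our production code.

```python
# Hypothetical drain step for a hypervisor before a disk replacement.
import openstack

def drain_hypervisor(conn, host, reason="scheduled disk replacement"):
    """Stop scheduling onto a host and live-migrate its VMs away."""
    # Disable the nova-compute service so the scheduler skips this host.
    svc = next(s for s in conn.compute.services()
               if s.host == host and s.binary == "nova-compute")
    conn.compute.disable_service(svc, disabled_reason=reason)
    # Move the running VMs elsewhere; the scheduler picks the targets.
    for server in conn.compute.servers(all_projects=True, host=host):
        conn.compute.live_migrate_server(server, block_migration="auto")

conn = openstack.connect(cloud="cern")        # placeholder cloud name
drain_hypervisor(conn, "hv1234.example.org")  # placeholder hostname
```

Once everything has been migrated off, the host can be powered down on schedule and the technician just swaps the disk.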
Now I'm jumping to the other side of the graph. All the resources at CERN are tracked. As the unit of ownership in OpenStack is the project, the contract we have is, first, that there is a lifecycle for the project, and second, that each project has an owner, so we can always track a person who is responsible for all the resources that are running. This person is the owner of the project, which is translated into a role in Keystone. Then there are two types of projects: personal and shared. When you subscribe, you automatically get a personal project with a small quota out of the box; personal projects are for test and development. If you need more resources, or you want to run your services or your applications, what you create is a shared project. The only difference between the two, because in OpenStack terms they are the same, is what happens when the user leaves the organization. As we cannot afford to lose resources, we trigger some actions, and these are basically the actions that are run. When the affiliation expires, when the user no longer has any contract with CERN, whatever was supposed to be production gets promoted, so the resources and the applications are not lost.

In the personal area, we do nothing at first; we wait until the user account gets disabled. At that moment, as with every other resource at CERN, we block access, and we stop the VMs and all the other resources. So in the rare event that a user is running a production workload in a personal project, we can still catch it there, because we have at least some time between the two events. And then, when the user is deleted, we clean everything up. All of this is done through Mistral workflows, in which all the plumbing between the dependencies of the different services is managed. What it looks like behind the scenes is a set of workbooks in Mistral that are interconnected: we have project creation, retrieval, update, and deletion, and we also have the service-oriented parts. If we take the example of project deletion, this is what it looks like: it first checks that the project exists, and then passes to the service resource deletion, where we have a graph like this one. It works through everything at the top, going down into the lower layers, until everything is cleaned up, and once it's cleaned up, it can delete the project itself.

For the end user, what we have is basically a button in Horizon. I'm showing the example of project creation here, because project deletion is just "do you want to delete the project? yes", and that's it. For project creation, what shows up is a form with all the data the resource coordinator needs to assess whether you should be running these resources, and based on that he decides to approve or deny the operation. Once you click Accept, it creates a ticket in ServiceNow assigned to the resource coordinator. We also put in a link for the approval, and if he approves the project, that triggers the project creation through Rundeck, because the interface between the ticketing system, ServiceNow, and the automation is done through Rundeck. That in turn triggers the creation in Mistral.

Jumping to the third point: we need to maximize the resources we run at CERN, and we cannot afford to run resources that are idle and not doing anything. So we recently added an expiration to the VMs. Each VM that runs in a personal project now has an expiration, set shortly after creation and evaluated daily. It is configured to 180 days and is renewable as many times as the user wants. This is implemented in Mistral. As an example, I show here what happens with an active VM: I extend it for some time, then I don't want it anymore, so I let it expire. When it expires, the machine is stopped and locked. Again, the same policy: we leave a grace period to identify whether the machine was still useful, and after that the machine is purged. Do you know what happened when we enabled this in production six months ago? OK, I will tell you that later; I have a slide for it. We recovered 3,000 cores that were idling, so basically we can use those resources for the future. How is this implemented internally? Every instance has an attribute that we call the expiration attribute, holding a date, and to know which projects are candidates for expiration, we add a tag to the project. It is implemented in three nested workflows: the first one is the global one that runs at midnight; it triggers one project-expiration workflow per candidate project, which in turn triggers one instance-expiration workflow per instance. That one checks that the attribute is fine, fixes it if the format is not correct or if the user has tried to modify it, and then, depending on the state, sends a reminder, expires the VM, or deletes it. And all of this is in a single workbook in Mistral.
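As a rough illustration of that daily pass, here is a minimal sketch assuming openstacksdk, with hypothetical names: a project tag "vm-expiration" marking the candidates and an instance metadata key "expire_at" holding the date. The real implementation is the nested Mistral workbook just described, not this script.

```python
# A rough sketch of the daily expiration pass, assuming openstacksdk.
# Tag and metadata key names are hypothetical.
from datetime import date, timedelta
import openstack

GRACE = timedelta(days=30)              # assumed grace period length

conn = openstack.connect(cloud="cern")  # placeholder cloud name

for project in conn.identity.projects():
    if "vm-expiration" not in (project.tags or []):
        continue                        # project is not a candidate
    for server in conn.compute.servers(all_projects=True,
                                       project_id=project.id):
        raw = server.metadata.get("expire_at")
        if raw is None:
            continue                    # not stamped yet
        expire_at = date.fromisoformat(raw)
        if date.today() < expire_at:
            continue                    # still within its lease
        if date.today() < expire_at + GRACE:
            conn.compute.stop_server(server)    # expired: stop and lock
            conn.compute.lock_server(server)
        else:
            conn.compute.delete_server(server)  # grace over: purge
```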
Beyond that, we want to get more performance. We have use cases at CERN where the model of compute nodes and storage nodes connected over the network doesn't fit, because they require more disk capacity than we have in the compute nodes, and they also require low IO latency. So what we are preparing is a hyper-converged setup in which we mix the compute and storage nodes, with a local Ceph pool covering both the ephemeral storage and the volume access. What it gives us is simplified management: in case of a hardware event, something happening on the hypervisor, we can evacuate the machines, and we can do it right away with live migration. And then we can put on it the database and storage people, who are aiming for more storage than what we offer right now, and also low IO latency.

If we look at the future and what we are going to do in the CERN cloud, the next steps on automation: we are continuously adding services, so the first task over there is basically to improve the way we add services into the cloud, to make it more transparent and easier for us to add more. We want to empower our users with knowledge of the infrastructure issues we have behind the scenes, so they can do automation on top of that; for that, we are investigating adding the root cause analysis project behind it, so we can tag the instances whose server, for example, is having an issue. We are looking into Kubernetes jobs to add to the offering, basically moving all the stuff we have in Rundeck there. And we want to double down on performance and availability.

I will give the example of adding a new service. We are looking into offering the S3 endpoint through the RADOS Gateway. Our colleagues in the storage team have a RADOS Gateway that allows us to pipe it in and connect it to the cloud, so you directly get it on the APIs. And what I can tell you is that the workbooks are prepared to be extended: they have hooks, places where you can add services, to make it simple to plug services into the system. To do that for this particular use case, we are using the admin operations API of the RADOS Gateway. Basically, we need two libraries. One is the Python radosgw-admin library; our colleagues from SWITCH maintain it, so basically we only needed to prepare a package for it. Then we created a wrapper to expose these operations in Mistral; that is the other one, the Python Mistral RADOS Gateway actions. And as the last step, we modify the workflows. This is how it looks in the upcoming version to disable a user; disabling a user in RADOS means disabling the access of a project to all the buckets stored there. With this small snippet that goes into the workflow, we can disable the resources properly and then clean them up later.
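For a feel of what such a wrapper action looks like, here is a minimal sketch built on mistral_lib. The admin endpoint, the credentials, and the exact client calls of the SWITCH library are assumptions for illustration, not the actual wrapper code.

```python
# Illustrative Mistral custom action wrapping the RADOS Gateway admin ops API.
from mistral_lib import actions
from radosgw.connection import RadosGWAdminConnection  # assumed import path


class SuspendRadosgwUser(actions.Action):
    """Suspend a RADOS GW user, cutting the project's access to its buckets."""

    def __init__(self, uid):
        self.uid = uid

    def run(self, context):
        # Assumed client API; check the python-radosgw-admin docs.
        conn = RadosGWAdminConnection(
            host="s3-admin.example.org",    # placeholder admin endpoint
            access_key="ADMIN_ACCESS_KEY",  # placeholder credentials
            secret_key="ADMIN_SECRET_KEY")
        return conn.update_user(self.uid, suspended=True)
```

Registered through the usual `mistral.actions` entry point, an action like this becomes callable from a workflow task, which is what the snippet on the slide hooks into the deletion workbook.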
We want to empower our users with knowledge of what is happening behind the scenes in the infrastructure. In order to do so, we would integrate the root cause analysis project, Vitrage. We have had several cases in the past where a CPU issue, or a degradation after the installation of a package, started to affect some services, and you don't get the knowledge of which services are affected by that particular event. This is something Vitrage provides, because it gives you alarms with a scope: something is broken in this server, and it is affecting all these services, all these VMs running on top. So we can easily get the whole list of impacted services without digging through the databases. In certain cases, if a user didn't provision his setup properly, it also allows us to find hidden service dependencies, to find out that he is running two VMs on the same host, probably against his will, and that is something we can tell him. And to close the loop and make it self-healing, it allows us to trigger automatic resolutions: for certain cases we can code the workflow in Mistral and provide a healing operation for the end users. So if we come back to the issue with the disk: collectd generates the alarm, Vitrage picks it up and can notify all the nodes, all the VMs running on that server. Imagine it is a Kubernetes cluster and you have a minion there: it can notify Kubernetes, and you can evacuate the workload aside, so the intervention becomes transparent for you.

We are looking into Kubernetes jobs. We have two steps here. Basically, we are moving the control plane to Kubernetes; we are not there yet, but we are moving, and this is based on Helm charts. All the healing operations can then be codified as jobs in this cluster. That is one side. On the other side, all the tasks we are running right now in Rundeck can be dockerized, so we can put them in Docker and execute them easily; there is a sketch of such a job right after this part. As a bonus point, Rundeck now interfaces with Kubernetes, so it's a perfect match, and we are moving the tasks we have in Rundeck into Kubernetes.

For the hyper-converged server setup I mentioned before: if you follow the guidelines on how to deploy these services, you need to leave a fixed CPU allocation to ensure that you have enough IO. Because if it happens, like on this one, that the node is fully committed on CPU, it may not have enough spare cycles to cover the IO operations being done there. What we want to do, and this is what we were evaluating during the summer, is to dynamically adjust the usage and move things around. On this setup we can do live migration, so we can automatically move things around and keep enough free resources for IO, without fixing the allocation the way the guidelines suggest and without impacting compute. For that, we were using Watcher. And to go even further, we can use the empty spaces we have in the hypervisors. We cannot afford to have idle resources, so we can fill this empty space with spot instances, with preemptibles, and fill up the hypervisor's memory. Then, if a user creates a normal VM and this hypervisor is selected, it frees up space and the machine is placed there. That way we get better utilization of the machines.
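The job sketch referenced above: a minimal example of launching one of the dockerized Rundeck-style tasks as a one-off Kubernetes Job with the official Python client. The image name, namespace, and task name are placeholders.

```python
# Hypothetical healing task run as a one-off Kubernetes Job.
from kubernetes import client, config

def run_healing_job(task: str,
                    image: str = "registry.example.org/cloud-tasks"):
    """Run a dockerized healing task once, as a Kubernetes Job."""
    config.load_kube_config()  # or load_incluster_config() inside the cluster
    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(generate_name=f"{task}-"),
        spec=client.V1JobSpec(
            backoff_limit=2,                 # retry the task twice at most
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    containers=[client.V1Container(
                        name=task,
                        image=image,
                        args=[task])]))))
    return client.BatchV1Api().create_namespaced_job(
        namespace="cloud-automation", body=job)

# e.g. run_healing_job("reset-nova-compute")
```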
So if you want to know more about preemptible instances and spot instances, that is the advert part, you can go to another talk that's happening on Thursday in Hall A3, by my colleagues from the Cloud team. And for this particular case, for the scheduling, we can again use Watcher to get the CPU load and spawn spot instances of a certain kind.

So I was showing a lot of graphs and a lot of diagrams; you can believe me or not, but I will show you the code, which is probably what you want. If you go to this repo, we have all the code that we are using at CERN. The first line is basically the downstream patches of all the projects that we run, in this link. We have the workflows I was showing you for Mistral: doing expiration, doing project lifecycle management, and some doing specific actions that we offer to our end users to simplify their life. For example, we have one for instance snapshotting: the user doesn't need to know whether the instance is booted from a volume or not; there is a button, he clicks, and that's it. All of these are on the Mistral workflow side. We also have the code for the RADOS Gateway actions and the RADOS Gateway admin libraries that we use for integrating the RADOS Gateway. We have the request panel, the panel in Horizon that sits in front and has the buttons for interacting with the creation and quota requests and so on. And that's all for me. Thank you. Do you have any questions?

Doesn't work? We can change. Yes. Thank you. Any particular reason to use both Rundeck and Mistral?

Any reason to use both Rundeck and Mistral? So basically, it's historical: when we started, there was no Mistral. But it's pretty convenient. The only thing I'm missing in Mistral is the change of scope. In order to delegate those operations, many of them are admin-like operations, and the change of scope in the role-based system in Mistral is not well done, so I would need to give people admin access on the cloud for them to achieve the same operations. So basically, the main reason is historical; it was the only thing available at the time, and it has been working well for us.

Doesn't work? No. So I was going to ask: how many clusters do you have supporting these 9,000 compute nodes? Is it a single control plane supporting all these compute nodes, or do you have multiple clusters? And what would be the maximum cluster size in your case?

You mean the number of cells, or the number of compute nodes per cell? There is a talk next, from one of my colleagues who is just there, that will tell you those numbers. The thing is, we are running 70 cells, so we have at least 70 control planes, one for each cell, plus another one on top. But if you want to know the sizes, you'd better ask him in the other talk.

Can I ask a quick question about the hyper-converged nodes you are planning to deploy? I didn't get exactly what storage you are planning to deploy.

So the hyper-converged setup is basically a set of machines that have more disks attached.
It's machines where we are going to use two disks for the server itself, for operations, for the system, let's say the system drive, and the remaining disks are going to be OSDs in a Ceph pool. We are configuring the cluster so that the MDS and Mon daemons are going to be running in the same rack, so you are not going out of that rack for any Ceph or volume operation, and you get pretty decent, low latency. OK. And I didn't mention it, but this Ceph cluster is going to be SSD-only, so it's going to be even faster. OK. So, no more questions. Thank you.