So, hello everyone. My name is Belmiro, I'm from CERN, and with this talk I'd like to show you how we are running OpenStack at CERN. As you might know, CERN is the European Organization for Nuclear Research. It was created in the 1950s and is the biggest international scientific collaboration. The lab is in Europe, located on the border between France and Switzerland, very close to Geneva. CERN's mission is to do fundamental research, basically to look into difficult and fundamental problems, which led, for example, to the discovery of the Higgs boson in 2012.

For all this fundamental research, CERN provides different facilities to scientists. For example, particle accelerators like the one in the picture: this is the Large Hadron Collider, and it's the biggest machine in the world. It is a ring 27 kilometers in circumference. It crosses two countries and it is 100 meters underground. It accelerates two particle beams that travel close to the speed of light, and they collide at four different points where we have detectors to record these collisions. Detectors like this one: this is CMS, and these machines are huge. They can be up to 45 meters long and 25 meters in diameter, and they can weigh more than 12,000 tons. And all of this is 100 meters underground. Inside these detectors, particles of the two beams collide, generating even more particles. A detector is basically a digital camera that takes 100-megapixel pictures, but 40 million times per second. This generates a lot, a lot of data. With those pictures, we can reconstruct a representation of each collision event. The analysis of this gives scientists clues about how particles interact and about the fundamental laws of nature.

To process all this data and to support the research of scientists all around the world, there are two data centers. One is in Geneva and the other is 22 milliseconds away, in Budapest. And we are now running OpenStack in both of them. So, how big is the CERN cloud infrastructure? Today we have 5,000 compute nodes, which is roughly 130,000 cores, and on them we are running 16,000 virtual machines. Compared to the number of cores, that's not a lot of virtual machines, because we have very large virtual machines for data analysis. And unfortunately, our users don't have the illusion of unlimited resources like in public clouds: for new VMs to be created, others need to be deleted. You can see that in the histogram that we have there. In green we have the creation rate in our cloud, and in red the deletion rate. You can see that they basically match, so we are almost full.

A little bit of history about the OpenStack cloud infrastructure at CERN. We started with OpenStack in 2011. By 2012, we had our first test infrastructure. It was based on Essex, and we had only 500 cores at the time. We opened it to only a few users at CERN to test functionality. Then we had two more iterations where basically we destroyed the old cloud and set up a new cloud based on everything that we had learned, adding more functionality and more capacity. By March 2013, we had our last test infrastructure. It was completely integrated with the CERN infrastructure, network, Active Directory, and at the time it had 14,000 cores. What we did next, basically, was to delete everything again and set up a new cloud. This was in July 2013, and we tagged it as production. And what does this mean?
This means that it has been open since then to all CERN users. It was ready to run all IT core services. We don't do destructive upgrades anymore, like in the past where we destroyed everything. And it's ready to run VMs for data analysis. If you want to know more about our first production infrastructure in 2013, at the Hong Kong summit I gave a talk describing how awesome that infrastructure was; you can watch the video. However, during the last two years a lot of things changed. And they changed because we started running more projects: we are now also running Heat and Rally. We increased the capacity several times; at that time we had only 21,000 cores. And we learned a lot about how to manage OpenStack at scale. Because of that, we have been changing the architecture of our deployment, and that is what we're going to see today.

This is the evolution of the number of VMs running in our cloud during the last two years, since we tagged it as production. You can see that we went from zero to 16,000 VMs. Well, the last month that I have there is September, so now we have more than 16,000. The first plot is the active running VMs, and the second one is the total number of VMs created over time. You can see that in September we reached VM number two million created. That, for us, was quite an achievement.

Okay. So, let's start talking about the infrastructure. This is only a brief overview. We have one region, two data centers, and 26 cells now. In Nova, we only have an HA architecture in the top cell, and by HA I mean services running active-active and a RabbitMQ cluster with mirrored queues. The children cells' control planes are VMs running in the normal infrastructure; we don't have a second cloud to run the control plane. We still use Nova Network. We run two different hypervisors, KVM and Hyper-V; three operating systems, SLC6, CC7, and Windows; and two Ceph instances, one in Geneva and the other in Budapest. Keystone is completely integrated with the CERN infrastructure. We run all these OpenStack projects now, and we deploy everything using the upstream Puppet modules and the RPMs from the RDO community.

Okay. So, let's go through all of this. What we have here is a representation of the CERN cloud architecture. The big squares represent the two data centers that we have. In Geneva, we have a Ceph and a DB infrastructure. We also run all the OpenStack services in Geneva: Keystone, Glance, Cinder, and so on. The load balancers are also in Geneva, plus the Nova top cell, and then a bunch of Nova compute cells. In Budapest, what we have is a Ceph and a DB infrastructure, and we only run Nova compute cells there. Only that.

So, why are we running cells? When we started in 2013, running cells was a huge challenge. At that time, I only knew of two other sites running cells: Rackspace and NeCTAR. But we knew that to move all the servers into the OpenStack infrastructure, we needed to partition the cloud somehow, and at that time we selected cells. Also because we wanted to offer only one endpoint to our users: cells are completely transparent to users. Also, if something happens to one of the cells, only a small part of the infrastructure is affected, which means we have a separation of failure domains. And we have different hardware for different use cases; cells allow us to keep them completely isolated, and even to have a different configuration per cell, like a different scheduling policy inside the different cells. However, if we are using cells, we also lose some functionality.
For example, security groups do not work with cells, and aggregates and availability zones are a little bit tricky; I'm going to show you later how we set up availability zones. The cell scheduler is limited, and Ceilometer integration is a little bit tricky. These are only some examples.

Okay, so now I will talk about the architecture of some of the OpenStack projects that we have running at CERN. Let's start with Nova. Basically, this is our architecture. Let's start with the API nodes: we have the API nodes that run nova-api, and we are running a few of them. Then we have the top cell controller that runs nova-cells and RabbitMQ; in this case, RabbitMQ is clustered with mirrored queues. Then we have the child cell controller. We only have one per cell, only one. They run all the services that you would expect: nova-api for the metadata, but also for Ceilometer (I'm going to explain this later), nova-scheduler, nova-conductor, nova-network, and nova-cells. Of course, we also need to have a RabbitMQ and a database there. Then we have a bunch of compute nodes that connect to this child cell controller. And we repeat this child cell configuration over and over again; we have 26 cells now. It's a very simple architecture for Nova.

The top cell controller runs on physical machines. All the services there run active-active, and Rabbit is clustered with mirrored queues. The nova-api nodes are VMs; they run in exactly the same infrastructure as all the user VMs. We don't have a separate cloud. Of course, to bootstrap all of this, we needed to have at the beginning at least one nova-api running on a physical machine. For the children cell controllers, we only have one controller per cell, so this means we don't have HA at all. So what happens when the cell controller dies? We simply replace it. If you watched the talk from 2013, at that time I said that we had at least three controllers per cell. However, when you are adding more and more cells into your infrastructure, this is really hard to manage. In fact, at that time we had more problems because of that HA setup than we have now; they were mostly related to network partitions. Now we have this architecture with only one cell controller. Then we have 200, which is our new magic number: we have a maximum of 200 compute nodes per cell. This also means that a cell is only a few nodes, so if one cell goes down, only a small part of the infrastructure is affected.

So how do we schedule VMs across all these cells? We have 26 cells, and not all of them are the same, because they have different hardware types, different locations, network configurations, hypervisor types. So we expose all these characteristics to the cell scheduler using cell capabilities. To use them, we developed a set of cell scheduler filters that exploit these capabilities. All of this is available on GitHub; if you're interested, you can look at the code. Let's see some examples now.

Some cells are dedicated to specific projects, so we need to manage which projects can have access to them. For that, we consider two different sets of cells, what we call the default cells and the dedicated cells. So how does this work? Let's see an example. In nova.conf we define the default cells, in this case cells A, B, C, and D, and the dedicated cells; in this example, cell E is dedicated to project UID1 and also project UID2, and cell F can only run project UID3. So if I'm using project UID1, all my requests to create a new VM will go to cell E.
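As a rough illustration of that scheduling logic, here is a minimal sketch. The cell names, project UUIDs, and option names are hypothetical, and this is not our actual filter code (which is on GitHub and plugs into the Nova cells scheduler); it only shows the idea of default versus dedicated cells.

```python
# Minimal sketch only: hypothetical names, not the actual CERN cell filters.
# The real filters plug into the Nova cells scheduler; this just shows the idea.

# What we would express in nova.conf on the top cell:
DEFAULT_CELLS = ["cellA", "cellB", "cellC", "cellD"]
DEDICATED_CELLS = {
    "cellE": {"project-uid-1", "project-uid-2"},
    "cellF": {"project-uid-3"},
}

def eligible_cells(project_id, candidate_cells):
    """Return the cells where this project is allowed to boot new VMs."""
    # Projects listed for a dedicated cell are pinned to their dedicated cells.
    dedicated = [c for c in candidate_cells
                 if project_id in DEDICATED_CELLS.get(c, set())]
    if dedicated:
        return dedicated
    # Everyone else only sees the default cells.
    return [c for c in candidate_cells if c in DEFAULT_CELLS]

all_cells = DEFAULT_CELLS + list(DEDICATED_CELLS)
print(eligible_cells("project-uid-1", all_cells))  # ['cellE']
print(eligible_cells("project-uid-4", all_cells))  # ['cellA', 'cellB', 'cellC', 'cellD']
```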
However, if I'm running with project UID4, which is not in this dedicated list, my request will go to one of the default cells, in this case A, B, C, or D. This is quite handy to have. It also solves one of the problems that we had in the past with cells, which was disabling a cell. Disabling a cell means, in this case, removing it from the scheduler but continuing to do operations on it, like restarting a VM or deleting a VM, so it's completely transparent to the user; the users are just not able to boot new VMs in that cell. For us, disabling a cell is now basically removing it from the list. The filter that we use here is also available on GitHub.

In our infrastructure, we don't expose cells to the users. What we expose is availability zones, and we have three availability zones in each data center. Each cell is in only one availability zone, but an availability zone can contain multiple cells. However, with the current cells implementation, it's not really straightforward to configure availability zones, because the aggregates are not propagated to the children cells. So how do we do it? First, what we do basically is to create everything in the top cell. We create the aggregates with the availability zone metadata in the top cell. Then we add all the nova-compute nodes to them in the top cell; we do a direct DB operation for this, adding all the nodes that we want. The API will not work for this, because the service is not there. Then we need to create at least one fake service per availability zone, and it needs to be up, because otherwise Nova will see these availability zones as disabled. And basically, that's it. Then we use a cell scheduler filter to send each request to the right cell. We don't use aggregates at all in the children cells now.

In 2013, we set up our first two cells, one in Geneva and the other in Budapest. And for some time, we could not add more cells to our infrastructure, because of two problems. One was that the cell scheduler was very limited at the time, and the selection of cells was random: when I created a VM, we could not control whether it went to cell A or cell B. It was not possible to control that. The other problem was that we didn't yet have a solution to expose availability zones to users across different cells. So we continued to add more and more compute nodes into the existing cells, and in Geneva we ended up with one cell with more than 1,000 nodes. When you have such a huge cell, if something happens to it, it has a huge impact on the infrastructure. Also, we had all the availability zones behind that cell, all the IT core services were running behind it, we had dedicated hardware for specific projects in that cell, multiple hardware types, and we were running KVM and Hyper-V in the same cell. So this was, as you can imagine, really, really hard to manage.

Unfortunately, in Nova you cannot live migrate VMs between different cells; otherwise this would have been easier, because we would simply create the new cells and live migrate the instances. That is not possible. And to be fair, even if it were possible, our network model would not allow it. So the solution that we found was basically to divide this huge cell into smaller cells, considering all the different characteristics that we have inside it. Basically, we divided this huge cell into nine new cells. Tomorrow, I will talk in one of the operator sessions, explaining how we did it in some detail. This is what I have here today.
It's only a brief summary. First of all, if you want to create new cells, you need to identify which compute nodes should go to each of the new cells. Then you need to create the new cell controllers for the new cells. At this point, we stopped allowing updates to the current DB. Then what we did was to copy this database to all the new cells. So I have a copy of my original database, and then we deleted everything that we didn't care about for the new cell: all the instances that shouldn't belong to that cell, all the compute nodes that shouldn't belong there, all the network information. As you can see, this was a quite risky operation to do. Then we needed to go to the top cell and change all the routing paths there to point to the new cells. In the end, we were quite successful doing this. We divided that huge cell into nine new cells, now respecting the availability zones, and machines that were dedicated to some projects now have a dedicated cell. So if you are interested in this, come tomorrow to one of the operator sessions.

Live migration. Until now, we have not been using live migration in our daily operations. However, now we are faced with two different use cases where we really need it. One is the upgrade from SLC6 to CC7: we still have 800 compute nodes running SLC6 (SLC6 is Scientific Linux CERN 6). We also have a large pool of hardware that is at end of life and needs to be retired. This will involve migrating thousands of VMs. What we really miss is a platform that can orchestrate this for us, because at the moment this is really a manual process and it takes a lot of time: we don't want to migrate all the VMs that are on one compute node at the same time, and we don't want to saturate a particular network segment, for example. Right now it takes a lot of time. Also, there is the problem of VMs that have volumes attached. What we have is block live migration, so everything is copied over the network, and VMs that have volumes attached cannot be block live migrated, because otherwise the block device would be copied into itself, and that can cause data corruption. We are still looking at how we are going to do this; we don't have a solution yet.

Upgrade to Kilo. Kilo dropped support for Python 2.6, and as I said, we still have 800 compute nodes running SLC6. So we needed to build a new RPM to support this use case, since RDO doesn't consider this kind of scenario. What we are using is an original recipe from GoDaddy, so thank you guys for sharing. The idea basically is to create a virtual environment with Anvil, using Python 2.7 from Software Collections. And for now it's working great.

We are still using Nova Network. Like every deployment, at CERN we have a particular network configuration. Very briefly, the network is divided into several of what we call network clusters, which contain several of what we call IP services. Each compute node is associated with a network cluster, and the VMs running on that compute node can only have an IP from the network cluster associated with that compute node. So this is our model. If you're interested, there is an etherpad that describes our network model, but also the network models of other clouds from different companies; it's a very good starting point. In order to use OpenStack with our network model, we needed to develop a driver for Nova Network.
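To give a rough idea of the constraint this model imposes, here is a minimal sketch. The cluster, IP service, and host names are purely hypothetical; this is not the actual driver code (that is on GitHub), it only shows the mapping the driver has to respect.

```python
# Minimal sketch only: hypothetical names, not the actual Nova Network driver.
# Each network cluster groups a set of IP services (address ranges); each
# compute node belongs to exactly one network cluster.
NETWORK_CLUSTERS = {
    "cluster-geneva-1": {"ip-service-a", "ip-service-b"},
    "cluster-geneva-2": {"ip-service-c"},
}
COMPUTE_NODE_CLUSTER = {
    "compute-001": "cluster-geneva-1",
    "compute-002": "cluster-geneva-2",
}

def allowed_ip_services(compute_node):
    """IP services a VM booted on this compute node can take an address from."""
    return NETWORK_CLUSTERS[COMPUTE_NODE_CLUSTER[compute_node]]

def valid_target_hosts(vm_ip_service, hosts):
    """Hosts a VM keeping this IP service can be resized or migrated to."""
    return [h for h in hosts if vm_ip_service in allowed_ip_services(h)]

# A VM with an address from ip-service-a can only move within cluster-geneva-1.
print(valid_target_hosts("ip-service-a", ["compute-001", "compute-002"]))
# -> ['compute-001']
```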
And as the sketch shows, the model is not only considered when we create a new VM. For example, when you resize a VM or live migrate it, you also need to consider the network topology, because your VM cannot be migrated or resized to just any compute node in the cloud; there is a subset of nodes that the VM can be resized or migrated to. So if you're interested, check our network model; it could also be a good example for your infrastructure. The code, again, is available on GitHub.

Neutron is coming. We have been trying Neutron for a long time, and we are planning the migration for the next months. Our plan is to set up a new cell with Neutron to gain experience managing it at large scale before migrating the existing cells. We will not offer new functionality to our users, at least in the initial phase; we expect to offer exactly the same functionality that we have with Nova Network now, so we will not expose any networking API to them. Of course, we needed to extend Neutron to support our network model. Again, all the code is available on GitHub.

Okay, so let's move on to a different project, Keystone. At CERN, we have two different Keystone infrastructures, one that we expose to users and another that is only used by Ceilometer. Ceilometer does a lot of API calls, so we decided to have a separate Keystone infrastructure dedicated only to Ceilometer. We still use UUID tokens, and Ceilometer was generating a lot of load on Keystone, so we decided to separate the traffic so it would not affect other users. This was in the beginning, when we were gaining experience with Ceilometer, but we are still keeping this architecture. The architecture of our Keystone deployment is very simple: we have a load balancer on top, and then we have a number of Keystone nodes that run Keystone, connect to Active Directory, and use a database. The database is basically there to keep the tokens. The Keystone nodes are VMs. Keystone is completely integrated with the CERN Active Directory infrastructure. At CERN, we have around 200 arrivals and departures per month: collaborators from different scientific institutions, students, staff members, and all of them are potential users of our cloud. So we integrated Keystone with the CERN identity management system, and this allows us to automate the project lifecycle. When a user arrives and subscribes to the cloud service, a project is automatically created for them, with some quota allocated. When they leave CERN, they continue to have access to the resources for three months. After that, the VMs are stopped, and after six months all the resources, VMs, volumes, and images, are deleted automatically.

Moving to Glance. With Glance, as with Keystone, we have two different infrastructures: we expose one to users and the other is dedicated to Ceilometer. The Glance architecture is really simple: we have the Glance node that runs glance-api and glance-registry, and we have multiple nodes like this. In the past, we had a Glance infrastructure per cell, so each children cell had its own Glance infrastructure. We thought it was a good idea at the time, in 2013. However, when you have a lot of cells, that is very, very complicated to manage, so we removed that and now we have a centralized Glance infrastructure. For Glance storage, we use the Ceph instance in Geneva. Again, all the Glance nodes are virtual machines that run in the shared infrastructure. We don't use the Glance cache anymore.
We really like to have very light VMs that we can replace very easily. All these VMs, as in Keystone, are ephemeral, so we can add and remove nodes very easily. Also, in the past, we allowed the glance-api to talk to any glance-registry in the cluster. This looked great: everything was behind the load balancer and the glance-api could talk to any glance-registry. However, in case of problems, this was really, really difficult to debug, and when you have a lot of Glance nodes it is really difficult to manage. So now, in the architecture that I showed you, the glance-api only talks to the local glance-registry. Only that. Also, in our cloud, our users don't pay for resources; what we have is a quota system, so we allocate some quota to the project they are using. However, Glance doesn't support quotas per project, and this is a huge problem for us, because we cannot control how much data a user uploads into our storage system. It would be a very nice feature to have in Glance.

Moving to Cinder. The Cinder deployment is also very simple. We have the load balancer, and then what we call the Cinder node, which runs everything: cinder-api, cinder-volume, and cinder-scheduler. It talks to three different backends: Ceph in Geneva, Ceph in Budapest, and also NetApp. And then we have a small RabbitMQ infrastructure for Cinder. As I said, for the backends we have Ceph and NetApp. The reason we have NetApp is that we also have Hyper-V, and right now there is no Ceph driver for Hyper-V; that's why we have NetApp, to provide volumes to the VMs that run on top of Hyper-V. It would be great to have that driver. Also, we have a large set of different volume types with different quality of service, backend, and location. Not all the volume types are exposed to all the users; we control everything with quota. Again, the Cinder nodes are VMs that run in the normal infrastructure. One of the problems that we have with Cinder is that when a volume is created, the cinder-volume server that created it is associated with it in the database. We have all of these running in VMs because we want an easy way to replace these nodes, but if we do that with Cinder, we also need to go through the database and change all the entries to point to the new server, which is really bad. If you want to know more about our storage infrastructure, in Vancouver we gave a talk; the link to the video is here.

And Ceilometer. The architecture of Ceilometer doesn't look as simple as the others. In fact, we have two Ceilometer infrastructures at CERN. On top, we have the Ceilometer infrastructure that we use to store all the notifications and all the samples that we use for accounting, and users have access to all this information through the API. At the bottom, we have the Ceilometer infrastructure that we use for alarming, for Heat. Okay, so let's go through this. I have the compute node: nova-compute sends notifications, and the notifications go through the cell RabbitMQ. Those are consumed by the Ceilometer notification agent, which publishes them to the central Ceilometer RabbitMQ. And that central Ceilometer RabbitMQ is now a very important piece of the architecture. Then all the notifications are consumed by the collector and stored in HBase.
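To make that flow a bit more concrete, here is a toy sketch of the relay idea: consume notifications from a cell-level broker and republish them to the central, Ceilometer-only broker. This is not the actual Ceilometer agent code or configuration; hostnames and queue names are made up, and it assumes the pika client, version 1.x or later.

```python
# Toy sketch only: not the actual Ceilometer notification agent, just the
# relay idea. Hostnames and queue names are hypothetical; assumes pika >= 1.0.
import json
import pika

cell_conn = pika.BlockingConnection(
    pika.ConnectionParameters(host="child-cell-controller.example.org"))
central_conn = pika.BlockingConnection(
    pika.ConnectionParameters(host="ceilometer-rabbit.example.org"))

cell_ch = cell_conn.channel()
central_ch = central_conn.channel()

# Queue names are illustrative; the real deployment uses oslo.messaging topics.
cell_ch.queue_declare(queue="notifications.info", durable=True)
central_ch.queue_declare(queue="ceilometer.notifications", durable=True)

def on_notification(ch, method, properties, body):
    """Forward every nova notification to the central Ceilometer broker."""
    event = json.loads(body)
    central_ch.basic_publish(exchange="",
                             routing_key="ceilometer.notifications",
                             body=json.dumps(event))
    # Ack on the cell broker only once the message has been handed off, so the
    # cell RabbitMQ holds messages only briefly and pile-ups stay on the
    # central broker.
    ch.basic_ack(delivery_tag=method.delivery_tag)

cell_ch.basic_consume(queue="notifications.info",
                      on_message_callback=on_notification)
cell_ch.start_consuming()
```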
In the past, we used MongoDB, but it was really hard to scale MongoDB to the numbers that we have today. Today, we store more than 15 terabytes of Ceilometer data every three months, and we can only keep the data for three months. Then we have the Ceilometer compute agent, which also runs on the compute node and sends the samples, over RPC, to the central RabbitMQ now; they don't go through the cell RabbitMQ. And again, those samples are consumed by the collector and stored in HBase. Then we have the other Ceilometer infrastructure, the one we use for alarming. What we do is to also send the CPU samples, via UDP, to a Ceilometer UDP collector, which stores everything in MongoDB. Because it's very little information, we keep everything in memory and it's very fast. The reason we have two different Ceilometer infrastructures is that when we started using it and enabled alarming, the first one was really slow to query for alarming; it could take a few minutes, depending on the query, to get an answer. So we decided to have a very small Ceilometer infrastructure where we only have the CPU samples, everything is in memory, and we only keep the data for a few hours, so it's extremely fast to query.

When you have thousands of virtual machines and thousands of nodes, Ceilometer can add a significant load to your APIs. We have more traffic on our APIs from Ceilometer than from all our users. That's why we decided to have a separate Keystone infrastructure and a separate Glance infrastructure, and in fact we also have separate Nova APIs for Ceilometer. In each children cell, we have a nova-api running that serves the metadata and also receives all the calls from Ceilometer. In this way, we split the Ceilometer traffic across all the cells, and it also queries only the cell database. What you can see in that plot is basically the number of calls that Ceilometer makes every hour: every hour, Ceilometer alone calls the Nova APIs around 130,000 times.

In the past, Ceilometer used the cell RabbitMQ of each cell for the notifications and samples. Do you remember that I said the new RabbitMQ, the central one, is a very important piece now? It is, because of this. In the past, we used the RabbitMQ in the cell. However, when we had a slowdown on HBase in our storage system, the messages started to pile up there, because they were not being consumed and stored. So the RabbitMQ in the cells could be compromised by that, and we could compromise the whole infrastructure, because that rabbit could go down given the number of messages piling up. The histogram that you see there shows what happened this September, now with the new rabbit. We separated that rabbit: the messages no longer pile up in the RabbitMQ of the child cell; they go to that central rabbit, and if they pile up there, they do not affect the children cells. That happened last September: you can see that we had a problem on HBase and you can see the messages queuing up. At that time, we had more than 15 million messages piling up there. And then, when the problem was fixed, everything started to be consumed again. And this was completely outside of the Nova infrastructure.

So, moving on: Rally. We have been investing some time in Rally during the last months. When we increase the number of cells, we also increase the challenge of keeping everything working.
And we need to make sure that each individual cell is performing well before and after being in production. So we started using Rally not only for benchmarking the infrastructure but also for functional testing. We have different scenarios to test the cells, and we run the tests every hour. For example, a very simple test is to create and delete an instance. Then, to have a historical view, we integrated Rally with our Kibana infrastructure, and now we have these great heat maps where we can see what was working and what failed over time. We are also interested in how long each operation in the test took, and you can see that in these plots.

So, this is my last slide: challenges for the next months. We will increase the capacity to 200,000 cores by summer 2016. We need to migrate thousands of VMs during the next months, to upgrade the 800 compute nodes and to retire old servers. We want to move to Neutron. Also, we are working on federation with other scientific institutes, and we are looking into Magnum and container possibilities. So, thank you so much. I don't know if I have time for questions; we are really on time. Okay, anyway, I will be around all week. So I think we have time for at least one question. Do you have questions? Yes. It's better if you use the mic, because the session is being recorded.

A question about data management. You said the accelerator generates a huge amount of data. What's the role of OpenStack and Ceph in data management? How does it cope with the bandwidth, and what bandwidth can you reach with your infrastructure? Thank you.

Okay. So the storage system that we have for OpenStack is Ceph, but we only use it for block storage and the Glance images. We don't use Ceph for the data that we collect from the experiments, from the LHC. For that, we have different storage solutions that were developed at CERN to store all that data. What happens is that the VMs that scientists create in our cloud then connect to those two storage systems that we have and get the data from there. It's not Ceph for that. Okay. So thank you so much.