Let's start. So hello, everyone. Good morning. My name is Belmiro Moreira. I work at CERN, and since CERN IT decided to build a cloud, I've been working on the OpenStack deployment at CERN. With this talk, I will give you an overview of the architecture, tools, and technical decisions that we made at CERN to build our cloud infrastructure.

So first of all, what is CERN? CERN is the European Organization for Nuclear Research. It was established in 1954, and the lab sits on the border between France and Switzerland, near Geneva. The organization has 20 member states, all European countries. CERN is one of the largest international scientific organizations in the world; over 10,000 scientists from more than 100 countries work at CERN. CERN's mission is to do fundamental research: how does gravity work, where is the antimatter? As you can see, not easy questions, but all the research that is done at CERN must be published or otherwise made generally available, like the World Wide Web, for example. The picture you are seeing is one example of the fundamental research done at CERN. This is the CLOUD experiment chamber. This experiment was designed to study the effects of cosmic rays on the formation of clouds. We really are in clouds, atmospheric clouds, to be clear.

So CERN operates a network of six particle accelerators, including the world's largest accelerator, the LHC. A particle accelerator is a machine that propels a beam of subatomic particles, accelerating them to very high speed, nearly the speed of light. Each accelerator in the chain increases the speed of these particle beams before delivering them to experiments for analysis, or to a more powerful accelerator in the chain. The particles in these beams are made to collide, and they collide close to the speed of light. This process gives physicists clues about how particles interact, and insights into the fundamental laws of nature. The LHC is CERN's largest accelerator. It's a 27-kilometer ring, it crosses two countries, and it is 100 meters underground. LHC stands for Large Hadron Collider, and it accelerates particles in two beams that travel in opposite directions and collide at four points, the experiments: CMS, LHCb, ATLAS, and ALICE. If you are interested in exploring the LHC and the experiments, you can now go and see them using Google Street View. It's really cool.

The detectors of the experiments are located in huge underground caverns. This picture shows the CMS detector, and you can feel the size of these machines. They have the size of cathedrals: up to 45 meters long, 25 meters in diameter, more than 12,000 tons. As a comparison, imagine a machine with the mass of the Eiffel Tower built 100 meters underground. Around 600 million times per second, particles collide in these giant detectors, generating even more particles. A detector like this is like a digital camera: it takes 100-megapixel pictures, but 40 million pictures per second. This produces about one petabyte of raw data every second. Of course, you cannot handle all of this data, so the experiments have trigger systems that select, in real time, the most interesting events, like this one, for example. After the last trigger level has been applied, still several hundred megabytes of data per second need to be stored. Per year, CERN stores around 30 petabytes of data. All of this data is stored in the CERN Computer Center in Geneva.
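(A quick back-of-envelope check of those detector numbers, as an illustration rather than an official CERN figure: the quoted picture size and rate alone already put the raw output at the petabyte-per-second scale.)

```python
# Rough order-of-magnitude check of the detector data rate quoted above.
# The bytes-per-pixel value is an assumption for illustration only.
PIXELS_PER_PICTURE = 100e6    # "100-megapixel pictures"
PICTURES_PER_SECOND = 40e6    # "40 million pictures per second"
BYTES_PER_PIXEL = 1           # assumed; real detectors compress and zero-suppress

rate = PIXELS_PER_PICTURE * PICTURES_PER_SECOND * BYTES_PER_PIXEL
print(f"raw rate ~ {rate / 1e15:.0f} PB/s")  # a few PB/s, i.e. petabyte scale
```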
It's a very old building from the 70s, full of history, and as you can imagine, it has hosted several generations of mainframes and several computer architectures. After several upgrades, the Computer Center now has a power capacity of 3.5 megawatts, and we have around 91,000 cores and a storage capacity of more than 200 petabytes. Because we reached the power capacity of the CERN Geneva Computer Center, since the beginning of this year we have a new Computer Center in Budapest, where we have 2.5 megawatts available and around 20,000 cores. We use this Computer Center mainly for computing, for the LHC jobs, and also to replicate our services into a remote location for resilience. The two centers are separated by more than 1,000 kilometers. They are connected by two independent fibers that support a bandwidth of 100 gigabits per second. All the infrastructure management is done from Geneva, and we only have the personnel in Budapest necessary for repairs.

So not too long ago, in 2011, we had 10,000 servers. These were dedicated physical machines for compute, for storage, and to run services. We also had the typical virtualization project for server consolidation, using Microsoft tools. It was a very popular service, with more than 3,000 virtual machines, but it was still very static, and we managed the nodes using our old-fashioned tools. Another problem was that different teams controlled and configured their own physical resources for each specific use case. Also in 2011, we were reaching the power capacity of the CERN Computer Center in Geneva, and we knew that at the beginning of 2013 we would get a new one. This raised several questions on the infrastructure side. How could we manage twice the number of servers without increasing the manpower? Would our tools, developed more than 10 years ago, scale to the numbers that we were aiming at for the next four years? We are talking about around 15,000 servers, around 300,000 cores. It was clear that we needed to change the way we managed the computer center.

So we started to look at how to build a cloud infrastructure, with three main goals in mind: improve how machines are managed, improve how machines are used, and also improve the response time to our users. To build the cloud infrastructure, we needed to identify a new tool chain. For configuration management, at the time we selected Puppet. As the cloud manager, two years ago, we selected OpenStack. For monitoring and storage, it wasn't clear at the time. For monitoring, we decided to keep the monitoring solution that we developed ourselves, called Lemon. For storage, we decided to evaluate the different solutions available and decide later.

Also, in 2011 it wasn't clear how to deploy OpenStack to thousands of nodes, so it was clear that we needed a strategy. The strategy was basically to follow the community: use the Puppet infrastructure that we were developing at CERN, use the OpenStack Puppet modules that the community was building, use SLC as the operating system, which stands for Scientific Linux CERN and is basically a Red Hat clone that we recompile, and use the packages from the distributions and the community. Not all the pieces were there when we started two years ago. The CERN Puppet infrastructure was under development, and we only had RPMs for Fedora 16. So we decided to deliver a series of preproduction infrastructures to gain experience and to test functionality. Our aim was to have a production service at CERN in the second quarter of 2013.
Well, we had three preproduction infrastructures. Because we expected to release and iterate much faster than the OpenStack cycle, we gave them their own names, animal names, also because it's cool. We started with a very small and fragile animal, and then we increased the animal's strength as our releases' feature sets grew. Our first infrastructure, we called it Guppy; it was released in June 2012 and was based on OpenStack Essex. It was deployed on Fedora 16 at the time, using the first Puppet modules from the community to configure OpenStack, and we used it basically for functionality tests, without any integration with the CERN infrastructure. Our second release was Hamster, and this one was based on OpenStack Folsom. We deployed this infrastructure on SLC6 and also on Microsoft Hyper-V, and we started the integration with the CERN network and CERN identity management. Our last preproduction infrastructure was Ibex, also based on OpenStack Folsom. We opened this infrastructure to a wider community because we wanted our users to test their applications in what was, for them, a new environment. In this infrastructure we added more than 14,000 cores. And finally, in July 2013, we released our production service at CERN, and we were really happy with the Grizzly name that the community chose.

These are some screenshots of Horizon. I put them here because we did some customizations for our infrastructure, nothing really major. You can see two buttons on the login page. One is to subscribe to the service: when a CERN user goes to the OpenStack website, they will see this login page, and if they haven't already subscribed to the service, they click that button and are redirected to the CERN service web page, where they can subscribe to the cloud service like any other service at CERN. The other button is basically a help button that redirects the user to our help page. Inside, we have a normal admin view, and we have a link to submit a ticket to our ticketing system if the user finds any problem. You can also see some of the projects; it's really difficult to see, but one of our major projects here is the batch project that processes the LHC jobs.

So what are the highlights of our infrastructure? We are using cells. We have two child cells, one in CERN Geneva, the other in Budapest. We have a highly available architecture. We are using Ceilometer, and we integrate OpenStack with the CERN network and identity infrastructure. We are monitoring all the OpenStack components, and we are using Ceph as the back end for Glance and Cinder. Since we deployed Grizzly three months ago, we have been adding about 100 compute nodes to our infrastructure per week. This means that right now we have around 20,000 cores in the infrastructure and more than 2,500 virtual machines. If you look at the number of cores versus the number of virtual machines, it's not that many virtual machines. The reason is that our batch system uses large VMs, with eight or more cores, to process the LHC jobs, because of the architecture of the batch system.

This is an overview of our architecture. In front you can see the load balancers, which are located in Geneva; then we have the top cell controllers, also in Geneva; and then we have the two child cells, one in Budapest and the other in Geneva. You can see that we have a normal architecture, with controllers and compute nodes. These are the components that we are running in each cell. On the top cell, we are running basically the Nova API and the Ceilometer API.
We are running Glance on the top cell, with the Glance API and the Glance registry, and also Cinder. We also have the Keystone for users, for authentication, and Horizon. On the child cells' controllers, we have basically the Nova services: conductor, scheduler, and nova-network. We are running the Glance API at cell level because we want to keep the image cache at cell level. We are also running the Ceilometer central agent and the collector. The way we support Ceilometer in our infrastructure is that we run the services on the controllers of the cells and they share the same database. We also have a Keystone there, basically for Ceilometer authentication and the other services. The compute nodes are nothing really interesting: we have nova-compute and the Ceilometer compute agent. And on the side, we have all the other components that we also run in our infrastructure, like HDFS, Elasticsearch, Kibana, MySQL, and MongoDB.

So we are using two different hypervisors in our infrastructure, KVM and Hyper-V. The reason for this is that we want to run Linux on top of Linux and Windows on top of Windows. This is not really for performance reasons; it's more for support: if the vendor controls the whole stack, it's easier to get an answer. And also because we wanted an easy migration path from our previous virtualization infrastructure to OpenStack. The whole infrastructure is deployed using Puppet, including the Windows compute nodes. And the other interesting fact is that VMs running on OpenStack can also be puppetized, and they share exactly the same infrastructure as the physical resources. We are using HAProxy as the load balancer, each cell has at least three controllers, and we have three availability zones per cell for users to spread their applications. As the message broker we are using RabbitMQ; we have at least three brokers per cell, and RabbitMQ is configured as a cluster with mirrored queues. As the database we are using MySQL. We have a MySQL instance for each cell, they run on top of Oracle CRS, and we have an active/slave configuration for the databases. For the storage we use NetApp, and we do backups of everything every six hours.

So now I'm going to talk about the OpenStack components, how we are using them, and some problems that we faced. Let's start with cells. Cells in Grizzly are experimental. The decision to use cells from the beginning was to have a scalable architecture to support thousands of nodes and to offer the resources of both computer centers transparently to our users. The first thing that we noticed when we moved to cells was that they lack some functionality. Things like security groups and live migration are not available with cells in Grizzly. Another problem was that we use aggregates for defining availability zones, and an aggregate is defined with a set of compute nodes. But because the top cell doesn't know anything about the compute nodes that are running in the child cells, they cannot be defined in the top cell. As a consequence, availability zones are not shown correctly for VMs, so we need to populate the database of the top cell manually to have the correct availability zone shown when a user does, for example, a nova show on a VM. Another problem is flavors. For some projects, we have private flavors for special use cases, and we need to define the same flavor in each cell. It would be really nice if we could define the flavor only in the top cell and have it propagated to all the child cells.
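Until such propagation exists, the per-cell duplication of flavors can at least be scripted. The sketch below is an illustration only, not CERN's tooling: it assumes an admin can reach a Nova API endpoint in each cell (in practice this might instead be done against each cell's database), and all credentials and URLs are placeholders.

```python
# Illustrative sketch: copy flavor definitions from the top cell to each child
# cell. Assumes Grizzly-era python-novaclient; endpoints and credentials below
# are placeholders, not CERN's.
from novaclient.v1_1 import client


def connect(auth_url):
    return client.Client("admin", "secret", "admin", auth_url)


top = connect("http://top-cell.example.org:5000/v2.0")
children = [connect("http://cell-geneva.example.org:5000/v2.0"),
            connect("http://cell-budapest.example.org:5000/v2.0")]

for child in children:
    existing = {f.name for f in child.flavors.list()}
    for flavor in top.flavors.list():
        if flavor.name in existing:
            continue  # this cell already has the flavor
        child.flavors.create(name=flavor.name,
                             ram=flavor.ram,
                             vcpus=flavor.vcpus,
                             disk=flavor.disk,
                             flavorid=flavor.id)  # keep IDs consistent across cells
```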
One of the biggest challenges that we faced was scheduling in cells. At the moment, in Grizzly, cell scheduling is completely random. If you create a virtual machine in a multi-cell environment, it could end up in a cell that you don't control, basically; you don't know where it will be created. For us this was a big problem, so we implemented a very small scheduler that selects the cell based on the project or based on the cell's available memory. Also, cell communication doesn't support multiple RabbitMQ servers, so it's a single point of failure. But apart from these problems, we are really happy with the functionality of cells, and most of these issues are now fixed in Havana.

We are still running nova-network. Let me first give you an overview of how the CERN network infrastructure works. All devices, in order to join the CERN network, need to be registered in the CERN network database with a MAC address and a static IP. This is true for laptops, desktops, mobile devices, and of course for all servers in the computer center, but also for virtual machines. This means that nova-network needs to know about these MAC/IP pairs and also about the network topology, because virtual machines cannot run everywhere. So what we did was implement a driver for nova-network that talks directly to our network database to get all this information and to do these operations. For Neutron, we are still evaluating how we can deploy it in our environment with a minimum of service interruption.

For the Nova scheduler, we have one scheduler per controller in the child cells, and we have several filters; I'm going to mention two of them that we are using. One is the image properties filter. The way we start Windows images only on Windows compute nodes and Linux images only on Linux compute nodes is because of this filter: for example, when we upload a Windows image, we also define the image property hypervisor_type equal to hyperv, and then the scheduler only selects the Hyper-V compute nodes to run this virtual machine. The other filter that we are using is the projects-to-aggregates filter. This one was developed by us, because we have projects that need dedicated, specific compute nodes. To allow this, we create an aggregate, we add those compute nodes to it, and then we specify which project or set of projects can use that aggregate. So if a virtual machine comes from one of those projects, the scheduler will only select compute nodes that belong to that aggregate (a simplified sketch of this kind of filter appears at the end of this section).

Availability zones. Per cell, as I said, we have three availability zones, and in OpenStack a user can choose, when creating a virtual machine, in which availability zone he wants the VM. The problem is that if he doesn't specify an availability zone at creation time, the virtual machine goes to a default availability zone that can be configured in nova.conf. For us this would mean one big availability zone, because normally a user doesn't specify which availability zone he wants. So we made a simple change: instead of a single default availability zone, we have a set of default availability zones, containing all our availability zones, and the best availability zone at that moment is selected for the node.
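To make the projects-to-aggregates idea more concrete, here is a minimal, hypothetical sketch of such a Nova scheduler filter. It is close in spirit to the upstream AggregateMultiTenancyIsolation filter of that era and is not CERN's actual code; the 'projects' metadata key is a placeholder, and only the isolation half is shown (keeping other tenants off the dedicated hosts), while pinning the listed projects exclusively to their aggregate would need an extra lookup across all aggregates.

```python
# Simplified, illustrative scheduler filter: only projects listed in a host
# aggregate's 'projects' metadata may use that aggregate's hosts.
# Not CERN's actual filter; written against Grizzly-era Nova interfaces.
from nova import db
from nova.scheduler import filters


class ProjectsToAggregatesFilter(filters.BaseHostFilter):

    def host_passes(self, host_state, filter_properties):
        props = filter_properties.get('request_spec', {}).get('instance_properties', {})
        project_id = props.get('project_id')
        context = filter_properties['context'].elevated()

        # Aggregate metadata for this host, e.g. {'projects': set(['tenant-a'])}.
        metadata = db.aggregate_metadata_get_by_host(context, host_state.host,
                                                     key='projects')
        allowed = metadata.get('projects')
        if not allowed:
            # Host is not part of any project-restricted aggregate.
            return True
        return project_id in allowed
```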
Nova conductor. Conductor is new in Grizzly, and it's great for security reasons and because it dramatically reduces the number of network connections to the database. So dramatically that, at the beginning, it was a bottleneck for us: we had one conductor per controller in each cell, but it wasn't enough. To fix this, we had to backport the code, now in Havana, that allows starting multiple conductor workers on the same server.

Nova compute. Well, I already said that we are running KVM and Microsoft Hyper-V, but what I really want to highlight on this slide is some functionality that the Hyper-V driver in Grizzly still misses: console access, metadata, resize, ephemeral disks, and Ceilometer support are not there in Grizzly. Some of these features are now available in Havana.

The Identity Service. At CERN we use Active Directory. We have more than 40,000 users, around 30,000 groups, and around 200 arrivals and departures per month. When we started this project, one of the main priorities was to integrate OpenStack with our identity system, so we are using Keystone's LDAP backend to communicate with it, and at that time we were one of the major contributors to the LDAP backend in Keystone. When a user wants to use the cloud service, he needs to subscribe to it. When he subscribes, clicking that button I showed, a personal tenant is automatically created for the user, with a limited quota, so he can test the infrastructure. We also support shared tenants that are created by the admins on request; these shared projects are basically for services. Having around 200 arrivals and departures per month, it was fundamental for us to define the project lifecycle, to avoid having orphaned resources in the infrastructure. So when a user leaves, the personal project and all associated resources are deleted, and if the user belongs to a shared project, he is also removed from that shared project.

Ceilometer. We deployed Ceilometer in our infrastructure. We don't bill our users, but we use the information provided by Ceilometer to adjust the project quotas for our users. We are using MongoDB as the back end; this database is sharded and replicated. The collector and the central agent run on the controllers of the child cells, pointing to this database, and so do the compute agents.

For the image service, we are still running the Glance API version 1. The reason is that the Python Glance client doesn't completely support version 2 yet: creating an image is not supported if you use the Glance API version 2 with the Python Glance client. Because we are using version 1, we still need the Glance registry, and it's running on our top cell. As the back end for Glance, we initially used a distributed file system, and then we moved everything to Ceph. In our infrastructure, we offer a small set of public images that we maintain for our users, and it's really difficult to make only the latest updated set available to them: if we remove the old images, then operations like resize and live migration will not work. So we keep all the images and we add a timestamp for when each image was created, which is not a very elegant solution. Users can upload their own images into the infrastructure, but because they don't pay for storage space, we really need per-tenant quotas in Glance; this is now available in Havana, too.

For Cinder, we are in the process of deploying it using Ceph as the back end. For that, we need to use the qemu-kvm package patched by Inktank to support RBD, and the other problem was that Cinder in Grizzly doesn't support cells, so we needed to backport a lot of code from Havana, and add some glue, to have it running with Grizzly in our infrastructure.
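Once that Cinder path is in place, the user-facing workflow is the standard volume create-and-attach flow. The example below is purely illustrative, using the Grizzly-era Python clients; names, sizes, endpoints, and credentials are placeholders, not CERN specifics.

```python
# Illustrative only: create a volume (Ceph/RBD-backed on the server side) and
# attach it to an existing VM. Assumes Grizzly-era python-cinderclient and
# python-novaclient; all values below are placeholders.
from cinderclient.v1 import client as cinder_client
from novaclient.v1_1 import client as nova_client

AUTH = ("demo", "secret", "demo", "http://keystone.example.org:5000/v2.0")

cinder = cinder_client.Client(*AUTH)
nova = nova_client.Client(*AUTH)

# Create a 100 GB volume; with an RBD backend this becomes an image in Ceph.
volume = cinder.volumes.create(size=100, display_name="scratch-data")

# A real script would poll cinder.volumes.get(volume.id).status until it is
# 'available' before attaching.
server = nova.servers.find(name="my-vm")            # placeholder instance name
nova.volumes.create_server_volume(server.id, volume.id, "/dev/vdb")
```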
The back end for Cinder, and also for Glance, is Ceph, and we have an instance of three petabytes available. We did several performance tests with this Ceph instance, and what I show here is a normal FIO test inside a VM against a volume. Our compute nodes don't have a dedicated network for storage, and their network devices are only one gigabit, and considering that, we are happy with Ceph's performance. When we moved Glance to the Ceph back end, we tested it with several images, and everything worked fine until someone uploaded a very big Windows image, and it failed without any log or error message, and then the debugging process began. It turns out the problem was that on SLC the default limit on the number of processes is 1,024, and Ceph can create one process per OSD available, and then it fails silently. The fix is easy, you increase the process limit, but of course it would be useful if Ceph used fewer processes, really.

For monitoring, we are still using our old tool called Lemon. This is the traditional monitoring: CPU usage, memory usage, network usage, and we use this tool for physical servers and also for virtual machines. But for us it was also important to monitor the state of OpenStack on each compute node, so for that we are reading and analyzing the log information from all the OpenStack components. We are using three open source tools for this: Flume, Elasticsearch, and Kibana. Flume is the transport layer, Elasticsearch is the search and analytics engine, and Kibana is the visualization tool that consumes this information. The log files are parsed and sent to the Flume gateway; then we send them to HDFS, only for log preservation, that's it; and we also send them to Elasticsearch, where all this information is indexed, and then we visualize all of it with Kibana, with dashboards that we designed for all the OpenStack components. Dashboards like this one, for example: here you can see the number of API requests that are hitting our servers, the time each request took, and the average time each API server took to handle the requests. Another example is this one, where you can see the amount of log messages in the system, info messages, warnings, errors, and the second plot shows which components are generating more logs. You can see that the component generating the most logs there is nova-api.

So what are the challenges for the next months? We want to move all the resources that support LHC physics into the OpenStack infrastructure by 2015. This means that per week we need to add more than 100 compute nodes to our infrastructure. Now with Havana, of course, one of the challenges will be to migrate the whole infrastructure to Havana without service interruption. We want, we need, to start using Neutron, since nova-network is deprecated. Also, we need to deploy Heat to have some orchestration in our cloud. In the particle physics community, Kerberos and X.509 are used for authentication; we want to have that in our infrastructure, and with Keystone external authentication that will be possible. And we also want to configure Keystone domains, which at the moment we don't have. So thank you for your attention. I'm happy to answer some questions.

So yeah, I'm sorry, I cannot hear you. There is a mic. I was wondering if you can say something about how the physicists at CERN are actually using the OpenStack infrastructure.
What sort of interfaces did they develop in order to actually do this? Right, okay. So we opened our infrastructure to the experiments, the large collaborations at CERN: LHC ATLAS, CMS, LHCb. They interact with our infrastructure, most of them, using the EC2 interface. Some of them are using Deltacloud to interact with our infrastructure, so basically EC2. And then, yes, we have a lot of individual users that use the Nova API to interact with the infrastructure. So they're using the cloud infrastructure as if these were individual nodes, and they're not using anything OpenStack-specific? They use the cloud infrastructure mostly to run their workloads, so the LHC jobs, the physics jobs.

How does Keystone perform with such a large Active Directory set of users? It seems like a lot of users. Do you notice any performance problems? Do you have to do any tweaks to make it perform better? Yeah, with a lot of load, sometimes. But we don't have so many simultaneous users using the system, so at the moment it's fine. Also, I showed in the slides that we have the Keystone for users and also a Keystone for services, because we are not caching tokens at the moment, so for Ceilometer, which is constantly querying the API, we have a separate Keystone.

Yes. Is Ceph used as a back end for the I/O nodes, especially for the physics data? No. Ceph at the moment is used as the back end for Glance and for Cinder. The Cinder part is not quite there yet; it will be in the next weeks. For all the LHC jobs, we are using storage systems that we developed at CERN, Castor and EOS, which are shared by the whole LHC community.

A question about your compute cluster: is it Ethernet or InfiniBand? Ethernet.

Hi. You didn't mention the network layer, so this means that you are using a flat network for your infrastructure? Exactly. And based on VLANs, I mean? No, just flat. Just what? Completely flat. It's our configuration. Okay.

Yes. I noticed that you're doing a lot of backports from Havana into your current production infrastructure. I'm curious, are there any limitations you've run into that are not supported in Havana? Is there anything you really want that has not been released in Havana yet? There are some features in cells that are not in Havana yet, for example the flavor propagation that I mentioned. Right. And the aggregates part, as far as I know, is not in Havana yet, and that for us is really important. Okay. Thank you. The scheduler part in cells was improved; maybe it will work for us now. Okay.

So between the two cells, do you replicate storage at any given time? I see hundreds of servers being migrated to the new data center. Do you replicate storage between the two sites? Replicate storage? No, we don't have replicated storage. Okay. So the LHC jobs that run in Budapest get their data from Geneva. Okay, cool.

Oh, yeah. I don't know if we have time for more questions. Let's check the time. Are we good on time? Yeah. Yeah. Do you deploy a Hadoop cluster to process big data? Hadoop? Yeah. So at CERN we have different teams, and we have a storage team, and they have a large HDFS installation. So basically we didn't set up anything; we use their HDFS infrastructure. So in terms of how they set that up, how much storage do you have for the HDFS storage system?
The storage for HDFS? Yeah. I believe it's really, really small, because we don't have that many logs, and we use it only for that. Okay. Yeah. Thank you. Well, if you have more questions, I'll be here all week, so talk with me. Okay. Thank you so much.