Hello, my name is Belmiro and I'm a cloud engineer at CERN. In this session, we'll dive into the history behind the CERN cloud infrastructure. I had a lot of fun revisiting all this material again; I hope you enjoy it as well.

But first, what is CERN? CERN is the European Organization for Nuclear Research. CERN's mission is to do fundamental research, especially in high energy physics. It's one of the biggest international scientific collaborations in the world: more than 10,000 scientists from all around the world work at CERN. CERN provides the particle accelerators and all the infrastructure required for high energy physics research. That includes compute and storage resources.

So let's start. Let's go back 11 years, to 2008-2009. That was a time when we were experimenting with virtualization and server consolidation. A group inside the IT department started to experiment with virtualization, initially with Microsoft Virtual Server and later with Hyper-V. The idea was to offer virtual server resources to CERN users. The problem was that those tools were not built to offer such a service, so we built a custom web application that interacted with them and allowed users to create virtual machines. Only Windows and Scientific Linux images were available, but this was huge at that time. It was called CVI, the CERN Virtualization Infrastructure, and it was a very popular service in the organization at that time. By popular, I mean around 100 new virtual machines every month. Remember, this was 2008-2009.

At the same time, another group in the IT department was trying to virtualize the batch service. The batch service is the biggest IT service in terms of computing resources, and virtualizing it would bring a lot of advantages. The project was called LXCloud and it explored two cloud orchestrators: one was an open-source solution, OpenNebula, and the other was a commercial product, ISF from Platform Computing. The hypervisor that we used was Xen, because KVM wasn't stable enough and not very well supported in Scientific Linux 5 at that time. And we tested the limits of those products, because that was the big requirement for the batch service. We ran scalability tests with more than 16,000 virtual machines in both of these projects, which was huge for the time.

The early days of OpenStack. In 2010, there was the open-source release of this new project from NASA and Rackspace. You guessed it: OpenStack. At that time, if you were working in the cloud computing space, this was big news. There was a lot of press around it, and from the beginning it got the support of several big companies, which created a very vibrant community around it. So of course, we started to look into this new project. These were very, very exciting times.

This is just a funny slide. What you see here is probably the first presentation about OpenStack at CERN. This was in January 2011, so the presentation was based on the first OpenStack release, Austin. You can see that we needed to use nova-manage to create users, no Keystone at all at that time, and to interact with Nova, we needed to use the euca (EC2 API) tools. If you'd like to explore this presentation, you can still find it today in the link below.

Just another slide that I found in one of my presentations from 2011. This is a screenshot of an early development version of Horizon in 2011. It's quite different from what it looks like today.

Let's now talk about the CERN Agile Infrastructure project. This was the project that really brought OpenStack to CERN.
In 2011, CERN was still managing the data center using our custom-made tools. The majority of applications were still running on physical nodes, and we knew that in two years we would get a new data center in Budapest, more than 1,600 kilometers away. All of this to meet the compute and storage requirements of the Large Hadron Collider and for service continuity. Also, users were requesting more and more resources, and when using physical nodes this process could take a long, long time from the request to the provisioning. It was clear that we needed to change the tools, the architecture, and the way we managed our data center. New open source projects were now available for configuration management, monitoring, and resource provisioning, so we shouldn't continue to develop and maintain our own tools. Also, a private cloud would offer much better operational and resource efficiency.

Tim Bell, the CERN cloud infrastructure manager and also a board member of the OpenStack Foundation, presented the vision of the CERN cloud project at the OpenStack Summit in San Diego in 2012. All of the slides that you are now seeing are from his presentation at that time. To build our private cloud, it was clear that OpenStack was the right project. In these slides, we presented our plan to deploy OpenStack. Keep in mind that at that time we were running Scientific Linux 6, and the RDO packages and the initial Puppet modules had just been made available. One of the goals was to manage the new data center in Budapest using OpenStack.

The diagram that you see here shows all the projects' interactions at that time. We have Horizon, Swift, Glance, Nova, Cinder, Keystone, and of course Quantum there. We said OpenStack was complex in 2011; imagine today if someone drew a diagram like this.

To make this possible, our strategy was to build several prototypes to try different architectures and configurations, give them to the users, and iterate fast based on feedback and our discoveries. We had only a few nodes available for these initial tests, and it was clear that we needed to support two hypervisors from the beginning if we wanted to migrate into OpenStack the virtual machines from the Windows-based virtualization infrastructure, the CVI project that I talked about at the beginning, which was already running a few thousand virtual machines. In this timeline, you can see when we deployed these different prototypes, and of course notice the cute code names that we gave them. Functionality was added incrementally into each prototype. Also, we spent a lot of time with users to help them test their applications in the infrastructure and to understand their requirements.

From the beginning, we tried to engage with the OpenStack community. From early on, we started participating in the OpenStack meetups around us, in London, Zurich, and so on, and we also organized the Swiss and Rhône-Alpes user group meeting in 2013. Since the OpenStack Summit in Boston in 2011, we have participated in all of them. This photo is from the first OpenStack Summit that I attended, in 2012 in San Francisco. It's quite different from what we have today.

In July 2013, we moved from prototypes to a production infrastructure, meaning that we would not destroy the infrastructure anymore, at least not voluntarily, and we would support the running workloads of our users. We presented our production architecture for the first time in Hong Kong in 2013. This is one of the pictures from that talk. So this is what our initial deployment looked like.
We were running the most recent OpenStack release at the time, Grizzly. We had two cells, one cell in each data center. Cells version one, of course, and we thought that two cells would be enough for the next few years. The control plane was fault tolerant and it was running on physical nodes. Understanding resource utilization was important for us, so we deployed Ceilometer from the beginning. Everything was integrated with the CERN infrastructure: accounting, the network. Glance was backed by Ceph. In reality, this is not entirely true, because these slides are from November 2013; when we moved to production, the storage was backed by AFS, and only after a few months did we migrate all the images to Ceph. Inside the same cell, at that time, we supported KVM and Hyper-V. In terms of Rabbit, we had three RabbitMQ clusters: one for the main control plane and then one per cell. We had no idea how much the infrastructure would change over time.

This is a simple overview of the architecture and the different components that we were running at that time. Just a fun fact: you can see that we ran the first version of StackTach in our infrastructure to consume notifications. Later, version two required Swift, so we needed to drop this component.

This graph shows the fast growth of the infrastructure in number of VMs. In just a few months, from July 2013 to the beginning of October 2013, we were managing more than 2,000 virtual machines. Here, you can see the growth of the cloud infrastructure over the years: the total number of VMs created, cumulative, on your left, and the total number of VMs running on your right. You can see that we reached VM number one million about a year and a half after moving into production.

It's very difficult to maintain this slide. Here we can see the projects that we ran in each release. In 2013, only Nova, Glance, Horizon, Keystone, and Ceilometer. And today, look at the number of projects that we offer to our users, including Magnum, Barbican, Ironic, and Manila.

This is, again, just another funny slide. Some years ago, we tried to draw our architecture on the whiteboard in my office, and this was the result, with all the service interactions. You can see components like Ceilometer, still there at the time, Nova, the load balancer on top, Magnum, Cinder, and many other projects that we were running at the time.

I will now go very fast through some of the milestones of our cloud infrastructure that I find important or very interesting.

Nova cells. We have been using Nova cells from the beginning. Initially, only two cells, version one, and later we upgraded to cells version two. Currently, we run more than 80 cells in our infrastructure.

Ceilometer, the rise and fall. Running Ceilometer at scale was a big challenge. It had a complex architecture and some design flaws that made it very difficult to deploy, manage, and actually retrieve data from. So after three years, we decided to remove this component from our cloud.

We started to offer Cinder in 2014, backed by Ceph, and a few years later we also started to offer Manila and an S3 endpoint, both backed by Ceph.

In 2016, we moved Magnum into production. Since then, our users can create Docker Swarm and Kubernetes clusters. It's a very popular service in the organization, with more than 500 clusters.

Network. In our initial deployment, of course, we were using nova-network, and we can still see some of that history in our cloud. We still have some old cells that run nova-network. However, since a few years ago, all the new cells run on Neutron.
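To give an idea of what this looks like from a user's point of view, here is a minimal sketch using the generic OpenStack SDK for Python, not our internal tooling: it creates a Neutron network and boots a VM attached to it. The cloud name, image name, and flavor name are placeholders, not actual CERN offerings.

```python
import openstack

# Connect using a clouds.yaml entry; "mycloud" is a placeholder name.
conn = openstack.connect(cloud="mycloud")

# Create a private network and a subnet through the Neutron API.
network = conn.network.create_network(name="demo-net")
subnet = conn.network.create_subnet(
    name="demo-subnet",
    network_id=network.id,
    ip_version=4,
    cidr="192.0.2.0/24",
)

# Boot a VM attached to that network through the Nova API.
# Image and flavor names are placeholders.
image = conn.image.find_image("cc7-base")
flavor = conn.compute.find_flavor("m2.small")
server = conn.compute.create_server(
    name="demo-vm",
    image_id=image.id,
    flavor_id=flavor.id,
    networks=[{"uuid": network.id}],
)
server = conn.compute.wait_for_server(server)
print(server.name, server.status)
```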
We hope to finally migrate the old cells from nova-network to Neutron soon.

Bare metal. Ironic has been in production since 2018. Currently, more than 5,000 physical nodes are managed by Ironic, and the goal is to enroll all the servers available in the data center into Ironic.

We try to automate as much as possible. To help with that, we have been using Rundeck and Mistral for a long time now.

Through all these years, we have been facing several operational challenges. We try to talk about them as much as possible in our blog posts on the tech blog, and also during all the OpenStack presentations that we have been giving over the years. So now, some of the operational challenges that we have faced.

Let's start with upgrades. We upgrade to every release, and the cycle is only six months. We also upgraded from Scientific Linux 6 to CentOS 7, and now we are again in the process of a major operating system upgrade, from CentOS 7 to CentOS 8. For a few years we supported KVM and Hyper-V in the same infrastructure. Actually, we migrated all the virtual machines from the old infrastructure, CVI, to OpenStack Hyper-V, and then finally to OpenStack KVM. And of course, there is the challenge of the latest security updates that we have been facing, which required the reboot of almost all the infrastructure.

In 2019, we introduced regions into the infrastructure. Actually, we split our production infrastructure into two regions. We wrote a blog post describing how we did it. The main reason for this split was Neutron scalability. Today, we have three production regions. And just a few months ago, we introduced preemptible instances into our production infrastructure, which allow projects to use the spare capacity. We also wrote about it; you have the link to the blog post if you're interested.

Something that we've been doing from the beginning is working together with the OpenStack community, and I really believe that is part of the success of the CERN cloud infrastructure. We share our experiences in OpenStack Summit presentations and user meetings. We won the first OpenStack Superuser Award in Paris. We have PTLs and core members in different projects like Magnum and Ironic. And we have organized several OpenStack events, like the OpenStack Day last year. Even during these difficult times, we continue to participate remotely in several events, like OpenDev and OpenInfra Days UK.

So, 2020, what's next? We are looking at leveraging container orchestration to deploy the OpenStack control plane, enrolling the existing physical resources into OpenStack Ironic, exploring the introduction of GPU resources, and moving all the remaining resources from nova-network to Neutron. We are also exploring how to provide machine learning platforms and functions as a service to our users.

Here you can see a snapshot from one of our monitoring dashboards. You can see the number of cores available, the instances running, and the number of users and projects that we manage in our cloud (a minimal sketch of pulling similar numbers with the OpenStack SDK is included below).

A brief summary of the past 10 years. During the last 10 years, the resource management and deployment model changed completely: from virtualization and server consolidation to a cloud infrastructure, from bare metal to virtual machines to managed bare metal to containers. We continue to adapt the infrastructure to new technologies and requirements: a control plane managed by Kubernetes, new regions, preemptible instances.
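Going back to the monitoring dashboard I showed: as a rough illustration, aggregate numbers like these can be pulled from the OpenStack APIs with the SDK. This is only a sketch, assuming an admin-capable clouds.yaml entry called "mycloud" (a placeholder), and it is not our actual dashboard code.

```python
import openstack

# Connect with admin credentials; "mycloud" is a placeholder clouds.yaml entry.
conn = openstack.connect(cloud="mycloud")

# Hypervisors and total cores, from the Nova hypervisor API (admin only).
# Note: on recent Nova API microversions, per-hypervisor core counts are
# reported through Placement instead, so vcpus may not be populated.
hypervisors = list(conn.compute.hypervisors(details=True))
total_cores = sum(h.vcpus or 0 for h in hypervisors)

# Instances across all projects, plus Keystone projects and users.
instances = sum(1 for _ in conn.compute.servers(all_projects=True))
projects = sum(1 for _ in conn.identity.projects())
users = sum(1 for _ in conn.identity.users())

print(f"hypervisors={len(hypervisors)}  cores={total_cores}  "
      f"instances={instances}  projects={projects}  users={users}")
```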
And I would not like to finish this presentation without mentioning all the people, staff members, fellows, project associates, technical students, summer students, and members from other labs, who have contributed to the deployment of the CERN cloud infrastructure. Thank you to all of them; this list is not in any particular order. I'm happy now to answer all your questions using the chat platform. Also, you can connect with me on Twitter. Thank you for watching this presentation. I hope you liked it.