Yes, hello, everybody. The title of the presentation is about numerical weather prediction, NWP as we say in meteorology, and Earth observation data processing and computation in the cloud. For those of you that have seen the morning keynote, we are going to talk again about the European Weather Cloud and about ECMWF: what we do and what our mission is. Then we are going to talk about the European Weather Cloud and how we built it, the objectives, and the technical parts that I did not have the opportunity to discuss this morning. And then we are going to finish with a conclusion and the next steps.

So first of all, ECMWF stands for the European Centre for Medium-Range Weather Forecasts. We are an intergovernmental organization, established in 1975. Initially the headquarters were in the UK, and now we are spreading all over Europe, with our data center being moved to Bologna, Italy, and offices also in Bonn. We are an operational numerical weather prediction center and a research institute, and we work 24/7. Every day we run our model, the IFS, the Integrated Forecast System, and we disseminate its output globally to many organizations and to all our member states. You have probably seen and used our products in your daily life when checking the weather: the apps either use our model or a local model based on our model. Essentially, what we do is not only for meteorologists; it touches citizens globally.

We assimilate about 80 million observations per day, feed them into our model, and run the forecast for the coming days. We have a big archive of data and observations: everything we gather, we store and keep for years. That gives us about 250, actually 300, petabytes, with a daily increase of about 250 terabytes, which is not all of our daily production but part of it. We have an HPC facility which is one of the biggest in the world; every five years we get a new HPC, and we are now in the process of procuring a new one, based on Atos, which is on the next slide. We also have a cloud infrastructure that we contribute to the Copernicus Climate Change Service (C3S) and the Copernicus Atmosphere Monitoring Service (CAMS), and to WEkEO, which is one of the cloud platforms that give access to Copernicus services and to Copernicus data coming directly from the satellites. And as I said before, we have a big database of daily observations and of the output of our model.

This is more or less our HPC: the current state on the left-hand side, and what is being deployed in Bologna by the end of this year, which is going to be almost four to five times faster than the existing one and will probably put us again in the top 10 HPC centers in the world.

So we have the HPC and our users: a user base of thousands of users in the member states and in research institutes from our member states. But we want to make it easier for them to access our data and to process it closer to its physical location. That is why we came up with the idea of the European Weather Cloud, together with EUMETSAT, another European organization, which exploits the data coming from meteorological satellites. This is the vision of this community cloud, and we started it in 2019 together with EUMETSAT.
We built the cloud, and we also have to have all the governance around it: the technical policies, how the users are going to access the cloud, because we are not a public cloud service provider, we are a community cloud, and we do our best to provide the best service to our member states. All this started in 2019, and we expect the operational phase to begin by the end of this year, with new infrastructure being deployed in Bologna, in our new and big data center. But in order to arrive at this state, we started by building it ourselves: we used OpenStack, and Ceph for storage, we built it together with EUMETSAT, and we have a platform that we serve to our member states, with some use cases that are very interesting and that we are going to talk about later.

This is more or less the production line of ECMWF in general, with data acquisition, pre-processing, and then running the model. There are some steps missing here, but in general the block diagram depicts how the production workflow works at ECMWF. At the end, once the products are ready, they are archived and disseminated to the end users, either through the internet or through leased lines; we also have a meteorological data network, the RMDCN. The same data can also be accessed from the European Weather Cloud, which has a direct connection to our archive system and to the HPC. Users can post-process this data in the European Weather Cloud and then provide a reduced version or a visualization of it through front ends in public cloud service providers.

So this is the outline in general. We have the Ceph cluster behind all the basic components, and we have load balancers so that users have direct access to S3 buckets; this can be reached through a storage.ecmwf URL on the European Weather Cloud domain (there is a small access sketch at the end of this part). On the left-hand side is the EUMETSAT side, which provides more or less the same capabilities, the same infrastructure, for accessing data coming from the satellites. So ECMWF provides NWP data and EUMETSAT provides satellite data. Our users can have virtual resources on both clouds and can do whatever they like regarding processing: federating, or moving data from one side to the other, pre-processing on one side of the cloud and processing on the other. All the blue-shaded parts on the right-hand side are the internal systems: the dissemination system, our HPC system, the Fields Database. IFS is the Integrated Forecast System, and all these things can be accessed from the cloud.

Besides that, we try to extend this to our members' data: if member states would like to federate with the European Weather Cloud, the setup can be expanded to them without a lot of effort. If someone logs in through one of the cloud setups that reside either at EUMETSAT or at ECMWF, they can create virtual resources on any side of the available infrastructure. We have one use case with Met Norway, where we federated their cloud, and people from EUMETSAT or from ECMWF can create virtual resources there. This needs to be further exploited, and we anticipate that in the future it might expand further.
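To make the object storage access mentioned above a bit more concrete, here is a minimal sketch of listing a bucket over an S3-compatible endpoint such as the one exposed by the Ceph RADOS gateways. The endpoint URL, bucket name, and credentials below are placeholders rather than the real European Weather Cloud values, and boto3 is simply one common client for S3-compatible storage.

```python
import boto3

# Placeholder endpoint and credentials -- the real European Weather Cloud
# values are provided to member-state users; these are illustrative only.
s3 = boto3.client(
    "s3",
    endpoint_url="https://storage.example-weather-cloud.eu",  # hypothetical URL
    aws_access_key_id="EXAMPLE_ACCESS_KEY",                   # hypothetical
    aws_secret_access_key="EXAMPLE_SECRET_KEY",               # hypothetical
)

# List the objects in a hypothetical bucket served by the RADOS gateway.
response = s3.list_objects_v2(Bucket="example-forecast-output")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```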
All right, thank you. Hello, everyone, my name is Harry. I'm an OpenStack engineer at ECMWF. In the next couple of minutes, I will try to very briefly explain the OpenStack journey at ECMWF.

So, as my colleague already mentioned, one of the big issues is size. In meteorology, a terabyte of data is very small, and if you have to download data like that once or twice or three times a day, that is a big problem if you are far away. So, as my colleague said, the big goal was bringing the computation close to the data, in order to benefit from the speed of the local network.

Our journey started in 2019. The first deployment was done on OpenStack Rocky. We are TripleO people; there are other installers, we just chose to use this tool. The Ceph cluster will be covered in more detail by my colleague later on, but it sits outside the OpenStack deployment and is managed with MAAS and Puppet for its lifecycle. The original version looked like this: we deployed a standard physical undercloud, and with that undercloud we deployed a Rocky version of OpenStack, and that was enough in the beginning. But as the project moved on, we had to expand, add other systems, add more usability, and so on.

After a couple of months, the problem of updates came up, and we were in a rather difficult situation because it coincided with the move from CentOS 7 to CentOS 8 and to Python 3 and so on. The following diagram explains our journey. We decided to skip a few of the upgrades and deployed a second OpenStack overcloud based on Ussuri, so there is quite a gap between those versions, and you can also see that this time we had virtualized underclouds, mostly so we could snapshot them. We created the second infrastructure, and we were able to migrate the users from the Rocky infrastructure to the Ussuri infrastructure almost seamlessly, because those two infrastructures share the same Ceph backend. So it was reasonably simple to just ask the users for a couple of minutes of downtime and say: we will make it work for you. This journey continued, I think, from 2020 to 2021; we maintained both infrastructures for a while, then slowly phased the Rocky infrastructure away and moved everybody to Ussuri.

And this is the infrastructure as it looks today. As you can see, we have only the Ussuri infrastructure at ECMWF. We have a number of virtual machines that support that infrastructure with regard to monitoring, testing, licensing for the GPU software, and so on. The federation layer is also visible; we use another system for the federation, as my colleague already mentioned, with Met Norway for example, or EUMETSAT. On the slide, you can see the size of our current infrastructure, but it is expected to grow. Overall, we are very satisfied with the performance of the system. For Ceph, Vasily?

Yes, the Ceph cluster: we did the first installation in 2019, using the latest version of Ceph at that time, and then we upgraded to the latest version of Nautilus. Moving beyond that would mean containerizing everything and also changing the operating system; we wanted to do it, but for various reasons we have not done it yet. It is a very simple configuration: we have 23 Dell systems, with the operating system on two disks in RAID 1; all the other disks are either HDD or SSD, giving about 1.8 petabytes of storage, and the network interfaces are bonded at 25 gigabits per second each.
You can see the numbers on the slide: the monitors, 23 monitors, and the RADOS gateways. That was an experiment; we wanted to see whether we could have a lot of monitors without affecting performance. So far, so good: three years later we have not had any serious issue, and we have workloads that are very demanding. Some of them have a lot of files, especially the meteorological databases, which keep a lot of data and fields, and it has held up very well, so we keep it as it is. We stayed on Nautilus, but when we move to the new infrastructure in Bologna, we are going to have the latest and greatest version of Ceph. We have not had any big issue with Ceph, and that's it. Honestly, one thing I wanted to mention, because we talked about MAAS: if I were doing it again, I would probably have done it with Ironic instead of MAAS, although MAAS was more or less straightforward for me. Thank you.

And now back to the GPUs. It has become obvious at ECMWF that a lot of workloads are transitioning to machine learning, and GPUs have really become a necessity. So we now have a number of GPUs available in our infrastructure. We have opted for the vGPU setup instead of PCI passthrough; we believe that this achieves better utilization of resources. We have three different profiles for the users to use, and we will continue to expand and support this. We work with NVIDIA vGPUs for that.

There are also a number of other auxiliary systems supporting the infrastructure. The first one is the disk image builder component. We provide somewhat customized images of all the mainline operating systems; usually they have a few extra binaries that are specific to ECMWF. This is done only to make the transition easier for the users: they already have some of the binaries in the VM when it is spawned, no matter where it is spawned, whether at ECMWF, at EUMETSAT, or at other federated partners. Right now we support the operating systems shown on the slide.

There are also monitoring systems. We have Prometheus exporters which expose a lot of information; we gather data both from the Linux kernel and from OpenStack, and we also monitor some usage data. We support 35 member states, and each member state has different requirements, so we have to do our best to support each of them. Of course, one member state might have 100 VMs and another might have 10 VMs, so we monitor that as well. And we also monitor the infrastructure for other systems: there are operators, because we integrate with the HPC and the data handling system, so we also have to export that information to those users.

This is an example of the Grafana dashboards that we look at every day. There are more than this one, but as I said, we get a lot of information from all our systems, from Ceph and so on. Accounting is very important for us, because our budget comes from our member states, and when we distribute resources to member states, it is based on their contribution to our budget. So we monitor whatever they do on the cloud, the consumption of the resources, because we have to report on it every year and say: this is what you consumed. This is what we do with the HPC, and it is what we are going to be doing also for the cloud (a small sketch of such an accounting query follows after this part).
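Since the accounting just described is built on the Prometheus exporters, a minimal sketch of pulling per-project usage out of Prometheus over its HTTP query API might look like the following. The Prometheus URL, the metric name, and the label are assumptions for illustration; the actual exporter metrics and labels used at ECMWF may differ.

```python
import requests

# Hypothetical Prometheus endpoint, metric name and label -- the real
# exporter metrics and labels used at ECMWF may be named differently.
PROMETHEUS_URL = "http://prometheus.example.internal:9090"
QUERY = "sum by (project_name) (openstack_nova_limits_vcpus_used)"

resp = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query",
    params={"query": QUERY},
    timeout=30,
)
resp.raise_for_status()

# One result per project, e.g. one project per member state.
for result in resp.json()["data"]["result"]:
    project = result["metric"].get("project_name", "unknown")
    vcpus_used = result["value"][1]
    print(f"{project}: {vcpus_used} vCPUs in use")
```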
So, our user base: 35 member states. It is actually a cross-disciplinary user base, and machine learning and artificial intelligence also have a big footprint nowadays in meteorology. Some colleagues, not me personally, are trying to take parts of the IFS and run them with machine learning, both to improve the time the model needs to complete and because some processes cannot be perfectly mapped to a physical process or captured by an algorithm, so machine learning helps a lot. There are some other domains that we have reached out to as well, and we also have training use cases. As ECMWF we participate in many projects related to cloud computing and to edge computing, because we are interested in data coming from IoT devices.

Regarding research, what we do on the European Weather Cloud is provide the resources. And not only the images: we try to inject software that can be used by our member states, like the open-source tools developed by ECMWF, which are available in these images. The other thing is that users are very close to their data, so with an API call they can have access to petabytes of data. Not in real time, let's be realistic, but if someone asks for the temperature at two metres above the ground for the last decade at a specific place, they are going to get it relatively fast from our archive, even though we have to go through all the data sets (a small request sketch follows at the end of this part). This is more or less what we provide, and the interesting part is the use cases, which range from training, to data and machine learning, to operational system support and data processing.

Training, for example: during COVID it was excellent that we had the European Weather Cloud, because we had to find a training platform, so we ran it there. The funny thing is that most of the European Weather Cloud itself was actually built during COVID, and the upgrades were also done remotely: okay, we have this system installed there, and we will do our work remotely. We also have DWD, the German weather service, which has a very interesting lab for the ICON model, a local model for Germany and other places. Météo-France has a cloud-cover application which runs on both the ECMWF and the EUMETSAT side. We also have applications from the north, the south, the east, the Mediterranean, for hazards, predicting whether there is a flood or something else coming, and many other things from DWD. DWD and Météo-France are very supportive of the European Weather Cloud, and soon we are going to have an observation relay running on the European Weather Cloud for the WMO, the World Meteorological Organization. We also run some of our weather code there, where developers test their code with our data on our infrastructure.

There are many other interesting applications, but these are some highlights. The important thing is not the size of the cloud; it is the importance for the member states and the applications that they develop and make available, either to other parts of their organization, or to end users, or to European citizens. There are cases, like the hazards one, where, if a flood is coming, they have to mobilize the civil protection agencies in the area. This is what we do, and by doing that they are saving lives. This is really important, and I just wanted to underline it as well.
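To give an idea of what the "two-metre temperature for the last decade" type of request looks like, here is a minimal sketch of a MARS-style retrieval using the public ecmwf-api-client package. The dataset, date range, area, and access route are simplified assumptions; on the European Weather Cloud, users would typically go through the direct archive connection rather than the public API.

```python
from ecmwfapi import ECMWFDataServer

# A simplified, illustrative MARS-style request for 2 m temperature
# (parameter 167.128).  Dataset, dates and area are example values; the
# real request depends on the dataset and the user's access permissions.
server = ECMWFDataServer()
server.retrieve({
    "class": "ei",
    "dataset": "interim",
    "date": "2009-01-01/to/2018-12-31",  # roughly "the last decade"
    "levtype": "sfc",
    "param": "167.128",                  # 2 m temperature
    "stream": "oper",
    "time": "12:00:00",
    "type": "an",
    "grid": "0.75/0.75",
    "area": "52/-2/51/-1",               # hypothetical small area of interest
    "target": "t2m_decade.grib",
})
```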
We are monitoring floods, we are monitoring storms, we are monitoring particles around Europe, so it can be really, really important for somebody to have this data when they need it. Sometimes you realize the value when you see where we are contributing with all this infrastructure, down to the end users: planes that do not fly if they do not have the model, if they do not have the forecast, things like that.

So this is the vision of the European Weather Cloud. The meteorological community is very mature; we all know each other, the member states know ECMWF and vice versa. That is why we are trying to take this community cloud even further and add more members. The idea is that we have infrastructure as a service, and we are moving towards platform as a service and software as a service, so that we add more services and make them available to the end users: they can pick services themselves and use them as building blocks to create their own applications or run their own models.

The other thing is that we try to contribute to other initiatives. One is Destination Earth, one of the flagships of the European Commission for the coming 10 years, where ECMWF, EUMETSAT, and the European Space Agency work together to create digital twins of the Earth for extreme weather forecasting but also for climate change. Initially, by mid-2024, we are going to deploy the two initial digital twins, which will provide all the information needed for extreme weather forecasting, for floods, and for anything else that needs action from the stakeholders. We also try to contribute to the European Open Science Cloud, to GAIA-X and data spaces, and to EU-funded projects: we have a number of EU-funded projects on cloud computing and edge computing, because we are very interested in injecting IoT measurements, coming directly from IoT devices, into our model. That is more or less our presentation. If you want to ask something, and if we can answer the question, we are happy to do it. Thank you.

Do we have a question, maybe? Yes.

Thanks a lot for the talk, it was very interesting. I have two questions, actually. The first one is about the amount of data that you store. You said it is 250 terabytes a day, which is roughly 100 petabytes a year. How much of this data do you store permanently? Or do you basically use it to feed your models and make some predictions and then delete it? How much do you store permanently?

The 250 terabytes are permanent. Whatever we produce and whatever we receive, after filtering, we always store it in our database. We store it in the archive, actually. In the archive, yes; I say database, but I mean the archive.

Okay, so you store roughly 100 petabytes a year, is that right?

Yes.

Okay, interesting.

It is one of the biggest archives in Europe, and I think worldwide. It is the jewel of our organization.

Right, it is very close to what we store at CERN for the experiments, roughly the same ballpark, which is why I was asking. The second question I had is about the migration that you mentioned, from Rocky to Ussuri.
That was a little bit fast for me: how did you actually manage it? Because you said a couple of minutes of downtime for the users, and you moved from Rocky to Ussuri.

Okay. Yes, there was a little bit of voodoo behind the scenes, I will be very honest. There are two types of users. Some can migrate on their own and I do not have to do anything; I just tell them, delete your VM and recreate it over here. That is the easy part. (Test users.) Yes, that is the good user. The worst user is the one that has, let's say, one VM, and that VM has two volumes; a simple example, say 40 gigabytes per volume, rough numbers, it does not really matter, but they want that VM running, or they want very minimal downtime. Now, the two infrastructures both have Ceph as the backend, and Ceph is external, right? So Nova sees that backend; they both see the same backend pools. So what we did was shut down the VM, create a clone in Ceph, and then, with that new clone having a new UUID, instruct Cinder to see this new block device. The clone then suddenly appears in the new infrastructure, and with the volume already present I can recreate the VM; the IP is not a big problem (there is a small sketch of this procedure at the very end). I do not think that can be done for hundreds or thousands of VMs, but it worked for the VMs that were really critical at the time, because, as I said, there are the good users, and for many of them we could slowly, slowly make the migration to the new system.

How many users or VMs are we talking about in your case?

There were not that many, to be honest. And Rocky and Ussuri, as Harry said, point to the same Ceph pools, so that made it easy to migrate. We did that on purpose, because we knew we would have to migrate a few VMs rather than ask users to recreate them on the new cluster. There were not many of them; maybe a hundred for the critical ones, and the non-critical ones were simpler.

Cool, thank you.

Just because you mentioned CERN, and you are from CERN: you may know that part of our data handling system came from a donation from CERN. We took some robotic systems from CERN and moved them to Bologna.

Okay, thanks. Thank you. Do we have another question, maybe?

Yes, hello, thank you for your talk. I saw that in your Ceph cluster you are using hard drives and also SSDs. Yes. Are you using cache tiering, or separate pools for different workloads?

Actually, everything was a test, so we have not done a lot of configuration regarding the SSDs, if that is what you are asking; it was a fairly plain, vanilla configuration, I would say.

So all the different disks are together in the same pool, both SSD and hard drives?

When you create a pool, you can actually say whether you want it to be HDD or SSD, so there is some separation; that was done. And we have CRUSH rules to allocate the pools differently.

Okay, thank you.

I see no more questions. And that's it. Perfect, on time, 30 minutes. Thank you all for your time. Thank you very much.
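The migration path described in the answer above, stopping the VM, cloning its volume on the shared Ceph backend, and then having Cinder in the new cloud manage the clone, could be sketched roughly like this. The pool, image, volume, and host names are hypothetical, and the exact cinder manage arguments depend on the release and backend configuration, so this is an outline under those assumptions rather than the exact procedure that was used.

```python
import subprocess

def run(cmd):
    """Run a CLI command and fail loudly.  All names below are hypothetical."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Snapshot and protect the original Cinder volume image on the shared
#    Ceph cluster (the Rocky and Ussuri clouds point at the same pools).
run(["rbd", "snap", "create", "volumes/volume-OLD-UUID@migrate"])
run(["rbd", "snap", "protect", "volumes/volume-OLD-UUID@migrate"])

# 2. Clone it to a new image that the new cloud will take over
#    (optionally "rbd flatten" it to break the parent dependency).
run(["rbd", "clone",
     "volumes/volume-OLD-UUID@migrate",
     "volumes/volume-migrated-example"])

# 3. In the Ussuri cloud, ask Cinder to manage the existing RBD image.
#    The host string and id-type are examples and deployment specific.
run(["cinder", "manage",
     "--id-type", "source-name",
     "--name", "migrated-volume",
     "hostgroup@tripleo_ceph#tripleo_ceph",
     "volume-migrated-example"])
```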