Hello, thank you all for joining me today. My name is Vasileios Baousis from ECMWF, and my presentation is about how ECMWF built the European Weather Cloud infrastructure with open source software: the journey from the start until today, and its future.

The agenda of this presentation is as follows. First, I will talk about ECMWF, which stands for European Centre for Medium-Range Weather Forecasts: its mission, its infrastructure (HPC and cloud), and the new data centre which is being built in Bologna, Italy. Then I will present the European Weather Cloud infrastructure, its objectives, architecture and use cases, and then discuss the Ceph and OpenStack clusters, our experience, the decisions made and the problems encountered. Finally, I will present our next steps.

So, ECMWF is an intergovernmental organisation established in 1975. It has 34 member and cooperating states: 22 members and 12 cooperating states. The headquarters is in Reading, UK, and soon our data centre will be moved to Bologna, Italy. ECMWF's core mission is to produce numerical weather forecasts, to monitor the Earth system, to carry out scientific and technical research to improve forecast skill, and to maintain an archive of meteorological data. ECMWF also operates two services of the European Union's Copernicus Earth observation programme, namely CAMS, which stands for Copernicus Atmosphere Monitoring Service, and the Climate Change Service, C3S, and it also contributes to the Copernicus Emergency Management Service.

We are at the same time an operational numerical weather prediction centre, a research institute, and a 24/7 operations centre. Twice per day, operational weather forecasts are generated and disseminated globally. ECMWF assimilates 60 to 80 million observations per day and maintains an archive of petabytes containing both observations and forecast data. Our HPC facility is one of the largest among weather sites. Our cloud services include, besides C3S and CAMS as I said before, WEkEO, which is one of the European Commission's DIASes (DIAS stands for Data and Information Access Service), and the European Weather Cloud infrastructure, which is the topic of this presentation. Finally, we maintain an archive of meteorological data with a size of about 250 petabytes and a daily growth of 250 terabytes.

This is the graphical representation of our production workflow, from data acquisition, as we see here, to pre-processing, where observations are gathered from various sources (ground stations, satellites, aircraft, ships, etc.), filtered and normalised, and then provided as input to our forecast model. After the model run, product generation creates the tailored model output for our end users, and finally the data is archived and disseminated to the end users through the internet, leased lines, and the Regional Meteorological Data Communication Network (RMDCN).

This diagram shows, more or less, the two identical Cray XC40 clusters at the top of the slide, how they are connected to our data handling system and other internal systems, and how they are connected to the internet and the RMDCN. Soon our data centre will move to Bologna, Italy, where it is currently being built, and our new HPC system will take over ECMWF's existing mission-critical operational forecast services and the production and research activities, roughly in 2021.
In a resilient overall system configuration, the new HPC will provide several thousand compute and post-processing dual-socket nodes, with high-bandwidth access from all compute resources via native Lustre parallel file system mounts. Compared to the current HPC facility, the new system will enable a significant resolution increase in the operational data assimilation and forecasting system, as well as a substantially increased capacity for research activities whenever resources are not being used for operational forecasting. The new system will deliver an increase in sustained performance by a factor of about five compared to ECMWF's current high-performance computing facility. This will enable a range of advances and contribute to forecast improvements.

The European Weather Cloud started with a pilot project in January 2019, which is a collaboration between ECMWF and EUMETSAT. The basic goal is to bring the computational resources of the cloud closer to the big data that we have, which comprises the meteorological archive and satellite data, so that users from our member states can run their data processing close to their data source, storage and HPC, complementary to their current access to our HPC computing facilities. The European Weather Cloud is also aiming at building a cloud federation with the member states. So the project includes building the necessary infrastructure, organising and implementing use cases, and addressing technical challenges, policy challenges, and also governance challenges.

The pilot infrastructure was built with open source software, Ceph and OpenStack, using TripleO. The current state of the infrastructure comprises two OpenStack clusters, one built with the OpenStack Rocky version and another one with Ussuri. For both clusters, the undercloud systems are virtual machines hosted on the same physical host; the clusters use the same physical and logical networking (same provisioning network, internal and external networks, VLANs, etc.) and are accessed through the same cloud federation orchestration platform. User access is totally transparent, so users cannot identify which of the two clusters they are running on. Both clusters connect to the Ceph cluster through Ceph's public network interfaces, which are on the same subnet as the public networks of the two OpenStack clusters. The Ceph cluster is accessible from anywhere on our LAN, and from the internet through our external load balancer, which gives access only to the RADOS Gateway. Ceph provides block and object storage to both OpenStack clusters. The total hardware of the current configuration comprises approximately 3,000 vCPUs; the RAM of both clusters is about 21 terabytes, and the storage is about one petabyte of HDD and SSD. In the new cluster built on Ussuri we also have 10 NVIDIA Tesla V100 GPUs.

This slide, if you remember the previous one, depicts where the European Weather Cloud actually sits and how it communicates with our internal systems. OpenStack has two provider networks, one towards the internet and another one connecting OpenStack to our data handling system, HPC and ECMWF internal systems. These two networks are totally separated, and the VMs connect to them either with directly attached NICs and network interfaces or using floating IP addresses. These two provider networks give a lot of flexibility to the workloads running on the VMs.
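The talk doesn't show how such a dual-homed VM is created; as a minimal, hedged sketch with the openstacksdk (the cloud entry, image, flavor and network names are all hypothetical, not the actual European Weather Cloud resource names), booting a VM with a NIC directly attached to each provider network could look like this:

```python
import openstack  # openstacksdk

# "ewcloud" is a hypothetical clouds.yaml entry.
conn = openstack.connect(cloud="ewcloud")

image = conn.compute.find_image("ubuntu-20.04")
flavor = conn.compute.find_flavor("m1.large")
net_public = conn.network.find_network("external-internet")
net_data = conn.network.find_network("external-data")

# Attach the VM directly to both provider networks, so it gets a NIC
# (and an address) on each of them and needs no floating IP.
server = conn.compute.create_server(
    name="demo-vm",
    image_id=image.id,
    flavor_id=flavor.id,
    networks=[{"uuid": net_public.id}, {"uuid": net_data.id}],
)
server = conn.compute.wait_for_server(server)
print(server.addresses)
```

Direct attachment like this is what lets most of the VMs skip floating IPs entirely, at the cost of the static-routing issue discussed later.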
In this slide, this block diagram depicts the physical and logical connectivity of OpenStack to Ceph and to internal systems like the data handling system and HPC. Direct access to the internet is provided through our load balancer to the RADOS Gateway. At the bottom here is the Ceph cluster, whose public interfaces are accessed from the OpenStack clusters as well as from the various internal systems. The RADOS Gateway is also accessed from the internet through our load balancer; its FQDN is under europeanweather.cloud, which is the domain name of the European Weather Cloud, naturally. Then, at the top right, are the two OpenStack clusters, with a public network towards the internet. Below is the data network, which connects OpenStack to HPC and to our internal systems, like the data handling system, the meteorological archive systems and our dissemination system. Finally, our users can have access to the equivalent infrastructure at EUMETSAT through the internet, and to our Ceph, OpenStack and internal systems.

So, the main use cases. The users of the European Weather Cloud are its 34 member and cooperating countries. The European Weather Cloud also has potential for cross-discipline research, and for machine learning and artificial intelligence using climate and weather data in other research domains like finance, GIS and others. It also enables research and development of new services by providing computational resources to ECMWF's member states to run their data processing close to the data sources, storage, archive and HPC, which is complementary to the current access to our HPC computing facilities. We also provide reconfigurable virtual resources for performing big data analytics and machine learning, so a user will be able to instantiate and reconfigure an Apache Spark cluster or a TensorFlow cluster and perform their computations. There is also a user-friendly environment with pre-installed, pre-configured software for development and data access tools, and a standardised data access API spanning many scientific disciplines, to enable users to access and process data sets from different sources and fields (see the sketch after this slide). Readily available cloud and HPC resources will allow ECMWF member states and the research community to access and process forecast products in a timely manner, and therefore weather forecasts and products will reach end users, like European citizens and not only, faster. Third-party users could also use the standardised data access API to prototype and deploy their own software, offering value-added products and services to industry, research and society.

Some of the use cases we currently run: the first is from DWD, the German meteorological service, a use case offering notebooks to train on and develop the ICON model, which runs on our infrastructure. KNMI, the Royal Netherlands Meteorological Institute, is there with the Climate Explorer, which also runs on our infrastructure. Oxford University is offering Jupyter notebook environments for machine learning on weather and climate data sets. And, because there was an earthquake back in March 2020, we supported the Croatian meteorological service by hosting on the European Weather Cloud those components of their IT infrastructure that could not run on their local infrastructure.
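The talk mentions a standardised data access API without naming a concrete one. As a hedged illustration only, one existing example of programmatic access to ECMWF-operated data is the CDS API (cdsapi) used for C3S data sets; a minimal retrieval, assuming a configured ~/.cdsapirc with valid credentials, looks like this:

```python
import cdsapi  # pip install cdsapi

# Retrieve ERA5 2 m temperature for a single analysis time as NetCDF.
client = cdsapi.Client()
client.retrieve(
    "reanalysis-era5-single-levels",
    {
        "product_type": "reanalysis",
        "variable": "2m_temperature",
        "year": "2020",
        "month": "03",
        "day": "01",
        "time": "12:00",
        "format": "netcdf",
    },
    "era5_t2m_20200301_12.nc",
)
```

Running this kind of retrieval from a VM inside the cloud is exactly the "processing close to the data" pattern described above, since the data network connects the VMs to the archive.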
So first, let's start with how the different components of the European Weather Cloud infrastructure were built.

Ceph is built and maintained separately from OpenStack; as I mentioned before, this gives us a lot of flexibility in building different clusters on the same Ceph storage. The operating system of the hosts is the latest CentOS 7, and the Ceph version is the latest Nautilus. The hardware is Dell systems; each storage node has 26 disks, and the total gives us about one petabyte of storage. Networking uses two 25 GbE NICs per node for the cluster network and another pair for the public network, and the nodes have 192 gigabytes of RAM. In total we deployed three monitors, five managers and three RADOS gateways, which are load balanced, and we have 552 OSDs. The first build was about one and a half years ago, and it was expanded to its current capacity about six months ago. Because we wanted to make sure we provide the best service with this cluster, we had a third-party validation; the suggestions were some minor improvements, which we have since performed on our cluster. Both OpenStack clusters use the same Ceph infrastructure, and actually the same RBD pools. Besides the usual HDD failures, Ceph performs well. We plan to gradually move the host systems to CentOS 8 and then upgrade to the latest Ceph version, Octopus, probably by the end of this year, and we plan to do that on a live cluster: first we are going to do a lot of testing in our development environment, and then migrate our cluster.

Now about OpenStack. The first cluster was built in September 2019, based on Rocky with the TripleO installer, and at the same time we created a development environment with OpenStack and Ceph clusters identical to the main ones. Our configuration experience: the deployment, about 1,600 vCPUs with 21 terabytes of RAM, was straightforward. The external Ceph cluster worked with almost minimum effort, by simply configuring the external Ceph YAML with not a lot of modification, just the IP addresses of the monitors and the keys, and it worked out of the box. The two external networks, one public-facing and another for fast access to our 250-petabyte archive, were straightforward. But most of our VMs are attached to both external networks, so most of the time we don't use floating IP addresses for the external networks, and that was a challenging issue for the VMs, because on the switches we don't use dynamic routing. The workaround was to use DHCP hooks, configuring the VM routing before we make the images available to the users (see the sketch below). Our images work without any problem on any OpenStack; there are only minor modifications, and nothing prevents these images from being used elsewhere. There were also some problems that we encountered with NIC bonding interfaces in the beginning, in combination with the configuration of our switches, so we decided not to use an LACP configuration, and we ended up with a single-NIC deployment for OpenStack; however, the provider network was kept separate.
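The hook itself is not shown in the talk. As a minimal, hedged sketch of the idea, with all addresses and interface names hypothetical, a boot-time or DHCP exit hook baked into the VM images could pin a static route to the archive/data subnet via the data NIC's gateway, since the switches provide no dynamic routing:

```python
#!/usr/bin/env python3
"""Sketch of the DHCP-hook routing workaround described above.

All names and addresses are hypothetical; the real hook would be
shipped inside the VM images before they are offered to users.
"""
import subprocess

# Hypothetical values: the data/archive subnet reachable only via the
# second (data) provider network, and the gateway on that network.
DATA_SUBNET = "10.100.0.0/16"
DATA_GATEWAY = "192.168.200.1"
DATA_IFACE = "eth1"  # NIC attached to the data provider network


def add_data_route() -> None:
    # 'ip route replace' is idempotent: it adds the route, or updates
    # it if it exists, so the hook can run on every DHCP renewal.
    subprocess.run(
        ["ip", "route", "replace", DATA_SUBNET,
         "via", DATA_GATEWAY, "dev", DATA_IFACE],
        check=True,
    )


if __name__ == "__main__":
    add_data_route()
```

With a hook of this shape, the default route stays on the public NIC while archive traffic is forced onto the data network, which matches the "no floating IP, two direct NICs" setup described above.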
We encountered some known problems with the load balancer. One of them, the one I present here from Launchpad, was that the Octavia certificates were overwritten each time you redeployed your cluster, so the load balancer was not working properly. As soon as we found some workarounds, we updated our live system, and we moved the whole cluster from a single-NIC deployment to a multiple-NIC deployment on a live system, totally transparent to users, with zero downtime. The whole first cluster was then redeployed, and the network was reconfigured with distributed virtual routing for better network performance. In general the performance is good, we're happy about that, and we haven't had any problems since September.

In March we added some hardware to our OpenStack and Ceph clusters, and we decided to investigate upgrading to the newest version of OpenStack. First we converted our Rocky undercloud to a VM for better management, and the Ussuri one as well, also as a safety net for backups and recovery if something went sideways. From March to May we investigated and tested upgrading to Stein, first the undercloud and then the overcloud, in a test environment; during this, the installer also migrated from Docker to Podman, which was very interesting and straightforward. Updating was possible from Rocky to Stein, then to Train and finally to Ussuri, but because the latest versions of OpenStack were based on CentOS 8, we decided to skip some upgrades. That's why we decided to use Ussuri, which was released in May and was also based on CentOS 8. So we made a jump from Rocky to Ussuri, a three-version jump, and deployed the new version directly with Ussuri.

The second cluster's first build was on the 30th of May 2020; this is a screenshot we took when we deployed the new cluster, 17 days after Ussuri was officially released. This cluster was a plain vanilla configuration, which means that even though the network was properly configured, with OVN, with provider networks and everything, and 25 nodes, we didn't yet have any integration with storage, with Ceph, as you can see below. At that time we tried to evaluate the new cluster with 25 nodes, so we ran some tests and found some problems, and when we tried to add some more features we had some other problems.

Some of the problems, changes or challenges of the new build: the new deployment method, using Ansible rather than Mistral, had some hiccups, for example around the user used to deploy the stack (instead of stack it was heat-admin, with some implications from that). CentOS 8 as the base operating system, for both the host system and all the containers, was something we had to understand and master very fast. Also, we configured with OVN and not OVS, and we found that there are some implications regarding assigning floating IP addresses, so we hit some problems, we reported them, and we got a lot of help from the community. One of those was that Octavia and Ceph had similar problems, caused by permission issues with the Ansible user, which prevented read/write access to the configuration folder of the overcloud, usually located at config-download/overcloud: the Octavia and Ceph deployment steps cannot write to this folder.
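The talk only describes this workaround in words. One plausible form of it, sketched here under the assumption that the deploy user is stack and the path is the config-download/overcloud directory mentioned in the talk (both may differ per environment), is to hand ownership of that tree back to the deploy user before re-running the Octavia and Ceph deploy steps:

```python
import os
import pwd

# Hypothetical values matching the talk: the Ansible deploy user and
# the config-download working directory of the "overcloud" stack.
DEPLOY_USER = "stack"
TREE = os.path.expanduser("~/config-download/overcloud")

pw = pwd.getpwnam(DEPLOY_USER)
# Recursively restore ownership so the deploy steps can read and
# write their configuration again (run with sufficient privileges).
for root, dirs, files in os.walk(TREE):
    for name in dirs + files:
        os.chown(os.path.join(root, name), pw.pw_uid, pw.pw_gid)
os.chown(TREE, pw.pw_uid, pw.pw_gid)
```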
With that workaround we managed to deploy both Octavia and Ceph. The other problem was with OVN and the assignment of floating IP addresses when you have DVR enabled: when using OVN, a floating IP address did not work if you tried to connect from a system in the same public subnet, while it did work otherwise, which was a bit confusing because it was on the same VLAN. The underlying problem is that the external MAC address should not be set in the NAT table entry of a floating IP address which belongs to a logical port of the tenant's logical switch. This had supposedly been fixed in previous versions of networking-ovn (in 4.0.2, then in 5.1.0, and again in 6.0.1), but at that time it was still not resolved for us; we deployed everything with OVN nonetheless.

Currently, regarding GPUs: the configuration of the GPUs was also straightforward. We had some problems with IPv6: we haven't implemented IPv6 in our cluster, and because OVN was trying to bind IPv6 addresses at boot time, we saw increased booting times. A workaround was to explicitly remove the IPv6 configuration from all our GPU nodes. All nodes with GPUs also serve as normal compute nodes, and we configured the Nova configuration with our own Ansible playbooks; it was much easier this way to configure the GPU profiles offered, without changing the compute nodes or their deployment and things like that. As you can see in this table, the GPUs were installed on the compute nodes, and we offer five different profiles, ranging from the full configuration, where the entire GPU is assigned to one VM, down to partitioning the GPU into four vGPUs and assigning them to four different VMs running on the same node. All the GPU profiles are assigned to VMs using specific VM flavors for each profile.

Finally, our next steps. Regarding infrastructure: integration with our other internal systems for better monitoring and logging; gradually phasing out the Rocky cluster and moving all nodes to Ussuri; and operating, maintaining and upgrading to new versions, always trying to follow the latest versions of both OpenStack and Ceph. On federation, we plan to federate our cloud infrastructure with our member states, and we have a potentially good use case to federate, but it is still work in progress. We also plan to integrate and interface with other projects: the European Weather Cloud will be interfacing with the Digital Twin Earth part of the Destination Earth programme of the EU, and there are some other European projects that can be considered for integration, like the European Open Science Cloud, etc. We also plan to contribute more code to the OpenStack community, and to help other users who face the same problems we faced while deploying our clusters and Ceph.

This slide concludes my presentation about the European Weather Cloud infrastructure built with open source. I think I covered the whole journey, from the beginning to the current day and what we plan to do in the future. Thank you very much for attending, and I'll be happy to take your questions. Thank you.