Good afternoon everybody, and welcome to our presentation, Monitoring as a Service in the HPC Cloud. I'm Darrell Weaver, Head of Cloud Engineering at Verne Global, and my colleague is Stig Telfer, CTO at StackHPC. Just a quick introduction to Verne Global and our product HPC Direct. Verne Global operates a data center in Iceland, and because we're located in Iceland, it's powered by renewable energy: geothermal energy and hydroelectric power. Because of the temperature of the air, we also use filtered air cooling rather than powered air conditioning, so the carbon footprint of the data center is almost zero. We have a product called HPC Direct, which is aimed at HPC workloads. HPC workloads span a number of different sectors, including scientific computing, weather and climate analysis, machine learning, bioinformatics, et cetera. We have a term we use to describe an HPC cluster's characteristics: true HPC. True HPC has a long list of characteristics, but primarily it means bare metal with dedicated hardware and low-latency connections between servers, which usually means being on the same switch. Job schedulers are usually included in order to make maximum use of the hardware, high-performance storage is included, and there are even licensing tools to manage certain licensed software. All of this is included as part of our product, HPC Direct, and as part of that we've recently added monitoring as a service as well. Our HPC Direct cloud is based on a bare metal cloud service, provided by OpenStack Ironic. We use StackHPC's tool Kayobe to deploy OpenStack and manage it over its life cycle, and Kayobe in turn leverages Kolla Ansible to do the actual deployment of the containers and services. We've found in our experience that this gives us smooth upgrade paths between different versions, and we're currently in production on the Queens release of OpenStack.
Now on top of that, we've also built a customer portal, because we don't expect most HPC customers to learn OpenStack to manage their workloads. Our customer portal is focused on ease of use and on what we call blueprints. A blueprint is a one-click deployment of a complete cluster configuration: it details the number of nodes, the different types of nodes (login nodes and compute nodes, for example), and the flavors of those nodes. On top of that we also include the actual HPC workloads we're going to deploy onto those nodes, using Ansible roles. Recently we've added tenant workload monitoring, and that of course uses Monasca. So I'll hand over to Stig to talk about Monasca. Thanks, Darrell. This is Monasca. The diagram is pretty daunting, but if you look into the details, what it is is a series of best-of-breed services glued together with some smart logic. That logic adds value in order to provide higher-level capabilities: multi-tenancy for users, the ability to push in arbitrary metrics from user applications, managing metrics alongside logs, transforming logs into metrics, and a whole load of other compelling features which you won't see anywhere else in the monitoring space. The way that fits into the deployments on HPC Direct is that when the portal is asked to launch a new blueprint, OpenStack is instructed to create the infrastructure. And as Darrell says, once the infrastructure is created, Ansible playbooks then pave the infrastructure to form the platform, which is customized and tailored to the client's requirements.
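The blueprint idea described above could be sketched in Python roughly as follows. This is purely illustrative: the class names, field names and flavor strings are hypothetical, not Verne Global's actual portal schema.

```python
from dataclasses import dataclass, field

@dataclass
class NodeGroup:
    role: str    # e.g. "login" or "compute"
    flavor: str  # an OpenStack flavor name
    count: int   # how many nodes of this group to create

@dataclass
class Blueprint:
    """One-click cluster description: node groups plus the Ansible
    roles that lay the HPC software down on top of them."""
    name: str
    node_groups: list
    ansible_roles: list = field(default_factory=list)

    def total_nodes(self):
        return sum(g.count for g in self.node_groups)

# A hypothetical cluster: one login node, eight compute nodes,
# with a workload manager and parallel-file-system client applied.
cluster = Blueprint(
    name="demo-cluster",
    node_groups=[NodeGroup("login", "baremetal.small", 1),
                 NodeGroup("compute", "baremetal.large", 8)],
    ansible_roles=["slurm", "beegfs-client"],
)
print(cluster.total_nodes())  # 9
```

A structure like this is what "one click" resolves to: the portal hands the node groups to OpenStack for provisioning and the role list to Ansible for configuration.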
That could include things like workload managers, high-performance parallel file systems and InfiniBand configuration, but it also embeds within the deployed instances the ability to automatically collect monitoring, logging and telemetry information, post that off to the user's project within Monasca, retain it there, and provide it back to the user as a service in terms of dashboards, logging and so on. So instead of having users create their own logging and monitoring solutions, having to design those and then pay for their operation, Monasca enables HPC Direct to provide these things as a value-added service to the clients, and to maintain a very strong configuration for gathering high performance computing telemetry and providing it back to the users for free. Looking further ahead, this is pretty good, but we don't need to stop there, because we have a longer view: the idea that we can create a holistic performance analysis solution. The idea here is that we gather performance data and telemetry from all levels of the infrastructure: from the network switches at the physical layer and the storage appliances, all the way through the operating systems of the servers and the workload environment, and the applications themselves can generate their own custom telemetry in the context of whatever the application is doing. We can then provide these very rich sources of information and view the telemetry from one domain in the context of another. This is how we can provide telemetry which is really insightful and really rich, and which enables the users of HPC Direct to understand why their application is performing the way it is and how they could make it faster. Thank you. So let's actually have a look at some use cases. One of our customers who is on our platform is called Satavia.
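To make the "arbitrary metrics from user applications" point concrete: Monasca's metrics API accepts a metric as JSON with a name, key/value dimensions, a millisecond timestamp and a numeric value (posted to the API with a Keystone token scoping it to the user's project). Below is a minimal sketch of how an application might shape such a payload before sending it; the metric name and dimension values are made up for illustration, and the actual HTTP call is omitted.

```python
import time

def make_metric(name, value, dimensions=None, timestamp_ms=None):
    """Build a metric payload in the general shape Monasca's metrics
    endpoint expects: name, dimensions, millisecond timestamp, value."""
    return {
        "name": name,
        "dimensions": dimensions or {},
        "timestamp": timestamp_ms if timestamp_ms is not None
                     else int(time.time() * 1000),
        "value": float(value),
    }

# A hypothetical application-level metric, scoped to one host
# via dimensions so it can be filtered per node in Grafana.
m = make_metric("app.jobs_completed", 42,
                dimensions={"hostname": "compute-03",
                            "workload": "weather-run"})
print(m["name"], m["value"])  # app.jobs_completed 42.0
```

Because the payload lands in the tenant's own project, it appears alongside the system-level CPU, memory and network metrics that the deployed instances already collect.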
And Satavia provide expert analysis for the aviation industry; an intelligent fleet and weather analysis service is how they describe it. They basically advise on flight operations and maintenance for aircraft fleets. An example of that is that they determine whether aircraft need to burn extra fuel to remove ice crystals from the engines before they land. Sometimes they do, sometimes they don't, depending on the weather, and Satavia provide that analysis in a timely manner so that aircraft operators can make that decision based on the latest data. So here we have an example of monitoring of the workload for Satavia's computation runs. On the top row we've got the CPU usage, so you can see user, system and the CPU frequency. On the middle row we've got available memory, then we've got the local storage used, and on the bottom we've got the network traffic output. Now obviously you can drill down into this: it's a Grafana front end, so you can isolate it per host, click on data points and get some pretty detailed analysis through that. We also have BeeGFS on their cluster, and that runs over an InfiniBand interface for very low latency and fast storage. Here we can see an example of the network traffic that is generated when they are storing large volumes of data onto that BeeGFS shared file system. Now one of the things that we found extremely useful about providing the tenant monitoring is that we had a situation where Satavia were running a test of about 10 nodes, and we were looking at the CPU usage, or more importantly the customer was looking at the CPU usage, for the performance of their jobs. What we actually found was that they were seeing a drop in performance of their CPUs on just a couple of the nodes for no apparent reason. That prompted us to delve into it in more detail, and we found that there was actually a bug in the hardware at the IPMI interface layer.
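A drop like the one seen on those couple of nodes can be caught by comparing each host's reported CPU frequency against the rest of the fleet. This is a minimal sketch of that idea, not our actual alerting logic; the hostnames and frequencies are invented.

```python
from statistics import median

def flag_throttled(freq_mhz_by_host, tolerance=0.9):
    """Return hosts whose CPU frequency is below `tolerance` times the
    fleet median -- the symptom a thermally throttled node shows while
    its peers run at full speed."""
    med = median(freq_mhz_by_host.values())
    return sorted(host for host, freq in freq_mhz_by_host.items()
                  if freq < tolerance * med)

# Invented sample: two nodes running well below the others.
freqs = {"node01": 2900, "node02": 2880, "node03": 2100,
         "node04": 2910, "node05": 1950}
print(flag_throttled(freqs))  # ['node03', 'node05']
```

With per-host frequency metrics already in Monasca, a check like this can run against the same data the tenant sees in Grafana.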
What we weren't able to see was that some fans weren't working properly. What happened was that the CPUs were overheating, and the hardware was dropping the frequency of the CPUs to protect them, and therefore the jobs were running slower. So this is a really good troubleshooting and optimization tool for workloads on our HPC cloud. So thank you very much for listening. As I said, I'm Darrell Weaver, and this is Stig Telfer. Thank you for listening to our presentation.