And I have Matt Pryor here. We are from a company called StackHPC, and today we're going to talk about what we do with compute platforms: platform as a service, as opposed to infrastructure as a service, particularly for scientific computing use cases. I don't think anything we're going to cover is particularly specific to that domain; it's just the domain in which we work. We have been working together in our company on an open source project called Azimuth, which was originally developed by a lab in the UK, and we have been working to develop that system further and make it into a very nice self-service portal for various types of compute platform. We're going to talk about some of the particular use cases, and also a little bit about some of the really cool things in it that I think are quite distinctive and certainly worth a little extra time. And hopefully we'll have a little bit of time for questions at the end. You probably won't have heard of our company unless you're in the niche that we work in. We're about 25 people based in the UK, Poland, and France, and the company was formed about six years ago to work in this space of public scientific cloud infrastructure. We've been doing it for six years now, and I think it's been going very well so far. Pretty much everything we do is open source and developed in an open community, and Azimuth is no different. So you might think: why would we do this? Why would we need to? And part of the reason is this concept of dynamic high performance computing. This is what some people call the next generation of high performance computing, or of research computing in general.
And the idea is that we start to use OpenStack infrastructure, software-defined infrastructure, to compose compute platforms for the scientists and end users that we serve, simply to give a little more flexibility in the kinds of platforms we can offer. Historically, providers of research computing services, like universities and institutions, tended to serve a fairly fixed range of functions, such as batch-queued environments or general high performance computing runtime systems. That's no longer true, because of things like machine learning and the explosion of life sciences and other faculties within research institutions that need completely different platforms. Institutions have to provide this kind of flexibility. So you get this idea that on top of our software-defined infrastructure, which is where things like OpenStack and Ceph come in, we start to provide a rich ecosystem of science platforms and actual software on top of those. And we usually have to do that with the same people, so an HPC services group or a computing services group now gets stretched really far. It might look something like this, and we can see straight away that there is a lot of trouble coming, because we have a huge explosion in the breadth of environments we might want to provide. We might need to provide a Kubernetes environment and a Slurm environment; that's standard now, and people will always expect it within their OpenStack infrastructure when they're looking at cloud computing. So how do we provide all of these different environments on a single common infrastructure substrate without burning out the people in the team? This is where we have to bring in cloud native automation techniques, and this is the kind of system that Azimuth provides for us.
So I think in particular we're going to be focusing on this piece: Azimuth's jurisdiction is limited to this red box here. And I'm going to hand over to Matt now to talk about some of the other motivations. Cool, thanks, Stig. So what we want to do is basically get researchers doing their science as quickly as we can. But researchers are asking for different kinds of platforms than they used to. They want to use the best platform for their workflow, but they don't want to be a sysadmin, they don't want to be a networking expert, and they don't want to be waiting. They also don't want to sacrifice the performance that comes with a specialized system. And as an operator, what we want is to not incur the heavy support burden of having to manually deploy these systems every time, while still keeping our infrastructure up to date, or at least helping our users keep their own infrastructure up to date by making it easy. So we came to the conclusion that the ideal solution is more opinionated than infrastructure as a service but less opinionated than platform as a service. We want to be flexible and dynamic, and to be able to offer these platforms on demand. I actually used to work for a facility called JASMIN before I worked for StackHPC. They have an OpenStack community cloud where they've successfully established this pattern of on-demand platforms, and we've been building on that work. So what Azimuth provides is a simple self-service portal: the idea is that you're able to give researchers direct access to this portal to manage their own cloud resources. It has platforms as a first-class citizen, but the available platforms are curated by the cloud operator, so the operator retains some control. And these platforms are optimized for HPC and AI use cases in the deployments we have done so far.
But like Stig said, there's nothing particularly HPC-specific about the system. Access to these platforms is streamlined using an application proxy that we've developed called Zenith. This allows us to expose services without consuming a whole load of floating IPs, which are quite often a limited resource in the clouds that we work on. We're currently targeting OpenStack clouds; we'd like to target other clouds in hybrid scenarios and things like that. And the development we've done at StackHPC has been funded by the IRIS collaboration so far. Like we've both said, the overall aim is to reduce the time to science, and to reduce the operational burden of onboarding and supporting users for the operator as well. So I'm just going to walk through a few of the use cases that we've got. The first is the one we like to call Big Laptop. A researcher has some code they've developed on their laptop, and they need access to a machine with specialist hardware; they just want to get to that machine really easily. So we've really stripped down the machine creation dialogue compared to Horizon: literally just a name, an image, and a size. We've also added an option to enable web console access, which isn't VNC. What that does is: they create a machine, it eventually becomes available, and you get access to this option in the dropdown. Note that there's no external or floating IP associated with any of this. Then you click on that link and eventually the web console becomes available. You'll notice it has a funky domain; this is a Zenith application proxy endpoint, and Zenith will also handle things like TLS termination for us, so that's nice. And then eventually the user gets into their remote desktop. This is a web desktop, and they can also have access to a web console, on a machine that has their GPU or whatever it is they wanted. So how did we do that?
So there's nothing special about Azimuth here: it's just an OpenStack client that authenticates with the credentials the user gives it. We provision machines with some extra metadata, and we ship a bunch of images that know how to respond to that metadata in different ways. One of those things is to start up a Guacamole web console and then register the machine with Zenith. By the time it's done all that, the user gets to access their machine through the web browser, without consuming a floating IP, using the application proxy. The next use case we've targeted is the dedicated Slurm cluster. An example use case here: maybe the site does have a big HPC system, but the queue times are long, or they need some specialist packages, or their work is real-time and they don't want to wait in a queue. But they do want a batch cluster to execute an MPI code or something, and they have a cloud allocation they want to make use of. So we have this system, which we call cluster-as-a-service. Slurm is the only appliance that we, as StackHPC, maintain at the moment, but it's actually just Ansible playbooks and Terraform, so it can support different cluster types; we only have Slurm at the moment. Once they've selected their cluster type, they get to set some options and click Create Cluster; it goes away, makes the cluster, makes a bunch of machines, and eventually the cluster becomes ready. You get access to the cluster details dialogue, which tells you how to access the cluster via SSH, if that's what you want to do. There are also some services on the Slurm clusters that are exposed, again using Zenith. This is Open OnDemand, for people who've seen it before: an interactive web interface for managing Slurm jobs. And we also present some cluster monitoring, so people can see what resources their jobs are using. So yeah, again, how did we do that?
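The "images react to instance metadata" pattern for the Big Laptop flow can be sketched as follows. This is purely illustrative: the metadata keys and the helper are invented for this sketch and are not Azimuth's actual schema.

```python
# Hypothetical sketch of building the extra metadata that Azimuth-style
# images would read on first boot. Key names here are made up for
# illustration; the real schema may differ.

def build_server_metadata(enable_web_console, zenith_registrar_url):
    """Build instance metadata for a platform image to act on at boot."""
    metadata = {"azimuth_platform": "workstation"}  # hypothetical key
    if enable_web_console:
        # The image's boot scripts would see these, start the Guacamole
        # web console, and register the local service with Zenith.
        metadata["web_console_enabled"] = "1"
        metadata["zenith_registrar_url"] = zenith_registrar_url
    return metadata

md = build_server_metadata(True, "https://registrar.example.org")
```

In a real deployment this dictionary would be passed as server metadata at creation time (for example via the OpenStack SDK or CLI), and the image, not the portal, does all the work from there.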
So we're using a bunch of open source technologies plugged together with a little bit of code, basically. We're using Ansible, driven by AWX, which is the open source version of Tower. We're using Terraform, with the state stored in Consul. Our Slurm distribution is OpenHPC, we're using Open OnDemand, and the metrics are done with Prometheus and Grafana, and Elasticsearch as well actually, which isn't on this slide. The way it works is that Azimuth creates inventories and jobs in AWX. That invokes an Ansible playbook, which invokes Terraform, which deploys the infrastructure. The infrastructure is then adopted into the Ansible inventory, and then we configure it as a Slurm cluster. And like I said, the user can either access this via SSH, or they can use Open OnDemand via the Zenith application proxy. The third case we've targeted is applications on Kubernetes. The big one for us so far has been JupyterHub; that's what everyone wants now. So we wanted to make provisioning Kubernetes clusters easy. The example use case here is a project that wants to use Jupyter notebooks for their data visualization. They've got a cloud quota, and they want to provision a JupyterHub. Maybe they want to use Dask, maybe they want to mount some specialized storage, those kinds of things. So again we present a simplified dialogue here. There's a concept of a template, and the template defines the Kubernetes version, but it can also define other specialist things about the cluster. For example, if you want SR-IOV networking, that would be part of your template. These templates are defined by the operator, so you can do things like tagging: say you wanted an SR-IOV network to be attached, you can tag a network in the project as providing SR-IOV, and your template can say, use the network that's tagged for SR-IOV for this cluster.
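The tag-matching idea just described can be sketched in a few lines. The network shape and the "sriov" tag name below are assumptions for illustration, not Azimuth's real data model:

```python
# Illustrative sketch: given the networks visible in a project, a template
# can select the one carrying a particular operator-defined tag.

def pick_tagged_network(networks, tag):
    """Return the first project network carrying the given tag, or None."""
    for net in networks:
        if tag in net.get("tags", []):
            return net
    return None

networks = [
    {"name": "default", "tags": []},
    {"name": "highspeed", "tags": ["sriov"]},  # tagged by the operator
]
chosen = pick_tagged_network(networks, "sriov")
```

The point of the design is that the user never sees this selection happen: the operator tags the network once, the template encodes the lookup, and every cluster built from that template gets the right attachment.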
And so we can do high performance networking in these clusters without the user really having to worry about it, which is nice. You can have one or more node groups, and the node groups can be auto-scaling. There are some add-ons as well. Then you click Create Cluster, it goes away, makes the cluster, and that makes a bunch of machines; they all have funky names. Eventually the cluster becomes ready, and you can easily get hold of the kubeconfig: copy it to the clipboard, download it, whatever you want to do. One thing to note here is that the API endpoint is actually also a Zenith endpoint, so we're not needing to consume a floating IP even for the API server. In the cluster details page you get, obviously, a bunch of information about the state of the cluster, but you also get a bunch of services accessible, again via Zenith; you can probably see a pattern forming here. These services are things like the Kubernetes dashboard and Grafana monitoring. And the third thing we're making available is the Kubeapps application dashboard. This comes from the Tanzu project, and what it is, basically, is a user interface around Helm charts. So we can present a bunch of Helm charts to our users. We have a repository of Helm charts that we include by default, which have Zenith support integrated into them. The user can pick a Helm chart and configure it using a nice little form; metadata goes with the Helm chart that defines what the form looks like. Then they click Deploy, it goes away and deploys the thing, and eventually you get, again, another Zenith endpoint, and your JupyterHub comes up.
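The "metadata drives the form" idea can be sketched like this. The schema format below is invented for illustration; it is not Kubeapps' actual schema (Kubeapps reads a JSON Schema shipped with the chart), but it shows the shape of the mechanism:

```python
# Hypothetical sketch: chart metadata declares the form fields; user input
# is merged with defaults to produce the values passed to the Helm release.

FORM_SCHEMA = [
    {"name": "notebook_image", "default": "jupyter/scipy-notebook", "required": False},
    {"name": "storage_size", "default": None, "required": True},
]

def collect_values(schema, user_input):
    """Merge user input with defaults, failing on missing required fields."""
    values = {}
    for field in schema:
        value = user_input.get(field["name"], field["default"])
        if field["required"] and value is None:
            raise ValueError("missing required field: " + field["name"])
        values[field["name"]] = value
    return values

vals = collect_values(FORM_SCHEMA, {"storage_size": "10Gi"})
```

Because the schema travels with the chart, the operator can add a new application to the catalogue without touching the portal itself.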
The nice thing here, actually, is that Zenith will also pass the OpenStack username of the authenticated user in the remote user header, and we've configured JupyterHub to understand that header. So each OpenStack user actually gets authenticated as themselves when they get to JupyterHub, and they get a separate notebook server, which is nice. This is just demonstrating that it mounted the Lustre file system that we had, and then this is a noddy, expensive computation that I was using just to demonstrate autoscaling. There is a video here; I don't know how much time we've got. Enough. Can you press play on it then? Thank you. So this is just Dask. It starts off with one worker, and then we should see the autoscaler push the cluster out by spawning new machines. Eventually those machines become ready; the kubelet has to start up on them, they have to be registered with OpenStack and have their initialization done by the OpenStack cloud provider, but they then become available as additional workers for Dask, and you can see the computation just speeds up. Once the computation is over, the autoscaler is smart enough to understand that it can scale those nodes back down again after a configurable timeout; by default it's something like ten minutes. Oh, we've done that. So again, how did we do that? Well, we're not actually using Magnum to do Kubernetes. We're using a project called Cluster API, which is like Kubernetes clusters inside Kubernetes clusters. What you do is create Kubernetes resources inside a management cluster which define your workload clusters. It uses all the standard tooling from the Kubernetes ecosystem, which is not something that Magnum does at the moment; so things like kubeadm are used to deploy the cluster. And the other thing it does is use an immutable infrastructure approach for upgrades.
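That immutable, rolling-replacement approach can be sketched as a toy model. This is not Cluster API code, just an illustration of the invariant: nodes are never upgraded in place, each one is swapped for a freshly built replacement, and only one node is in flux at a time.

```python
# Toy model of an immutable rolling upgrade: each step replaces exactly one
# node with a new node built from the new image, yielding the cluster state
# after that step.

def rolling_replace(nodes, new_image):
    """Yield the cluster state after each single-node replacement."""
    upgraded = list(nodes)
    for i, node in enumerate(nodes):
        upgraded[i] = {"name": node["name"], "image": new_image}
        yield list(upgraded)

cluster = [{"name": "worker-%d" % i, "image": "v1.24"} for i in range(3)]
states = list(rolling_replace(cluster, "v1.25"))
```

After the first step, one node runs the new image and the rest still run the old one; after the last step, the whole cluster is on the new image without ever having mutated a running node.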
So upgrades are done by deleting nodes and replacing them with new ones, one by one, in a rolling process. By default we have monitoring and logging with Prometheus and Grafana; we're using Loki for log collection, and there's a Grafana dashboard that shows the Loki logs for pods. We're using, like I said, the Kubeapps application dashboard, which comes from Tanzu, and all these dashboards are available via Zenith. And like I said, we have support for GPUs, high-performance networking and things like that; these are the kinds of use cases this can facilitate transparently to the user as well. The key to a lot of the ease of use here is Zenith, the application proxy that we've developed. There's only one slide on it here; I keep promising a blog post, which I should definitely write. What Zenith is, is a tunnelling HTTPS proxy. It allows us to expose services that are behind a NAT or a firewall without consuming floating IPs, and the exposed services only need to be bound to localhost. In fact, in some cases we can do better than that and bind them only inside a pod, for example for the Kubernetes services. It performs TLS termination for the proxied services, and it also provides authentication and project-level authorization for them as well. This is all built using industry-standard tooling with a small bit of glue code: it's basically just SSH port forwarding. We've got a locked-down sshd server that only allows these tunnels to be established. It runs inside Kubernetes, makes heavy use of the NGINX Ingress Controller, and we use Consul to do some of the service discovery. But there will be a blog post about this soon. So, summary and future plans: just a few things about where we're going with this. We hope this demonstrates that we're trying to lower the barrier for researchers to get their science done.
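The SSH port forwarding underneath Zenith can be sketched as the argv you would hand to a subprocess. This is not Zenith's real client, just the plain OpenSSH reverse forward the talk describes; the host, port, and user below are placeholders.

```python
# Sketch of the client side of a Zenith-style tunnel: ask a (locked-down)
# remote sshd to allocate a port and forward it back to a service that is
# bound only to localhost on this machine.

def reverse_tunnel_cmd(sshd_host, sshd_port, local_port):
    """Build an OpenSSH argv for a reverse port forward."""
    return [
        "ssh", "-N",                         # no remote command, just forward
        "-p", str(sshd_port),
        "-R", "0:127.0.0.1:%d" % local_port,  # 0 = server picks the port
        "zenith@%s" % sshd_host,              # placeholder user and host
    ]

cmd = reverse_tunnel_cmd("proxy.example.org", 2222, 8080)
```

In the real system the proxy side then wires that server-allocated port into the ingress layer, which is where the TLS termination, authentication, and the funky per-service domains come from.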
And we've got this deployed in a few places, where we're working with the research teams there to make sure that we're supporting their use cases effectively. One of the main plans we have is a first-class representation of applications directly in Azimuth. For the piece where we're using Kubeapps: we used Kubeapps because it got us where we wanted to go quickly, but we're realizing that it doesn't quite fit all of the use cases that we need. By having a first-class representation of apps in Azimuth, we can do things like integrate intelligently with datasets and storage systems, which is one thing our users want. They'll be able to say, for example, I want a JupyterHub that has these two datasets available as mounts inside my pods; that's the kind of thing we're looking to target. We also want to do the same seamless integration of accelerated hardware that we do in Kubernetes for the Slurm clusters, which we can't quite do at the moment: so GPUs, SR-IOV and RDMA, those kinds of things. There's a bit of operational hardening of the platform itself that needs doing, so disaster recovery, monitoring, all the boring stuff, but the stuff which is critical if you're running it in production, right? And, like I said before, we have an ambition to do some hybrid and public cloud work, and there are loads and loads more ideas, but these are the main ones. And then just to say: come join us. It's all open source software, and those are our repositories. We'd love to collaborate. If you have access to, or are an operator of, an OpenStack cloud and this looks interesting to you, let us know; we're happy to chat. Bug reports and especially patches are welcome, so do get in touch with us. There's our website, and if you come see us, we can give you more details. Just to recap on a couple of things there.
On the user experience: users log into their platforms through a web portal. They authenticate once to their university's identity provider or an OpenID Connect source, and from that single point of interface they can access all of their compute resources, which are otherwise not exposed on any public IPs by default. The operators get to avoid having an industry of tickets around creating new compute platforms for user requests; the self-service is defined by a series of Ansible repositories that are used to deploy the clusters, which are provided as an application catalogue to the users. And perhaps from the security officer's point of view, there is only one point of ingress into the system, which is through the Zenith application proxy, via Azimuth as well. So we think it brings a lot of useful features to a system. From our experience, this level of self-service, with a steady hand over it from the admin's point of view, prompting for maintenance and enabling the users to keep their deployments up to date, and more importantly off the public internet, these are all quite strong points in favour of the way this thing works. Cool. So I think that's all we wanted to say, and I hope we've got a bit of time for questions if anyone's got any. Mr Heikinen. Yeah. Okay, so I'm paraphrasing a bit, but the question, I think, was basically: how easy is it to add another cluster type if you've got pre-existing Ansible playbooks for deploying a system? Is that right? Yeah. So the way this works is that the Ansible playbooks live in one or more Git repositories. These get added as projects in AWX, the individual playbooks get added as job templates, and then Azimuth is able to query the available cluster templates using the AWX API; you can add as many of those templates as you like. There are some small constraints on how the playbooks operate, because they also have to be able to interact with...
There's a layer, which we've written, that has to be able to interact with Terraform to provision the infrastructure, but your Terraform outputs from that stage have to have a particular shape for our code to be able to adopt them into the Ansible inventory; from then on it can be any Ansible that you like. Yeah, there's a minimal set of assumptions about inputs and outputs. There's also the metadata for the form: when a user wants to create a cluster, something has to provide the list of questions and parameters for that form. But otherwise, yes, it's intended to be extensible. Yeah, so the bit Stig was talking about there is... it's probably easier to find it on here, actually. Go for it. So this form here... this form is customizable on a per-cluster-type basis, and the way the content of the form is defined is using a YAML file that lives in the same Git repository as the playbook. So, yep. The question there was: are we planning to do first-class support for bare metal for the batch clusters? And we could equally do it for Kubernetes as well. Yeah, the answer is yes. Really, the integrations between the platform services that Azimuth provides and the infrastructure beneath them, whether the nodes are bare metal or not, are wrapped up inside the infrastructure anyway. So there isn't any special consideration that I'm aware of, other than that it takes a little while longer to provision bare metal nodes in these systems. The idea is that if you selected a bare metal flavour for your workers, you would get bare metal hosts. Yeah, so... yeah, cool. Okay, thank you very much. Thank you.