I hope you all had a nice breakfast and things. Excellent. So I'm John Garbutt, I'm a principal engineer at StackHPC, and I'm here with Matt Pryor, who also works with me at StackHPC. Today we're going to talk to you about self-service LOKI applications for non-technical users. Hopefully you heard a little bit about that in the keynote; we're going to go into a bit more detail on the Azimuth side of things.

So, a quick introduction to StackHPC and who we are. We formed in 2016 and we're based in Bristol, although ironically neither of us is actually in Bristol: I'm over in Cambridge, we have other people around the UK, and we've got colleagues in France and Poland. Really we've been looking at how to bring scientific computing, cloud and HPC together and make it all work. And the way that we work at StackHPC is always open. We're a big supporter of the Four Opens, doing most of our work Apache-licensed, and a lot of it within the OpenStack community and related communities.

So what do we really do? This isn't just an excuse to have a picture of Lego, though that is cool. We work with our customers to look at the kind of infrastructure they need, help them pick the right components, and figure out how to put all these things together to get the most from the investment — the hardware, the infrastructure, the people. How do we get the best out of all of that?

You've probably seen these three pictures before; this is a slightly different take on them, and I think it's an interesting way of looking at what we do with LOKI at StackHPC — that's Linux, OpenStack and Kubernetes infrastructure. The first picture is a reconfigurable conference room. If you plan ahead, you can build a system that can be one big supercomputer, or lots of little pieces doing completely different things for different people. But you can't do that if you don't plan ahead — if you don't build it that way in the first place. So there's a lot of work to make that infrastructure dynamic and get the most out of the reconfigurability we get from LOKI.

In addition, this is all done with isolation. I particularly like this picture because, for those of you who've been to PTGs and other conferences, I think it's a good analogy for the noisy-neighbour problem — like when people are having an argument or a design discussion in the other room. You have to be careful about how you get that isolation working.

Then there's the HPC in "high performance": to get the most value out of the infrastructure, often we need to really optimise performance. I've put a picture of the Red Arrows here because, like a display team, you need all the pieces working together to get really good performance — let me just get rid of that... bugger off, App Store. Excellent. That's the great thing about your laptop still being on UK time: the updates arrive in the middle of the day here. Anyway, yes: high performance gets you the most value.

Now, all of this isn't very useful if people can't actually use the thing. How do we cookie-cutter out these optimised stacks? How do we make it easy for people to just walk up and use what they need? And that's where Azimuth comes in: it gives you a catalogue of these things that you can pick and deploy.
And you do that in a way that uses all the optimisations for the local infrastructure, so all of that good practice is baked in and made available to people. We've been doing this for a while at StackHPC: creating reference platforms with people, creating Slurm clusters, all of these things. But in the last couple of years it's reached a level of maturity where you can actually self-service these things — press the button and expect it to work every time.

We haven't gone on this journey alone. I said we work open-first at StackHPC, and that's definitely true; a big part of that is co-development with our customers. Azimuth started at JASMIN, and in many ways it also started with Matt. So I'm going to hand over to Matt, who's going to take you through some of the history of Azimuth and then go into the details.

Cool. Thank you, John. So yes, I was the original developer of the JASMIN Cloud Portal, which was the forerunner to what is now Azimuth. I moved from JASMIN to StackHPC a couple of years ago and we've carried on developing it with all these people and others as well.

But really, we wanted to talk about why we're doing this. There's a legacy way of deploying applications: users used to put apps on their laptops. The more things you put on your laptop, the more likely you are to run into dependency hell, and it becomes a massive support burden on your IT department. Then — especially with research workloads, which is what we primarily work with — some applications require specialist hardware like GPUs or network accelerators. And what happens if your application needs a Kubernetes cluster, or you want a Slurm cluster? These are really complicated beasts to deploy, and if you leave it up to users, they'll probably get something wrong and you'll get a support request.

So, like John alluded to, OpenStack potentially gives us a better way. It gives us this shared, configurable hardware pool, and it gives us APIs that we can use to manipulate it (there's a sketch of what that looks like below). Then we can build our application stacks on top of that, and spanning across all of it is the cloud-native automation tooling — the DevOps toolkit in this slide. But there are a load of options here, so how do you tame this complexity? This bit is OpenStack, roughly speaking — there are obviously more components than that — but this is the space Azimuth is trying to fill. We're trying to do Kubernetes and Slurm, but with a focus on the applications on top of that.

So, like I said, with Azimuth we're trying to tame this complexity and make sure the abstractions are in place so that people can do the things they are good at. That means users self-service the applications they need, but they don't need to know how to install those applications; they just deploy the thing and get on with their job. The platform engineers, who understand good DevOps practices, get to maintain a catalogue of optimised applications that include — like John said — optimisations for their site. And that lets us share good DevOps practice. All of this is enabled by the LOKI APIs and the cloud-native automation tools that we've all grown to love.
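To make that concrete, here is a minimal sketch of the kind of OpenStack API call this automation drives, using the openstacksdk Python client. The cloud name, image, flavor and network are all placeholders, and this is an illustration of the general pattern, not Azimuth's actual code:

```python
import openstack

# Connect using credentials from a clouds.yaml entry (placeholder name).
conn = openstack.connect(cloud="my-cloud")

# Look up an image and flavor -- values here are illustrative placeholders.
image = conn.compute.find_image("ubuntu-22.04")
flavor = conn.compute.find_flavor("m1.large")
network = conn.network.find_network("project-net")

# Carve a workstation out of the shared hardware pool.
server = conn.compute.create_server(
    name="workstation-demo",
    image_id=image.id,
    flavor_id=flavor.id,
    networks=[{"uuid": network.id}],
)

# Wait until the server is ACTIVE before handing it over to
# configuration tooling such as Ansible.
server = conn.compute.wait_for_server(server)
print(server.status)
```

Everything Azimuth-style tooling does — workstations, Slurm clusters, Kubernetes clusters — ultimately bottoms out in calls like these against the shared pool.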
The benefit of this is that your users become more productive and the support burden on your IT department is reduced, so we end up extracting the most value from the hardware and from the people.

So what does this look like in Azimuth? This is the Azimuth user interface — this is the deployment at JASMIN. You sign in with your OpenStack credentials (we support the federated flow as well), and this is the page you land on: you pick a platform from a catalogue and configure it. Here I've picked a Linux workstation, so I just pick the size and the amount of storage I want attached to it. Then it goes away, provisions it, and eventually it becomes ready. Each platform gets a details dialogue, and this text is all populated by the appliance itself. These services then become available for the user to click on, and they take you through to your web console.

You'll notice up here — I don't know if the laser pointer works — we have a sort of random domain, and this is how we expose applications to users. We have an application proxy that tunnels, so we don't consume any floating IPs to do any of this. The applications punch out, get assigned one of these domains, and the traffic goes back down the tunnel to the application. That's what the Zenith application proxy does — and here's the happy user. (There's a sketch of the basic tunnelling idea at the end of this section.)

We take advantage of Ingress in Kubernetes to do this: Ingress gives us a really simple way to get a dynamically reconfigurable HTTP proxy. The Zenith server is basically a heavily customised SSHD server, the Zenith client is a heavily customised SSH client, and there's a proxy application underneath. It's just a bit of glue code around industry-standard software: OpenSSH, Kubernetes, NGINX, Consul. The nice thing about doing it this way is that it lets us run our applications in private networks, behind a firewall and/or NAT. There's a way of pre-establishing trust for the clients, so you have to be pre-authenticated before you can start a client connection, and at the top, in the Ingress controller, we can enforce SSO and TLS. So that's what we do.

There are two kinds of applications you can make available through this catalogue. At StackHPC we provide some reference applications, which you saw on that catalogue slide, but anything expressible in one of the two ways I'm about to show you can be made available as an appliance in Azimuth.

The first stack we have — and this is how we do the Linux workstation and Slurm — is a Packer, Terraform and Ansible stack. This can do single-machine or clustered applications. We punch out the infrastructure with Terraform, then we adopt those machines into an Ansible inventory and configure them with Ansible. If you want to speed up provisioning, which we do with our Azimuth appliances, you can pre-build images using Packer with all the dependencies on; we do that for Slurm, for instance. Then any web applications you want to run — monitoring with Grafana, Open OnDemand for Slurm, or the Guacamole web interface we just saw — are exposed using the Zenith application proxy. The form you present to the user is customised using some metadata that lives with your appliance code. We've got a little sample appliance here.
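Zenith itself is purpose-built, but the underlying trick it layers on is the standard SSH reverse tunnel. Here's a minimal sketch of just that concept using plain OpenSSH from Python — not Zenith's actual code; the host name, port and user are made up:

```python
import subprocess

# The application listens only on localhost inside a private network.
LOCAL_APP = "localhost:8080"

# A public proxy host stands in for the Zenith server (hypothetical name).
PROXY_HOST = "proxy.example.com"

# Punch an outbound connection through the firewall/NAT and ask the proxy
# to forward its port 10001 back down the tunnel to the local application.
# Zenith layers dynamic subdomain allocation, pre-authenticated clients,
# SSO and TLS on top of this basic mechanism via a Kubernetes Ingress.
subprocess.run([
    "ssh", "-N",
    "-R", f"10001:{LOCAL_APP}",
    f"tunnel@{PROXY_HOST}",
])
```

The key property is that only outbound connectivity is required from the private network, which is why no floating IPs are consumed per application.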
Each appliance lives in its own Git repository, with its own version control. You just tell Azimuth where that Git repository is and which version you want to use, and it will pick it up.

The second kind of apps that we support are apps that deploy on top of Kubernetes. Firstly, we support deploying Kubernetes itself, and to do that we use Cluster API. I don't know if anyone was in the Magnum talk yesterday, but we spoke about Cluster API at length there — it's basically using Kubernetes to deploy more Kubernetes clusters, and it works really nicely. Then, on top of that, we deploy applications using Helm charts. Any application you can express in a Helm chart, you can deploy using Azimuth. And again, we have a way to expose web applications from inside the Kubernetes cluster: they only need a ClusterIP service, and we can expose them out using Zenith. In this case we generate the forms from the Helm values schema that goes with the Helm chart (there's a rough sketch of that idea below).

And the final bit of magic sauce is that we can grant access to these platforms to people who are not OpenStack users. The pattern we're seeing in the UK research domain is a big push on research software engineers — more technical people who work alongside the scientists. One of the patterns we're seeing is that the research software engineers have access to Azimuth, deploy platforms for their group, and then grant access to members of their group to the individual platforms as they need it. Another thing we're seeing is people who want to run a workshop on Jupyter Notebooks, for example: the person running the workshop deploys the Jupyter Notebook and then grants access to workshop participants without them having to have an OpenStack account.

The way that works is we've integrated with Keycloak. There's an identity provider page in the Azimuth UI that you click, and it takes you through to your Keycloak realm. Each project gets a Keycloak realm, and the people who can access Azimuth are realm admins within that realm, so they can do things like create new users or add new federations. So here's a new user, Joe Bloggs. There's a link within the Azimuth interface — a button to copy the URL — and you can send that to your users. (The slide's not moving on... there we go.) When they go to that URL, they get redirected to the Keycloak sign-in page. To start with they get "forbidden", because they need to be added to the correct group. So this is the group: you add the user to the group (sketched below), and then they can get into the web interface of whatever platform you've given them access to.

So we just wanted to run through a few of the ways our customers are using this. JASMIN is the first customer, in a way. They operate a community OpenStack cloud which is primarily used for earth sciences, and they're using our standard appliances at the moment. One of their heaviest use cases is Dask Hub — they're a Python shop, and Dask is a framework that lets you distribute Python calculations really easily. And one of their typical use cases, like I said, is running training courses using Jupyter Notebooks.
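As a rough illustration of the form-generation idea Matt described — deriving form fields from the chart's values.schema.json — here is a toy sketch. It's pure JSON Schema walking, not Azimuth's actual form engine, and the field-spec shape is invented for illustration:

```python
import json

def schema_to_fields(schema: dict, prefix: str = "") -> list[dict]:
    """Flatten a JSON Schema's properties into simple form-field specs."""
    fields = []
    for name, spec in schema.get("properties", {}).items():
        path = f"{prefix}{name}"
        if spec.get("type") == "object":
            # Recurse into nested values, e.g. jupyterhub.singleuser.memory
            fields.extend(schema_to_fields(spec, prefix=f"{path}."))
        else:
            fields.append({
                "name": path,
                "label": spec.get("title", name),
                "type": spec.get("type", "string"),
                "default": spec.get("default"),
                "help": spec.get("description", ""),
            })
    return fields

# values.schema.json ships alongside the Helm chart.
with open("values.schema.json") as f:
    print(json.dumps(schema_to_fields(json.load(f)), indent=2))
```

Because the schema travels with the chart, the platform catalogue can present a sensible form for any Helm-packaged application without per-app UI code.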
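And the "add the user to the group" step in the Keycloak flow can be done through Keycloak's admin REST API. A hedged sketch follows — the base URL, realm, group name, user and credentials are all placeholders, and error handling is omitted:

```python
import requests

BASE = "https://keycloak.example.com"   # placeholder Keycloak URL
REALM = "my-project"                    # per-project realm, as in the talk

# Obtain an admin token (password grant against the admin-cli client).
token = requests.post(
    f"{BASE}/realms/master/protocol/openid-connect/token",
    data={
        "grant_type": "password",
        "client_id": "admin-cli",
        "username": "admin",
        "password": "secret",  # placeholder credential
    },
).json()["access_token"]

headers = {"Authorization": f"Bearer {token}"}

# Look up the new user and the platform's group by name.
user = requests.get(
    f"{BASE}/admin/realms/{REALM}/users",
    params={"username": "jbloggs"}, headers=headers,
).json()[0]
group = requests.get(
    f"{BASE}/admin/realms/{REALM}/groups",
    params={"search": "workstation-users"}, headers=headers,
).json()[0]

# Adding the user to the group is what flips "forbidden" to "allowed".
requests.put(
    f"{BASE}/admin/realms/{REALM}/users/{user['id']}/groups/{group['id']}",
    headers=headers,
)
```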
And JASMIN funded the development of this external-application-user capability just so they could do that kind of thing, basically.

The second customer I wanted to talk about is CFMS — the Centre for Modelling and Simulation, I think it is. Basically they do modelling and simulation, CFD-type things. They have an OpenStack private cloud that they use to provide compute services to their clients, who can self-service modelling and simulation applications. These are often Windows-based, so CFMS actually worked with us to develop a Windows workstation appliance. They've carried on using Azimuth since we were working with them, and they've also developed an Active Directory domain appliance that deploys an Active Directory and a bunch of workstations, all connected together. It works really nicely for them. This was a Terraform and Ansible appliance which uses Windows images with RDP enabled; and because Guacamole doesn't run on Windows, we have a tiny little Linux sidecar that gets deployed as part of the same appliance. To the user it all looks like one application, and Guacamole is exposed using Zenith.

So I'm going to hand back over to John to talk about our work with Graphcore.

Thanks, Matt. I'm not going to go through what we're doing with Graphcore in too much detail, because the next session, in the room just over there, is all about what Graphcore have been doing — the details of how they're using OpenStack and how they're using Azimuth. You saw some of this in the keynote. But the core use case is really development environments: how do you quickly spin up a development environment so people can get going and get working with, in their case, specialist hardware?

This is actually a use case that came up an awful lot in e-infrastructure discussions in the UK for science research. One of the phrases that kept coming up was: how can I get a bigger laptop? My laptop keeps running out of memory and getting hot, and I need a real GPU — how can I just get one for a little bit, to test my code on the bigger system? And there's a big barrier to entry if people just go to Horizon and it asks them all sorts of questions they have no idea about. So this is where Azimuth came in.

One of the things we did to adapt Azimuth for this kind of development use case is that Guacamole access. One thing worth highlighting: if you start asking some users to give you an SSH key, very often they'll send the private key to you in an email and keep the public one — and if you're not sure what on earth is going on, that's a perfectly reasonable mistake. So how do we make that secure? That's where Azimuth comes in: there are no SSH keys. And we're not bounded by the number of public IP addresses available — you know how it goes, you go for a floating IP and someone else has nicked it, there are none left. That's why Zenith came about: to have the ability to run lots of little development platforms, get the infrastructure ready, and get to it, without getting stuck at the first hurdle because you're being asked very, very difficult questions you don't understand.
That's where all that came from, and it was a long process of giving users access to a thing and asking, does this help? They'd say: yeah, it's great, but I need a volume — I need more storage than you're giving me in this flavour. So if you saw that dialogue box asking for a volume, that's where it came from. There's another version where you actually do get a floating IP and SSH, because for some reason people want to rsync their data everywhere, so they needed a public IP — but not forever, not all the time. So that's the kind of journey we've been on.

I alluded to this problem with fixed-capacity clouds. When you log into Azimuth, we've made a rash assumption that there are resources available for you to go and create something. That's not always true. Actually, a quick survey: are there people in the room running private clouds that you could call fixed-capacity clouds? Okay, a good few. So you've probably had someone squatting on your GPUs, right? They got hold of a GPU and they're like: success, it's mine! Quotas are really good at making sure you can only get one of them. I'm saying that flippantly, of course, because current quotas don't do that very well — but unified limits in Yoga do; consider that an advert.

Really, though, a quota is a very public-cloud concept. When I was at Rackspace Public Cloud, the quota roughly equated to: if the credit card bounces, how bad is it? And generally the way you got your quota raised was to prepay some credit. That's not the case with fixed-capacity clouds. So what's the alternative?

The alternative is another pretty picture. We tried to think of an analogy for what this would look like in a nicer world — a sort of coral-reef cloud where everyone lives together nicely. I suppose, being from the UK and a bit British: how do you politely ask permission to have something tomorrow, that kind of thing? So we've been working with the Blazar project. Blazar allows you to go up and ask: tomorrow, could I have two GPUs, please? And it very politely says: no, they're all gone tomorrow too. And then you can ask it some more. (There's a rough sketch of what that request looks like below.)

One of the things we're actively developing right now is that when you go to your platform — you request a big JupyterHub, or a bigger laptop with 10 terabytes of memory please, because I'm doing astronomy and it's a really big image, or whatever it is — you pick the platform, you pick the size, and then you get another dialogue box where you pick where and when. When is that possible? So you can actually schedule your appliances.

The first small step of this, which we have done — you probably noticed it in the keynote — is that we put a maximum lifetime on the appliances. That's the first step towards limiting VM sprawl: you've got it, but you've only got it for 24 hours. Naturally people moan that they now have to come back every 24 or 48 hours and get another one, but then you get that natural balancing. But we can do something better, where we have a concept of credits — CPU hours, say. Give people credits, and when they book a reservation, they can't book more than their credits allow (there's a toy sketch of that accounting below, too). We can take this a step further. One of the downsides of doing reservations is that there's space in your system that isn't being used — the gaps in between the reservations.
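For a feel of what "politely asking Blazar for tomorrow" looks like, here is a hedged sketch against Blazar's lease API. The endpoint, token, dates and the `$gpu` capability key are placeholders, and the request body is only roughly the documented lease shape, not code from Azimuth:

```python
import requests

BLAZAR = "https://cloud.example.com:1234/v1"  # placeholder Blazar endpoint
HEADERS = {"X-Auth-Token": "gAAAA..."}        # a valid Keystone token

# Politely ask: "Tomorrow, could I have two GPU hosts, please?"
lease = {
    "name": "two-gpus-tomorrow",
    "start_date": "2024-06-12 09:00",
    "end_date": "2024-06-13 09:00",
    "reservations": [{
        "resource_type": "physical:host",
        "min": 2,
        "max": 2,
        "hypervisor_properties": "",
        # Hypothetical capability filter selecting GPU hosts.
        "resource_properties": '["==", "$gpu", "True"]',
    }],
    "events": [],
}

resp = requests.post(f"{BLAZAR}/leases", json=lease, headers=HEADERS)
# Blazar very politely returns an error if they're all gone tomorrow too.
print(resp.status_code, resp.json())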
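And the credits idea — only allow a booking if it fits within the project's remaining CPU-hour budget — is really just accounting. A toy sketch of that proposed concept, nothing more:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Reservation:
    vcpus: int
    start: datetime
    end: datetime

    @property
    def cpu_hours(self) -> float:
        return self.vcpus * (self.end - self.start).total_seconds() / 3600

def can_book(requested: Reservation, existing: list[Reservation],
             credit_cpu_hours: float) -> bool:
    """Allow the booking only if the project stays within its credits."""
    spent = sum(r.cpu_hours for r in existing)
    return spent + requested.cpu_hours <= credit_cpu_hours

# Ask for a 16-vCPU machine for one day, starting tomorrow.
start = datetime.now() + timedelta(days=1)
request = Reservation(vcpus=16, start=start, end=start + timedelta(days=1))
print(can_book(request, existing=[], credit_cpu_hours=1000))  # True: 384 <= 1000
```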
Those gaps are where Blazar's work on preemptible instances comes in. If you've got something like a Dask system that can deal with workers coming and going a bit, you can autoscale your Kubernetes into some of those holes, limited by your quota. Another piece is looking at changing Blazar so that when you request a reservation, you use an existing flavor — "I want a green one, a blue one, please" — rather than "I want four CPUs and this amount of RAM", which causes massive fragmentation. So there's a bit of a pivot there.

Okay, conscious of the time, let me run through a few of the other plans. One of the calls to action here is: if this is a use case that's interesting to you, do get in touch. All of this is open source, and by open source I really do mean trying to follow the Four Opens. It'd be great to get more people involved.

One of the big pieces of work at the moment: we have people running this in production now, so there are a lot of operational enhancements — getting onto that nice loop of "we need that extra notification so we can spot when that thing goes wrong", and having the runbooks people use to operate this system at reasonable scale. As I said, we've done the preview of the maximum application lifetime, and there are lots of other bits of work in progress. Things like making sure the environments in those bigger laptops have more batteries included, to use the Python terminology: we've added EESSI in there, which can help get Lmod modules — although that journey is in the Slurm appliance right now — and we're trying to have Apptainer and Podman and those kinds of things available. We've done optimisations to make sure you can have GPU workloads, Kubeflow and OFED, using a lot of the operators in Kubernetes. And we've just been through a lot of the reservation pieces.

More generally, there's an onboarding lifecycle of people wanting to create science platforms or applications and run them on your system. What we're looking at here is that people can come into Azimuth, create some platforms, and go: well, that's almost what I need. We want to smooth the flow from "almost what I need" to being able to use all the optimisations you've got on your site, tweaking things, and potentially even running them outside of Azimuth with things like ArgoCD. So we can basically have a set of recipes that says: if you've got OpenStack and you want to run these systems on it, this is a nice way to start — hit the ground running and get a leg up. It's a good way of sharing good practice. It might not be the way you end up doing it, but there's an awful lot of reinventing the same thing that this is trying to help eliminate. And the reverse of that: if you discover a really cool way of doing it, you can contribute it back too. That's awesome.

So, I'm conscious we're 26 minutes past and we should probably go on to a few questions. But the call to action here is: do definitely get in touch if you've got questions or you want to try it out. If you want to try it out, there's that URL there. Architecturally speaking, Azimuth is very like the OpenStack CLI, in that you give it your OpenStack credentials and it makes OpenStack API calls. So you and your project can actually use Azimuth on any OpenStack cloud you've got. Give it a whirl.
It's just a little all-in-one setup. It tries really hard to guess, based on the access you've got to OpenStack, the kind of things you want to set up, but the documentation gives you some hints on telling it what's actually true. So that's definitely worth a go if you're interested in this.

There's a whole bunch of things on GitHub. One thing that's interesting here is that we're using Helm charts to stamp out the Kubernetes resources. Several people have actually taken those Helm charts as a recipe for stamping out Kubernetes using Cluster API and OpenStack: rather than having to know all the resources, you just give it a set of Helm values and press go. And the interesting thing there — you should totally listen to the recording of Matt's talk from yesterday — is that there are a lot of add-ons you need on top of just vanilla Cluster API: things like the GPU operator, and a way to get RDMA inside the pods, so you've actually got RDMA working inside those Kubernetes pods as well. That's a whole different talk; let's not go there.

But yeah, thank you very much for listening. If there are any questions, do come to the mic. Yeah, do use the mic, just for the nice people listening to this afterwards.

Yeah, that's fine. I guess whenever I've worked with research computing engineers, I always see cloud bursting as being kind of a big ask. That solution is probably more at the application layer than the infrastructure layer, but are you getting a lot of requests for that?

So, on the first introduction-to-StackHPC slide, I skipped over the bullet point that said we're working on hybrid solutions — it's almost like I planted you to ask that question, because I forgot to mention it. So yes. On one level, what we're doing in Azimuth is cloud bursting already, because a lot of these people are using common infrastructure that's offered to them, not their home system — for a lot of them, their home system is a laptop, right? So this is their cloud bursting. That's definitely cheating, I know. But the idea is this: when I talked about picking the size of your platform and then picking where or when, I didn't go into the "where" — I only talked about the "when", with the reservations. The "where" is that there may be multiple regions, and one of them might have a Blazar reservation and not the other. If you're doing a simulation, you may have to move data, and that makes one choice easier. And the "where" might also be some other, non-OpenStack cloud, for example. If you look at the technologies we've picked here, we picked Cluster API because it's fairly cloud-agnostic — I say fairly, because everyone's got their own stuff, but you're talking the same kind of thing — and Terraform is very similar in that regard.

The Kubernetes looks the same as well. That's the most important thing: whatever infrastructure you deploy on with Cluster API, it's kubeadm that's managing it, so it looks the same and all the same add-ons will work. And like John was about to say, Terraform is similar in that sense. So we've deliberately made choices with our automation tooling to make multi-cloud easier in the future, and we are going to be doing that soon, probably. So yeah, absolutely, it's on our roadmap.

Are there any more questions?
That's a good one. Actually, we're out of time, I guess. Good. Excellent. Thank you very much, everybody.