Great, thank you very much. Hopefully everyone can see my slides being shared; if not, someone interject, but if everything's going to plan, I'll get on. Today I want to talk to you about a product, a piece of software, that I've been working on for the last two years or so, which is designed to give people access to cloud resources in a way that people, particularly in research, are more used to. I'm going to start by telling you a story about how this all came about and then dive into some of the more technical details.

I work at the University of Bristol as a research software engineer. One of my jobs there is supporting researchers, making sure they can get the most out of whatever computing resources they have at their disposal. A lot of people in research aren't experts at programming or sysadmining or any of this stuff, and so my job as a research software engineer is to provide that bridge between the two so they can get the most out of things.

Something we've been finding over the last few years is that researchers get approached by cloud companies wanting to give them cloud credits as a way of starting to form an academic partnership: the cloud vendor gets some publicity around research and doing good in the community, and the researcher, of course, gets the benefit of access to more compute resource. So it sounds like a match made in heaven initially. However, it does start to have a bit of a problem. It starts off with the cloud vendors going to a researcher at a university, some research group doing vaccine discovery or molecular dynamics or something like that, and saying: here's $100,000 worth of cloud credits, go ahead and do the thing you do best, do your research. The problem is that the researchers are then presented with interfaces something like this. They log in, they've got their account, and they see lists of jargon and technologies and they have no idea what it means. The average researcher, who is an absolute expert in their field at doing whatever science or research they're doing, has never needed to know until this point what a subnet is, what a netmask is, what an internet gateway is, or NAT, or any of these things. These are so far outside their realm of experience that they're just not something they have ever had to deal with. And so they have no good way of making use of the cloud credits the cloud vendors are providing them with; they just don't know how to get started. Maybe they've got someone in their group who can go ahead and create a VM, but creating a VM running Ubuntu or CentOS isn't going to compete with having access to some kind of institutional or national or international batch computing resource. So we came to this with a desire to smooth over that divide and help out these researchers with the stuff that they're doing.

So, what do the people who work in research know? They know their research. They know their field of science or technology or engineering or whatever research they're doing; they know that brilliantly, they're top in the world in their field. And they know some set of software tools that they use to perform their research. Ninety-odd percent of research involves research software in some way, so everyone needs to know some level of software, and they tend to become a relatively local expert in the particular tool that they're using.
If they're a little bit more advanced, and this is the audience we're focusing on, they're also a little bit experienced with batch computing in some way. They've used the university's resources to submit batch jobs and run their research on a larger scale than their laptop can provide by itself. So if someone's in that situation where they know the research, they know some software and they can run it on a cluster, but they want to get access to the cloud, there's still a big jump for them to be able to do it. They need at least someone to set up a cluster for them, whether that's using an existing product or creating one from scratch, but they can't just do it all by themselves. We can't expect the researchers to be their own sysadmins; it's best if people devote themselves to one particular set of skills and get the most out of that. And at Bristol particularly, this is where we in the research software engineering group sit: we are there to provide this bridge for the researchers to get the use out of the cloud resources.

So the solution that we came up with is to give them the environment they're used to. They're used to logging in through PuTTY or a terminal, SSH'ing into a PBS or a Slurm cluster, running sbatch to submit their jobs, waiting for the results and downloading them. So we want to give them something that looks exactly the same as that. From their perspective, their mental model is that it's going to look exactly the same; they don't really have to understand the difference between it being local or on the cloud, with the exception of thinking about data transfer, which is always an issue, but that's something we work on with them. The benefits, though, are considerable. When I've talked to researchers, the main thing they complain about is queuing. With a cloud cluster and the system I'm going to describe, there is effectively no queuing any more: they can just submit a job and it starts almost immediately. They also only have to pay for what they use. Now, if the university's systems are free at the point of use, this is an additional cost, but sometimes for quick-turnaround, short-term projects, paying for cloud resources is a good way of getting that burst of resource that you need to get your research out quickly. And we've had people using it, as I'll come to later, for COVID-related research, and so having access to this extra resource so you can spin up your experiments really quickly is invaluable.

So the solution I created is called Cluster in the Cloud. This was initially created for a collaboration we had with Oracle Cloud. They'd approached some researchers at the university who were working in molecular modelling and said: we want to work with you, we want to supply you with cloud credits, but there's a deadline on when those cloud credits need to be spent. So I was tasked with putting a system together, and with help from some of my colleagues at Bristol we had about a week to create something which could create clusters in the cloud from scratch, so that when we got the resources from Oracle we would immediately be able to spin up and start running the experiments. In that time I had to learn how to use the cloud, how to use Terraform, et cetera, so it was quite an intense time, but we got it done, and since then the product has evolved. So at its core, what Cluster in the Cloud provides is a Slurm cluster. That is at the core of the whole system.
It gives them a Linux terminal they can log into, of course, but the main interface to the cluster is through Slurm. The two main building blocks that come together to build it are Terraform at the bottom layer and Ansible on top, and listening to the Magic Castle talk yesterday, they've clearly had the same ideas as us for a lot of this stuff. The Terraform goes in and creates the infrastructure: it sets up the networking, it creates a shared file system so that you've got a home area for all your users on the cluster, and it creates a single shared management and login virtual machine, relatively small and lightweight. At the base, that is all that exists: a single virtual machine which you can log into. This helps keep the baseline costs nice and low, so that while you're not actively running any research it's not costing you too much. On top of that, we use Ansible to configure all the nuts and bolts in the classic way that you use Ansible. It sets up the management node with user management and the Slurm configs, and we also use it along with Packer to create the compute node images, so those can be edited on the fly and dynamically updated if necessary.

So, from our perspective, some of the key features of Cluster in the Cloud. First, it's a familiar environment for the researchers. They are mostly used to using Slurm; at some universities they're using other scheduling engines, but the principles are the same between them anyway. We're also working at the moment on getting a JupyterHub interface to it, so that people can just log in through a web browser and use things like Dask to do their work through a web-based interface, which for some researchers is more what they want.

It's really versatile. When I first started Cluster in the Cloud, I did look around at the alternative products out there. The main problem with all of them was that none of them worked on Oracle Cloud, which is where we had the money, so we had to create something from scratch. But quite a lot of the cluster solutions out there also didn't allow you to have multiple types of node: you had a choice of what kind of node you wanted the cluster to have, and then you could change how many of that one type you had, whereas we wanted to be able to have some GPU nodes, some large, fat CPU nodes, and some small ones that just get spun up quickly and then shut down again. We wanted that versatility.

The second big headline for Cluster in the Cloud is that it's dynamic, which means that nodes only get created when they're needed and they get destroyed again afterwards. Of the cluster solutions out there that did allow you to have multiple node instance types, they generally weren't dynamic: it was fixed at the beginning, you said I want ten of these and three of those, it would make them, they would sit there, and you'd be paying for them unless you manually went and turned them off or configured it another way. The baseline cost for it is really low, we're talking tens of pounds a month or so, which in a research budget is basically nothing; you only pay for the compute resource and the storage when you need them.

Another thing that's started to allow it to get a little bit of traction is that it's cross-cloud. It all started out on Oracle.
They supported us by providing us with cloud credits at the beginning, but since then we've been in contact with groups and engineers at AWS and Google who have expressed that they would like to help us get it working on their clouds, and so they provided us with cloud credits and some engineer time as well, which we are very grateful for. I'll contrast it with some of the solutions that AWS and Google have built into their platforms, and we're hoping to make Cluster in the Cloud a more formal offering from AWS and Google and Oracle in the future, something they might be able to provide support for. And finally, I think possibly the most important feature of the software is that it's open source. Completely free and available, MIT licensed; anyone can use it, hack it, steal it, do what they want with it, and it's all on GitHub so people can open pull requests and issues and all that stuff, and that does indeed happen. So I think the main big features here are the versatility and the fact that it's dynamic: you can have multiple types of nodes, and they turn on and off as they're needed or not needed, as the case may be.

So, into the technical details a little bit. As I said, there are two main technical blocks to what Cluster in the Cloud really is. At its core, and the main entry point that people have, is a set of Terraform scripts. We have a separate implementation for each of the three cloud vendors that we currently support. The code is relatively independent between them, because the Terraform code basically has to be rewritten for each provider. And, as I said, I learned Terraform for this project, so I've evolved along the way as to how this code is structured. But it's a few hundred lines of code for each of the cloud vendors, because it's not doing that much work: it's making one VPC, one subnet, starting up one virtual machine and creating one shared storage solution, and then there's just a bit of security and firewalling to glue all that together. So it's not doing that much. We try wherever possible to keep the complexity of the system to a minimum, and to only add features when they're available on all of the cloud platforms, to keep that feature parity between them. So we really do try to keep this code as simple as we can, and not provide too many knobs and dials which people would tweak and which would require more support work from us.

The other big chunk of code, and this is a larger amount of code because there are a lot more fiddly details to deal with, is the Ansible side. This gets kicked off when the management node is turned on, and it's used to build each of the node images as well. It's about one and a half thousand lines of code and, again, it's up on GitHub so you can have a look and cringe at my terrible Ansible code. This is what does the actual bulk of the configuration: it sets up mounting the file systems, user management, monitoring, web interfaces, all this stuff. It does the standard things that you need on top of the base image.

So I'm going to go through now the actual underlying process that Cluster in the Cloud follows when you are turning nodes on and off and doing things with the system. As I said, it's scalable: it's going to be turning nodes on and off as and when they are needed by the user who's trying to run jobs. When you create a cluster, you run the create step, Terraform sets everything up, and you log in.
One of the only other configuration steps you need to do is to give it a sense of what kinds of nodes you want your cluster to be made up of. So you might say, for example, I want 1000 of this particular instance type, whether it's a Graviton or a bare-metal instance or a t3a.micro or whatever kind of node type you want to use, big or small. You say: I want this many of this type, this many of this type, this many of this type. And due to the way we've integrated it with Slurm, it uses that behind the scenes to set up all of those, up to 1000, virtual machines. Virtual machines in potentia, if you will: they're sitting there in the configuration, they don't exist in the cloud yet, they only exist in Slurm's memory, and it knows that if it needs to, it can create one of those instance types. So they are sitting there ready to be created, but at this point you're not paying for anything, because they don't exist yet.

The way that Slurm then hooks into that is that you log into the cluster and you submit a job saying: I want this much RAM, this many cores, I want a GPU or not, I want this architecture, whatever specifications you want to set for your job. Based on that, Slurm will look at that list of all the potential nodes it could use and it will choose, in principle, the best one for you, or at least it will choose one of them for you; in general it will choose the smallest one that fits. Once it's decided which node the job is going to run on, it then kicks off an API script behind the scenes to actually create that node. It creates a virtual machine in the cloud environment from scratch, based on the image that was created earlier and configured using Ansible. So it creates that node from scratch, turns it on, and once it comes up, the job starts running on that particular node. That takes however long it takes. We are currently not implementing any preemptible or burstable nodes or anything like that, so jobs don't get interrupted; this is all on-demand pricing. So it runs that job, and when the job's finished, that node is now empty. It's ready to be used for another job if there's an applicable one in the queue. However, if there aren't any more jobs to be run and the queue's empty and that node is no longer needed, then after some timeout it destroys that node completely, wipes it out, deletes the block storage. That node is no longer being paid for; it doesn't exist any more. It then goes back into Slurm's little in-memory list as another potential node a job could be submitted to. And this allows us to keep the number of nodes that are currently running down to the minimum we can.

So, by way of diagrams to talk through that: we have your laptop up here at the top, and you are logging into the shared login and management virtual machine in the cloud environment. You submit your job and it goes and creates a node from scratch and starts running your job on it. If you submit more jobs, it makes more nodes, and this is how you avoid having to queue: if you set your number of potential nodes high enough, there's always going to be some in that pool, so Slurm can grab a name, create the node and start running your job. As each job finishes, its node goes idle.
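To make that mechanism a little more concrete, here is a minimal sketch of the kind of resume and suspend hooks that Slurm's power-saving support calls when it decides a cloud node is needed or is no longer needed. This is not the actual Cluster in the Cloud code; it assumes AWS with the boto3 library, and the node names, image ID and subnet ID are placeholders.

```python
#!/usr/bin/env python3
# Sketch of Slurm power-saving hooks for cloud nodes (illustrative only).
# Assumes AWS and boto3; the IDs and node names below are placeholders,
# not the actual Cluster in the Cloud implementation.

import sys
import boto3

ec2 = boto3.client("ec2")

# Hypothetical mapping from Slurm node names to instance shapes.
NODE_SHAPES = {
    "cpu-0001": "c5.4xlarge",
    "gpu-0001": "p3.2xlarge",
}


def resume(node_name):
    """Create the VM backing a Slurm node from the pre-built compute image."""
    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",      # image built with Packer/Ansible (placeholder)
        InstanceType=NODE_SHAPES[node_name],
        SubnetId="subnet-0123456789abcdef0",  # placeholder
        MinCount=1,
        MaxCount=1,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "Name", "Value": node_name}],
        }],
    )


def suspend(node_name):
    """Destroy the VM once the node has sat idle past the timeout."""
    reply = ec2.describe_instances(
        Filters=[
            {"Name": "tag:Name", "Values": [node_name]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ],
    )
    ids = [inst["InstanceId"]
           for res in reply["Reservations"] for inst in res["Instances"]]
    if ids:
        ec2.terminate_instances(InstanceIds=ids)


if __name__ == "__main__":
    action, node = sys.argv[1], sys.argv[2]
    resume(node) if action == "resume" else suspend(node)
```

In slurm.conf terms, hooks like these would hang off the ResumeProgram and SuspendProgram settings, with SuspendTime providing the idle timeout described above.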
And once they've all finished, or as each one finishes after the timeout, they get destroyed; they no longer exist, you no longer have to worry about them and you're no longer paying for them. So you are still only paying for the upkeep of your management VM and for whatever amount of storage you're using. This keeps those baseline costs down, and people do worry about spending money in the cloud. People do fear the cloud because they don't know how much it's going to cost, so by promising them that we're going to keep the base cost as low as possible, we tend to make people feel a little bit happier.

So here is one of our monitoring views showing an example job I ran a while back on AWS. This is showing someone who's logged into a cluster with no nodes running at all; it's completely empty. Over there where the z's are, there are no nodes on at that time at all. They go ahead and submit a 40-node array job. They're saying: I want to run 40 independent things that don't all have to be running simultaneously, but because we're on the cloud, we might as well make the most of it and get our results sooner. We could run them serially, but let's just do them all in parallel and get the overall time down. On AWS it takes maybe a minute on average to start a node, which is really nice and fast, and I'm very impressed with that. We had some discussions with the AWS ParallelCluster developers and they were asking me how I managed to get the node start time so short. And I don't know, that's just how long it takes; I don't know why it would take any longer than that. It's just turning on a VM from an image. But I think one of their issues is that ParallelCluster seems to have longer startup times, so you're waiting longer for your job to start. It's not quite a two-hour queuing time, but nonetheless you want it to be as fast as possible. That spinning-up time does vary based on how much availability there is and which cloud vendor you're on; there is quite a bit of variability in how long it takes. But on AWS with small VMs I've always found it to be very, very fast, so you're not wasting too much time there.

Then your jobs run, they go through, they're doing their work, analysing their data, whatever you want. It's just a CentOS 8 image running on there; you can run whatever software you like. On which note, we've also been integrating things like Singularity so that people can bring their own images along. Rather than having to configure the node image itself, you can just run Singularity on top of the base OS image, which allows people to bring along software they've got sitting in a Docker container they've been using to do their research; they don't have to recompile it or reconfigure it or anything like that.

Once those jobs are finished, the node goes idle. Here we've got what looks like a five-minute wait time. That's a configurable amount of time: you can ask it to wait, you could set it to be an hour, you could set it to be zero minutes. It depends on what kind of workflow you're expecting to have. On the whole, we've designed Cluster in the Cloud to be used by small research groups, so I'd expect someone to create a cluster for a single campaign with maybe only a handful of users who are actively working on the cluster, maybe only one user, up to maybe five or so. So there tends to be not too much contention and overlap.
And so on the whole it's not usually the case that there's going to be another job ready to jump in straight away. We expect people to use a cluster for a campaign, and when they're finished, copy the data off, shut down the cluster, and then they're not paying for anything at all. It's not currently designed to be a replacement for an institutional cluster where you've got thousands of users and so on, because the economics of that situation are a little bit different, and so you have different priorities and pros and cons. But after your wait time is over, however long you decide it to be, the nodes get turned off and destroyed and you're no longer paying for them. So on a five-minute job with a five-minute wait time, you're only paying double what you should be paying, if you know what I mean. If, however, it's an hour-long job and you set your wait time to zero, you'd be paying a fraction of a percent on top of what you should be paying, because nodes get turned off afterwards and they're not on beforehand; they just get turned on as and when you need them.

I've tried to keep my research software engineering hat on when developing Cluster in the Cloud as much as possible and make sure everything is tested and evaluated properly. So we have a test script which goes through and starts a cluster on each of the cloud vendors. Looking at AWS specifically, it starts from a baseline where you have nothing running on the cloud; it creates a cluster from scratch, starts things up, creates the node images, submits a job, checks the results to make sure that things are configured correctly, and then turns everything off. Last time I ran this, the full test took about 17 minutes on AWS, which is a really, really good time; I'm really happy with it taking only that long to create a cluster entirely from scratch. And I like to compare this with how long it takes to provision a physical cluster. I'm sure there are people here who've worked on provisioning clusters for your institutes or universities, and it takes more than 17 minutes; in my experience you're talking months, and it's taken more than a year before with clusters I've been affiliated with. So 17 minutes is a pretty good benchmark to be working to. And on the whole, the job-submit to job-start time is one minute. It's not instantaneous, unless there happens to be a node already running in the background, but it's certainly better than the hours-long waiting in the queue that people are used to on many batch systems.

So I want to talk about some of the pros and some of the cons of what Cluster in the Cloud does. There are definitely places where it does well, and places it's not designed for or where we haven't got round to working on that use case yet. The place it's been used most so far is for heterogeneous and high-throughput tasks, where you have a whole bunch of data files you want to analyse and you want to run them through something like RELION, or you're doing a whole bunch of different parameter sweeps through GROMACS or something like that. Each of them is an independent task and you just want to churn them through the system. It works really well like that, because we haven't yet set up multi-node working: communication through high-performance networks, high-performance storage solutions, things like that. So as long as each job is relatively self-contained, it works really well in those situations.
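By way of a sketch of that high-throughput pattern (this is not part of Cluster in the Cloud itself; the simulation command and parameter values are made up for illustration), a researcher might submit a parameter sweep as a set of independent Slurm jobs:

```python
#!/usr/bin/env python3
# Illustrative parameter sweep: one independent Slurm job per parameter value.
# On a Cluster in the Cloud system, each submission causes a node of the right
# shape to be created on demand and destroyed again after its idle timeout.
# The command name and parameters here are placeholders.

import subprocess

temperatures = [280, 290, 300, 310, 320]  # example sweep values

for temp in temperatures:
    subprocess.run(
        [
            "sbatch",
            "--job-name", f"sweep-T{temp}",
            "--cpus-per-task", "16",
            "--mem", "32G",
            "--wrap", f"./run_simulation --temperature {temp}",  # placeholder command
        ],
        check=True,
    )
```

Because each job is self-contained, Slurm can hand each one its own freshly created node and tear the nodes down again once the sweep is done.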
We've also found it works really well with pipelined tasks, where different points along the pipeline have different requirements. For example, we were working with RELION on one of the early projects we did with this, doing cryo-electron microscopy image analysis. There we had some steps early in the pipeline which were mostly CPU-bound (aligning images, which at the time was a CPU-bound task), moving through to later stages which were GPU-accelerated. We didn't want to pay for the GPU nodes when we didn't need them, so we broke it up into a pipeline and used different types of nodes, and different numbers of nodes, at different parts of that pipeline. It works really well for a situation like that, because you've got a lot of configurability about what's going on.

It also allows you to be more specific. Cloud vendors have got a lot more variety in how fat a node is. Generally, if you've got an institutional cluster, most of the ones I've come across have one or two sizes of node: you've got a 24-core node with this much RAM and that's all you get to work with, and so you start optimising the size of your job to fit within that node shape. With the cloud, with any cloud interface like this, you can start thinking about what the right shape for your job is and then choose the right resource to fit it, which for many researchers is an easier thing to work with. It also gives you access to the latest hardware. Last summer, in 2020, we got access to the Graviton 2s, the ARM processors on AWS. A point of pride for us is that we got the Graviton 2 processors integrated into Cluster in the Cloud and available to the public before AWS's own competing product, ParallelCluster, had support for Graviton 2. So we beat them to the punch with their own hardware, which I was quite happy with. That's because we had a deadline for a workshop, so I really crunched to get it all done.

By contrast, it's not optimised at the moment for multi-node workloads. If you've got things running across 10,000 nodes doing MPI, doing communication, things like that, it's currently probably not the best tool to use, because we haven't got things like placement groups and Elastic Fabric Adapters and all the other high-performance networking stuff that you need if you're going to be doing really high-intensity multi-node communication. We are working on this; we've got some collaborations with some industrial partners who are helping us get this work done. I'm looking forward to seeing the results of that, because people particularly want to use Cluster in the Cloud for benchmarking, and if you're going to be benchmarking the cloud, you need to make sure you're getting the most out of the infrastructure you're on.

The second point it's a bit weak on at the moment is that it only has cheap shared storage. Most cloud vendors provide you with some kind of cheap NFS-mounted file system, just so you've got a shared file system between the nodes, and currently that's all we have support for. So if you've got a job doing a lot of reading and writing to disk, and another node trying to read and write from that same place, and they're communicating via that mechanism, that's going to grind to a halt. It's not going to be anywhere near as efficient as you might be used to with an institutional cluster.
Again, that's something we're looking at improving in the future by having pluggable support for whatever kind of storage solution you need.

We've had really good experiences with using it for teaching clusters. For example, if you're teaching HPC, and I know there's interest in that in these groups here, being able to create a cluster in the cloud which is completely independent, siloed and safe means you can let people go mad on it, and when the workshop's finished you shut it down, destroy it, it's gone. You don't have to worry about security or runaway jobs or anything like that; it's completely mopped up and cleaned away afterwards. And similarly with benchmarking: because you've got access to every single possible node type on AWS or Google, you can just run your code on all of them. We did this, for example, looking at some benchmarks for the AWS Graviton processors versus their AMD versus their Intel ones, which I gave a talk about over the summer. There were some interesting results there, and Cluster in the Cloud made that really easy.

It's also quite good for some of the meso-scale, middle-sized data stuff: things like Dask and Spark and Singularity. It allows you to plug in those usual sorts of interfaces that people are starting to get more comfortable with. One of the problems we've had with Dask, for example, which is a Python tool for running stuff across clusters, is that if you run on an institutional cluster, you're sitting there in your Jupyter notebook, you want to submit your job to run on four nodes in parallel, and you sit there in the queue. So you run your cell and you wait 10 minutes, half an hour, an hour; it's not optimal. With something like Cluster in the Cloud, because there's no queuing, you run your cell, it takes up to a minute to start your node, your code runs and it's all there waiting for you. So you've got that interactivity, which I think is a really useful and powerful part of the cloud.
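As a sketch of what that interactive workflow can look like from a notebook on the login node (assuming the dask and dask-jobqueue packages are installed on the cluster; the resource numbers are illustrative rather than recommended settings):

```python
# Driving the Slurm queue from a Jupyter notebook with Dask (illustrative).
# Assumes dask and dask-jobqueue are installed; resource numbers are examples.

from dask_jobqueue import SLURMCluster
from dask.distributed import Client

# Each Dask worker job becomes a Slurm job, which in turn causes a cloud node
# to be created on demand, typically within a minute or so.
cluster = SLURMCluster(cores=8, memory="32GB", walltime="01:00:00")
cluster.scale(jobs=4)  # ask for four worker nodes

client = Client(cluster)

# Trivial example computation spread across the workers.
futures = client.map(lambda x: x ** 2, range(1000))
print(sum(client.gather(futures)))
```

Because the nodes appear on demand, the scale call comes back with workers after roughly the node start-up time rather than after a queue wait.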
So, we work inside universities; that's the context of what we're doing here, and we've got a lot of research going on. We've particularly been working with life sciences and biomedical research, things like that. There have been a bunch of papers and pieces of research which used Cluster in the Cloud as their computing mechanism for getting the work done. Some of the early projects we did were on vaccine delivery and smoking cessation; some groups at the University of Bristol work on those, it was a really successful project, and that was what Cluster in the Cloud was originally created for. More recently, of course, with the coronavirus outbreak, we've had to pivot a little bit and make sure we're supporting COVID research, and we've had a couple of publications out of that: people using Cluster in the Cloud for molecular dynamics and image analysis for research into the virus. There have been some really interesting results out of that, and I think Cluster in the Cloud has been really valuable to them for this ability to run your research and get the results out really, really quickly.

Coming to an end now, a bit ahead of time, so we've got some time for questions. Future plans for Cluster in the Cloud; I've touched upon a few of these throughout. There is currently a web interface in Cluster in the Cloud. It's not advertised because it's still under development, but we want to flesh that out; we want to make this a more usable and easy product. It's all very well me saying I've made this thing to make it easy for researchers to use the cloud, but some of them aren't comfortable with using command lines and having to do any level of sysadmin; they'd be much more comfortable with a clicky web interface and being able to do things there. High-performance network and storage solutions: I mentioned those earlier, and they are things we're either working on or at least have an idea of where we want to go with them; they are relatively high priorities for us. Backup to cloud storage: at the moment, when you destroy your cluster, everything gets wiped out and nothing is left. We want to have a slightly softer close, so that people can copy their data off and come back to it later, moving it onto Glacier or something like that. And finally, we do want to get support for other clouds. We've had a bit of a talk with some people at Azure about getting support there, and with some people working with OpenStack, so that should be something we can do in the near future. One of the problems with all of this is that Cluster in the Cloud is not a funded project. It's something that I've worked on in my spare time alongside my full-time job. It's partly been used to support some of the research at the university, but it's basically been a hobby project until now, so it's been limited by available funding and time, and we have to really choose what we want to prioritise.

So finally, I'd say thank you very much. There's the link to the documentation there if you want to have a go with it. You will need cloud resources if you want to actually run it, so it's not something you can run on your laptop yet, but if you do have access to some cloud resources, do feel free to have a go. I'd like to thank AWS, Google and Oracle for having given me cloud credits in the past for testing out Cluster in the Cloud, and the rest of the Bristol RSE team for helping out with all of the software development along the way. So thank you very much indeed, everyone, and I'm happy to take any questions.

Thank you to Matt for the talk. I'll just ask the first person to unmute so they can ask a question.

Thank you, I'm Sabri from Oslo. So you touched a little bit on data transfer, and I also see that you have a lot of biological software being used, so they use BLAST databases, for example. People who are using HPC are sort of used to having certain storage mounted: they install things in their home area, they expect the Python packages to be loadable in their job, and they want some data to be available when the job runs. So how does a person using your system take their data from their computer to this cloud, and what is the persistence like?

Yeah, so far the model has been, like I said, that we tend to use Cluster in the Cloud for a campaign. So it would usually be that we create the cluster for them at the beginning and we help them, because they're generally not aware of how to transfer even medium amounts of data over the internet. We help them move it onto the cloud cluster and it sits in that NFS shared area, which is accessible from the login node and from the compute nodes. So it can effectively sit in their home area, as you say, and they've got access to it from wherever. Reading that kind of data tends not to be the bottleneck; the bottleneck tends to be the compute side of things, at least on the kinds of jobs that we've been working on so far.
There are going to be IO-bound compute jobs, but we haven't had to deal with them yet, so we haven't optimised for that use case. But in general, yes: if they've got a whole bunch of data, a BLAST database, whatever they want to use in their job, we would just copy it up to the cluster, it would sit in their home area and they would use it from there, and that's generally worked out.

Pass to Alan next.

Yeah, hi. Just a comment about the interconnect support. I worked on this recently with Magic Castle and tested out the InfiniBand fabric on Azure and EFA on Amazon. And a comment about Amazon and EFA: currently in Terraform it's not supported. So if you have a bit of leverage with Amazon, it would be good to have somebody else pushing them to add that support in Terraform; that would probably be useful for you too, I imagine.

Yeah, great. So far we've mostly been doing this kind of thing through the Python API side of things, so creating placement groups and so on, but we're still thinking about exactly how we want to configure this stuff. So I do think that having some of this configurable through Terraform, or even moving some of our dynamic configuration to be Terraform-based, could be a good solution. So I agree it should definitely be configurable through Terraform, and that's on my list of things to talk about next time I talk to Amazon, because that definitely sounds like a good idea.

Pass across to Jorg next.

Thanks, that's a very interesting project. How do you do the software? Are you planning to do something like the EESSI project and simply mount it, or do you install it, or is everybody using something like Singularity?

I should have expected a question like that from the EasyBuild user group, so thank you for asking. So far we've left that as a layer just above what we provide; we try to only be opinionated to a degree. Every extra piece of software we make a formal part of our specification effectively means another thing we have to support and test and maintain. So so far we've said we give you a plain cluster, and then we provide information or help if you want those things. Until recently, my solution had always been to tell people to go ahead and install and set up EasyBuild, and use that to install the software onto your shared area, as a common, easy way of doing things. People are used to using module load to grab stuff, and so using EasyBuild to put all the software into the system seemed like a good solution. Kenneth came to one of our Cluster in the Cloud workshops just before Christmas and introduced me to EESSI, so I'm very tempted, and Kenneth, we've done some work on it, to set that up in Cluster in the Cloud. I used to work in particle physics, so I'm very familiar with CVMFS, and I think it's a really good match for this kind of environment. So I think in the future we'll have, we might almost already have it, a single switch to turn on EESSI access so that you've got access to all of that software across the cluster. I think that's the simplest solution, and it's also good because it's basically trivial for us to support, as long as it's got support for the correct architectures and so on. I think it's going to be a really good fit for us. We want to keep things like Singularity, and keep things like Spack and EasyBuild, as options so that people can set up their cluster however they want. But if there's an option like EESSI which is zero maintenance, then absolutely we should make the most of using that.
Yeah, I'm thinking along similar lines. I think EESSI is quite good for common stuff, and if you go into the bioinformatics area, where you've got more bespoke pipelines, for example, which are simply not part of EasyBuild for one reason or another, then a Singularity container might be the best way forward.

Yes, exactly. So we had one place where we really needed Singularity: we were working with a group from a university near London who had a particular finite element analysis package of a specific patch version, which needed a specific patch version of Python, and it was only set up to work on Ubuntu. So we said, we're just going to bundle that into Singularity and then use that, rather than trying to engineer a more traditional solution. So I think the more options we provide, especially if they don't require any maintenance from me, the better; that's the way I'm going to go.

Yeah, thank you. To jump in on that a little bit: an especially good fit between Cluster in the Cloud and EESSI is that EESSI also pulls in stuff on demand, basically, through CVMFS. So it's only the stuff you will actually be using, and a pretty small local disk is enough; the software gets pulled in transparently. So in terms of cost saving and all that, I think it's also a good fit to combine; it makes a lot of sense.

I don't see any other hands raised, so let me ask you a question as well. During the workshop in December, I played a bit with Cluster in the Cloud to figure out how hard it would be to add EESSI to it. It turned out to be fairly, fairly trivial: install CVMFS, install the right configuration package, and it was basically there. And I think AWS gave us, what was it, $25 or $50 of credit for playing around with that, which was very useful. I tried to keep a good eye on things just to make sure I wasn't ramping up costs and getting into trouble, and as soon as I was done playing with the workshop I killed the whole cluster and said, okay, there are now no more costs. That was shortly before the Christmas break. And then after the Christmas break, I noticed that I had racked up like $25 of costs, and it wasn't quite clear to me what had happened. It turned out that it was the disk images. I was rebuilding the disk images a couple of times because I was testing things, not really paying attention, and I guess I was sort of assuming that Cluster in the Cloud would throw away the old disk images which I wasn't using any more, which doesn't seem to be the case; it seems to leave them there in AWS. And if you have a bunch of them, and I had like 20 versions or something, it starts adding up over time if you're not paying attention. Is that just something you haven't worried about yet?

It's something which I didn't know about at first; I'd assumed that the way I was running it was going to destroy them. One of the issues with all this stuff is that, as you point out, any bug or misfeature like that costs money, and so it's quite scary having to provide people with access to this stuff. So I think that's something we need to prioritise fixing, absolutely. One of the issues at the moment, for example, which is why I didn't do a demo today, is that we don't have any cloud credits on any of the cloud vendors, because this is all an ad hoc, going to them and begging for money kind of situation, because universities struggle to buy cloud. So we haven't got the ability to test and develop at the moment. But that is definitely something that should be prioritised fixing.
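The clean-up being discussed here is fairly mechanical; a minimal sketch of what it might look like on AWS with boto3 follows, where the image name filter is an invented convention for the example rather than how Cluster in the Cloud actually tags its images.

```python
# Sketch of cleaning up old compute-node images and their snapshots on AWS.
# Assumes boto3; the name pattern is a made-up convention for illustration.

import boto3

ec2 = boto3.client("ec2")

# Find images owned by this account that look like old compute-node builds.
images = ec2.describe_images(
    Owners=["self"],
    Filters=[{"Name": "name", "Values": ["citc-compute-*"]}],  # placeholder pattern
)["Images"]

# Keep the newest image; deregister the rest and delete their backing snapshots.
images.sort(key=lambda img: img["CreationDate"], reverse=True)
for old in images[1:]:
    ec2.deregister_image(ImageId=old["ImageId"])
    for mapping in old.get("BlockDeviceMappings", []):
        snapshot_id = mapping.get("Ebs", {}).get("SnapshotId")
        if snapshot_id:
            ec2.delete_snapshot(SnapshotId=snapshot_id)
```

Something along those lines, run as part of the image build or the cluster teardown, would stop old images quietly accumulating cost.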
Absolutely, you're right. Things like that can slip through the cracks. So yeah, that will get pushed up the list.

I should put you in touch with Christian. I don't know if he's on the call, but he works at AWS and he's been quite generous with giving us AWS credits for testing and playing with stuff. So he can probably help you out with setting something up that's good enough that you're at least not held back in terms of development of Cluster in the Cloud.

Yeah, I mean, development-wise, I think when I developed the AWS integration it cost about 40 pounds or so of cloud credits to do the whole thing. So it's small amounts of money; it's the longevity of the credits which has always been the issue. They run out, and then we have to go back to them and ask for more, and then they run out again, and then it's management and that takes time.

Yeah, okay. But thank you. Cool.