Welcome back. Our next talk is from Félix-Antoine Fortin from Université Laval. He'll be talking about Magic Castle. Is that okay? Is it my turn? Yes? Yeah. All right, good.

Hi, as Simon said, my name is Félix-Antoine Fortin. I'm the new team leader of the research software development team that is being built at Université Laval, and I'm going to talk to you today about Magic Castle, a project that I've been working on since around 2017 in collaboration with Compute Canada, in order to solve an issue that I found in Compute Canada. For those who are not aware, in Canada we have what is called the Canadian digital research infrastructure, currently managed by an organization named Compute Canada, which is a federation of different organizations across Canada. The most important part is that we have supercomputers and data centers across Canada. There are five of them, and they are mostly new: four are HPC clusters that are freely available to any researcher in Canada, and we also have a big data center doing cloud, a major OpenStack deployment, that is available to researchers in Canada. As a staff member at Université Laval, I've been training researchers from across Canada, and mainly Université Laval researchers, on how to use those infrastructures. Eventually, we also teach those users how they can use HPC clusters in Canada to do their research. Across Canada, on average, and those numbers are probably not up to date, you can imagine we run around 150 workshops per year across Compute Canada, and those efforts are coordinated: someone in BC can collaborate with someone in Quebec to put up a training that is hosted in different institutions. In most of our workshops, so Python, Spark, and so on, we try to use the HPC software environment, because we are teaching programming and scheduler notions so that Compute Canada users get better at using HPC clusters and make better use of the different environments.

The main issue we found in 2017 is that in order to be trained on how to use an HPC cluster, you need an HPC cluster account. And if you are just starting your training, if you are just getting used to working on a cluster, you might not have that account. So for training, what we used to do was request what we call guest accounts on our production systems. That's an issue because, first, at the time the process was not very streamlined: it took a few days to get guest accounts. You also end up with potential issues while running your workshops: if there is downtime on the cluster, your workshop has issues. The other aspect that was not much fun was that training jobs would potentially run next to production jobs doing actual research. And we might not actually need an HPC cluster in order to teach how to use an HPC cluster. What we need is the software environment; from a software point of view, we just need a replica of an HPC cluster. So I started to ask myself in 2017: could we replicate our current Compute Canada HPC environment somewhere else, outside of an HPC cluster? We came up with a solution that we call Magic Castle. Magic Castle is an open-source project that instantiates a Compute Canada cluster replica in any major cloud with Terraform and Puppet.
So Terraform is a tool that we'll see on the next slide, but what it does is create virtual machines: a management node, login nodes, compute nodes. It creates the volumes, the network, the network ACLs, the certificates, the DNS records, the passwords; it installs the software and configures the instances, so that in 20 minutes, using only Terraform, you get a replica of a Compute Canada cluster in any major cloud, including Azure, Google Cloud and AWS, and mainly on any OpenStack, since in Compute Canada we had this big OpenStack cloud that we could use for training. The project is hosted on GitHub, has been presented numerous times, and we're going to go through the different aspects of the project.

There are two components, two main projects, that make up Magic Castle. If you go on GitHub, on ComputeCanada/magic_castle, what you will find are the Terraform sources. Terraform is a tool for building, changing and versioning infrastructure. It has a special language, the HashiCorp configuration language, that allows you to describe an infrastructure, meaning volumes, virtual machines, network; all of that stuff is already programmed in Magic Castle. Terraform is just a tool that allows you to describe what your infrastructure should be, using the HashiCorp language. So Magic Castle is first a set of Terraform configuration files that allow you to replicate a Compute Canada cluster. But once these instances are created, they need to be configured. How do we configure those instances? We use Puppet. Puppet is a configuration management tool that also uses its own language, the Puppet language. One of the advantages of Puppet is that on each instance there is an agent that looks at the actual configuration of the instance, and if something does not correspond to what the configuration of that instance should be, it is modified automatically. So with Puppet we have a sort of always-on-call automatic sysadmin that looks at the actual configuration and fixes it as much as possible by itself. That is going to be important for the next aspects.

Magic Castle does not pretend to be the only kid on the block doing HPC in the cloud. It's a very hot subject at the moment, and there are a lot of projects available. I listened recently to a podcast with Kelsey Hightower, a principal engineer at Google, and he came up with a very nice quote that I would like to read to you, because Magic Castle, in the end, is some form of DevOps tool. Here is what Kelsey Hightower had to say about the DevOps landscape: "When I think about the DevOps landscape, we have so many people, just like chefs in a restaurant, that are experimenting with different ways of doing things. Once they get it, they create those recipes. Those recipes in our world are source code. That's why we always have duplicates and similar projects, because there is going to be one ingredient that's going to be slightly different and make you prefer one over something else." So what I'm trying to do here is just list all of these projects, and I think there is a lot of potential for collaboration among them. The next slide covers the distinguishing features, the design principles, that I came up with when I built Magic Castle.
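As a quick aside before those principles, to ground the Terraform explanation above: the snippet below is not taken from Magic Castle itself, it is just a generic, hedged HCL sketch showing how Terraform describes infrastructure declaratively. The instance name, image and flavor are hypothetical; the resource type and attributes come from the standard OpenStack provider.

```hcl
# Generic HCL illustration (not Magic Castle code): you declare what should
# exist, and Terraform creates, updates, or destroys resources until the
# real infrastructure matches the description.
terraform {
  required_providers {
    openstack = { source = "terraform-provider-openstack/openstack" }
  }
}

resource "openstack_compute_instance_v2" "login1" {
  name        = "login1"        # hypothetical instance name
  image_name  = "CentOS-7-x64"  # hypothetical image name, for illustration only
  flavor_name = "p2-3gb"        # hypothetical flavor
}
```

Running `terraform apply` against a file like this, with provider credentials configured, would create that one VM; Magic Castle's modules do the same thing at the scale of a whole cluster.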
The first aspect, one that is not necessarily covered by the previous projects, is that Magic Castle does not have a custom command line interface. The interface to interact with Magic Castle is strictly Terraform; what we deliver as the Magic Castle project is just a Terraform module. All of our configuration is strictly managed with Puppet and a bit of cloud-init, but most of the configuration engine is Puppet, which is not common. The reason is that Magic Castle, yes, is meant to build HPC clusters in the cloud, but my idea was also to give back some of these modules to our Compute Canada ecosystem of HPC clusters, which also uses Puppet and can probably benefit from the modules that get built. This is why we went for Puppet instead of Ansible, Chef or anything else. The third aspect is a pet peeve of mine: SELinux should always be enabled. It has been enabled on Magic Castle since day one, and it has been painful. It is not fun to actually make things work with SELinux, but I've learned a lot while doing it, and there are always examples of where it actually saved my ass. Just yesterday we had the Baron Samedit CVE; because SELinux was running on Magic Castle, the sudo issue was not an issue for normal users. Normal users on Magic Castle cannot simply run sudo out of the box. So at some point those efforts will pay off, and I hope I can give back to Compute Canada and other HPC clusters on how to actually activate SELinux. And I wanted to make sure that I wouldn't be the single point of resources and information about Magic Castle, so I try to maintain, as much as possible, an extensive user documentation. Those are the guiding principles that I try to maintain while developing Magic Castle, and that are potentially distinguishing features compared to the other projects I've shown you on the slide before. But even if they are not, as I said, there's always room for collaboration and to inspire each other on this hot subject that is HPC clusters in the cloud.

So what do you get when you actually build a Magic Castle, when you download a Magic Castle release? The infrastructure is built from three types of instances that deploy services. The first type of instance is just a typical login node, to which you can connect with SSH over the internet. But since one of my mandates in Compute Canada was always to try to find new services that ease access to our HPC clusters, when you start a Magic Castle cluster you also get a JupyterHub with a whole series of tools, and you also get a Globus endpoint, again, to be able to teach users how to use Globus, for example. All of the main management services run on a single instance called mgmt1; we're going to see on the next slide what these services are, but there is the database, LDAP, the Puppet server, and all that stuff. And finally, you have the compute instances on which the actual jobs run. I probably forgot to mention it, but when we say a Compute Canada cluster, we mean a Slurm cluster: all across Canada we use Slurm as the scheduler, so this deploys a Slurm scheduler and not another scheduler. So when you want to actually use Magic Castle, as I said, you are using a Terraform module that is configured in a single file that we call main.tf.
When you download a tarball or a zip file of Magic Castle, this is what the release looks like. The first file you will find is main.tf. This is the file that we expect users to modify. All of the other files that describe the infrastructure are also available to modify, but main.tf is the first entry point for regular users. There are four sections in the main module file of Magic Castle. First you have to select your provider, but what we actually do is release a tarball for each cloud provider, so when you go on the Magic Castle web page you download the release for the cloud provider you need. Then you describe what kind of instances you need, then how many users, et cetera. And eventually you can say, well, I would like my cluster to be registered in a DNS, so you can configure the DNS. We're going to go through those four sections.

The first section is very easy: the source is the name of the provider. As I said, there is one tarball per provider; whether it is AWS, Azure, Google Cloud, OpenStack or OVH, all of these providers are supported out of the box, and when you download one, this line is already filled with the correct provider. The next step is naming your cluster, which I think is the most fun part of building a cluster: finding a funny name for it. And now you can do that multiple times a day, so I go through whole universes of names multiple times a year. You choose a cluster name, you choose a domain; ideally, you actually own the domain name, so Magic Castle will be able to create the DNS records for your cluster and you can use the domain name instead of the IP address to connect to it. You select the image: Magic Castle only supports CentOS-type images, CentOS 7 and 8. I'll cover later what the future brings regarding CentOS. Since we are mainly targeting clusters for training, you can automatically create a number of guest accounts, and Magic Castle is going to generate a random password that you can then distribute to your users. We'll cover later how you can create users without going through guest accounts. Then you specify your public key: when you create your cluster, you get an administrator account, and the only way to connect to that administrator account is with your SSH public key.

The next step is designing your cluster. You define the type of instances, so the size, number of cores and quantity of memory, for your management node, your login nodes and your compute nodes. As you can see, the compute nodes entry is a list, so you can create a heterogeneous cluster. At any point during the life of your cluster, you can modify the count manually; for example, if you would like more compute nodes, you adjust the number of compute nodes, reapply the main file, and the cluster adjusts by itself without any intervention from you, apart from modifying main.tf. Then you define some storage. The storage is, for now, pretty basic: an NFS storage of three volumes, one for home, the users' home directories, one volume for project and one volume for scratch. All of this, for now, is NFS, and you define the actual size in gigabytes of these volumes.
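To make those sections concrete, here is a minimal sketch of what a main.tf along these lines could look like. The variable names below are approximations based on this description; the exact interface differs between Magic Castle releases, so treat every identifier as illustrative rather than the actual module contract.

```hcl
# Illustrative sketch only -- names approximate the Magic Castle interface
# described above and may not match a given release verbatim.
module "openstack" {
  source = "./openstack"                      # one release tarball (and source) per cloud provider

  cluster_name = "phoenix"                    # the fun part: naming the cluster
  domain       = "example.org"                # ideally a domain you own, for DNS records
  image        = "CentOS-7-x64"               # CentOS 7 or 8 images are supported

  nb_users     = 30                           # guest accounts, each with a generated password
  public_keys  = [file("~/.ssh/id_rsa.pub")]  # SSH key for the administrator account

  # Instance shapes for the three roles; the compute entry is a list,
  # so a heterogeneous cluster is possible. Changing a count and
  # re-applying grows or shrinks the cluster.
  instances = {
    mgmt  = { type = "p4-6gb", count = 1 }
    login = { type = "p2-3gb", count = 1 }
    node  = [{ type = "c8-30gb", count = 5 }]
  }

  # Three NFS-exported volumes, sizes in GB
  storage = {
    home_size    = 100
    project_size = 500
    scratch_size = 500
  }
}
```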
Then, since every cloud provider has some specificities, there are variables available depending on the cloud provider you've chosen. For example, if you would like to attach a GPU on Google Cloud, there are variables you can set; for Azure, Google or AWS, you have to define the cloud region. So there are some provider-specific variables, but all of them are clearly explained in the extensive user documentation. Then, based on the domain name you have selected and all the previous inputs, Magic Castle can create the DNS records for you automatically, as long as your domain name is administered by either CloudFlare or Google Cloud DNS for now; we could add more DNS providers at some point if you need another one, but those are the main ones we currently use. Most of the input for the DNS module comes from the main module; all you have to specify is your email, and that's most of it.

Once you have completed your main.tf file, what you enter in a terminal next to your main.tf is just terraform apply. It asks you to confirm that, yes, this is the right infrastructure that you want to build, and it automatically creates the instances for you. Once these instances are created, they are configured by Puppet, and in under 20 minutes you get a completely working cluster. As I said, the configuration management is handled in two stages. The main bootstrap of Puppet is done through the cloud-init YAML file. What it does is upgrade all of the instances; the first step is always upgrading all packages, so every time you create a new cluster, you always get the latest revision of all the packages. It has caused me a lot of headaches to always support the latest versions, but I think it is the most secure and the best way of handling this. Then cloud-init installs Puppet, sets up the Puppet server, and waits for the certificates. Once Puppet is bootstrapped, everything is rebooted and Puppet handles all of the configuration: the different compute, login and management nodes communicate with the Puppet server, ask for their configuration, install the packages, configure the files, et cetera, automatically.

The configuration management is also helped by Consul. Consul is a service mesh that you probably won't find very commonly on HPC clusters. I use it to determine when services are available, and as a key-value store to make available information such as the compute instances' configuration. So, automatically, which is not normally easy to do with Slurm, we generate the Slurm node configuration file with Consul: when a new compute node comes online, it registers its configuration in Consul, and using a service called consul-template, the configuration file is updated automatically everywhere on the cluster. That means that at any point we can add and remove nodes from the Slurm configuration without any issues. We also use Consul to aggregate the CPU architectures: since on the cloud, and mostly on the Compute Canada cloud, you can get instances with AVX, AVX2 or AVX-512, we need to gather these to select a common set of modules that will work on every architecture across our compute instances, and we use Consul for that as well.
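Going back to the DNS piece described above, the corresponding block in main.tf could look roughly like the following. Again, this is a hedged sketch: the module path and attribute names are assumptions, and most inputs are simply wired from the main module's outputs, with the email address being the one value you really have to supply.

```hcl
# Illustrative sketch -- attribute names are assumptions, not a verbatim release interface.
module "dns" {
  source = "./dns/cloudflare"             # or a Google Cloud DNS variant

  email  = "you@example.org"              # your contact email, the main thing you provide
  name   = module.openstack.cluster_name  # everything else is taken from the main module
  domain = module.openstack.domain
}
```

With both blocks in the same main.tf, a single terraform apply creates the instances and the DNS records together, and cloud-init plus Puppet then take over the configuration as described.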
As I said, we can generate the Slurm configuration automatically by registering nodes in Consul. One of the nice aspects of this is that, for a heterogeneous cluster, we also automatically compute the weight of the different instances. So if you request a job without a GPU and with low memory, you are going to get the low-memory nodes first, and the GPU nodes only if everything else has been allocated. All of these weights are computed automatically by a small plugin I wrote combining Consul and Slurm.

So, what is an HPC cluster without software? As I said, two operating systems are supported, CentOS 7 and CentOS 8. I don't know what the future is for CentOS 8 in Magic Castle; will it be Rocky Linux? I don't know. I get this question over and over. In the end, the future depends on Compute Canada: Magic Castle will always try first to serve the interest of building and running workshops for Compute Canada and the new instances of Compute Canada in the future. So when Compute Canada steers in a direction, Magic Castle is probably going to follow in that direction, or support Rocky Linux plus CentOS. As I said, batteries are included: FreeIPA for user management, NFS, Slurm, a Globus endpoint, JupyterHub; of course there is the software environment provided by CVMFS; we have a noVNC desktop, Singularity. Everything is prepackaged and installed through Puppet, and if it is not, it is generally provided by CVMFS. I think most of you on the call are already familiar with CVMFS: it's a scalable, reliable, and low-maintenance software distribution service that is used in Compute Canada, but is now also used by the European Environment for Scientific Software Installations; I don't know if you've changed what the acronym stands for yet. And all that software is built with EasyBuild. This is how we can provide an HPC cluster in the cloud with thousands and thousands of software packages: just by mounting CVMFS, whatever the user needs at some point on their HPC cloud cluster is available. It's fantastic, and it's the starting point that gave us the idea of building an entire cluster in the cloud; otherwise, we would have to rebuild everything every time and it would be a waste.

There are more batteries included. As I said, you can create guest accounts, but now when you create a Magic Castle cluster, you also get a free sign-up portal: directly on your cluster, users and workshop attendees can register themselves with their email, create a password and their own username, and Magic Castle handles all of the business of creating the account. All of that through a fantastic project called Mokey, which builds a web interface on top of FreeIPA and was built by the University at Buffalo. FreeIPA also has a web interface through which, as an admin, you can manage users. Magic Castle is also a development platform for JupyterHub use cases, so it has spawned tons of plugins that are listed here, of which I'm the main developer. You can look them up: there is jupyterlmod, a web interface for Lmod through Jupyter, a web proxy for ParaView, and we also created a Slurm form spawner to launch jobs through JupyterHub on Slurm. All of that is packaged in a Puppet module, puppet-jupyterhub, that is currently used in Compute Canada.
As I said, I'm trying to seed ideas in Compute Canada, on the HPC clusters. Our clusters are now using the puppet-jupyterhub module that came out of Magic Castle to deploy their own JupyterHub on their HPC clusters.

But what if Terraform is too difficult? I got that point many times when I gave talks on Magic Castle, so it led to a new idea: what if we actually built a web interface for Magic Castle? We call it, it's not very original, just MC Hub. Last summer I had a brilliant intern who came up with this project, Magic Castle Hub or MC Hub, which again is also available on GitHub. The idea is to improve the workflow of Magic Castle, going from a text file, for which you have to install Terraform and download a release file, to a web app where you fill in a web form instead of a text file. Some of the options that you would have to enter as text in Magic Castle are now filled in automatically by MC Hub: the instance names, the quotas that are available, all of that is presented to you, and the only thing you need is a browser and an account on the MC Hub instance. It also validates that your quotas are sufficient for the size of the cluster that you would like to configure. As I said, all of the different parameters I've shown you before can be set with Magic Castle Hub: instance counts, floating IPs, confirming the Terraform plan, displaying the progress; all of that is done in Vue.js. So instead of verifying your quotas manually, you can look at a pie chart; instead of modifying a file, you just fill in a web form; and instead of reading documentation, well, now it's self-explanatory, so anyone can build a cluster. This is what it looks like from a user's point of view: you click on the create cluster button. There's also an admin view: if you are managing an OpenStack project, you can delete clusters for your users. MC Hub is composed of three pieces: a web front end in Vue.js, and a back end in Flask and Terraform. All of that is packaged inside a Docker container that you can start on your own desktop to run the application, and there's also an Ansible playbook that deploys a SAML-authenticated MC Hub. MC Hub currently provides a front end that will deploy a Magic Castle cluster in an OpenStack project.

So, key takeaways. Magic Castle is a mature project with a rich ecosystem of spin-offs and related projects, and it replicates an HPC cluster in a cloud with Terraform and Puppet. And once deployed, MC Hub, the new project, can be used by anyone: they don't have to be a sysadmin, they don't even have to be an HPC analyst; a simple HPC cluster user can deploy an HPC cluster on OpenStack using the web interface. There are tons of future directions. We don't have official user meetings or mailing lists, but I'm very active on the issues on Magic Castle. We are planning, for example, to look at how we could integrate OFED to support high-performance network connectivity in the cloud. We're also looking at the Lustre file system, automatic compute instance scaling, and eventually supporting external identity providers. As I said, the accounts are currently local to the cluster, but at some point we can imagine, for example, connecting Google authentication or Compute Canada authentication on the cluster.
And I think I'm done within the half hour. I can take questions now.

Yes, thank you for the talk. If there are any questions, people can raise their hand in the Zoom meeting. And we already have questions; I'll ask the first person to unmute.

Hello, can you hear me? Yes. I'm Sabri from Oslo. When doing training, it's very important that the users get as close a representation of the original cluster as possible: for example, mount locations, how to get access to data. So when you spawn these cloud instances, how do you handle the data transfers? Is a lot of data transferred to the cluster when it's initiated, or do you have something already in the cloud that you copy, for example shared data for a course?

For now, most of the courses that we teach in Compute Canada require very little data, and it is copied by the instructor or the person who instantiated the cluster in the hours or days before teaching the workshop. Maxime could tell you more about that, since he has done it quite a few times in Compute Canada using Magic Castle clusters. But for now, we haven't taught a lot of workshops that require a lot of data transfer to the cloud, so the issue of transferring data hasn't come up with our system.

Is there anybody else? Kenneth has a question.

You mentioned that one of the guiding principles you try to stick to with Magic Castle is that you're using Terraform and Puppet as the front end, as the user interface. I noted down a question that I was going to ask you: do you feel that's limiting in terms of letting people use Magic Castle, since they have to learn both a bit of Terraform and a bit of Puppet? But I guess the MC Hub thing is largely an answer to that.

Yeah, you're right. Coming from, for example, HPC analysts inside Compute Canada, I had to answer questions about this. The first limit, I would say, is input validation inside the Terraform module. HashiCorp has started to add features where you can at least do some form of static validation on the inputs, but it is not really up to speed; you can always input the wrong instance name or the like. Currently, the idea is to use Magic Castle Hub as the next step in building this auto-filled and error-free interface for Magic Castle, based on our experience with our users inside Compute Canada. So yes, it is kind of limiting. But what I want to avoid, and Magic Castle Hub has some other guiding principles so it can avoid these limits, is being the one bugged with new API versions: I wanted to make sure that the Terraform provider developers were the ones dealing with that, and not me. If there is a new version of the Google Cloud API, I just want to update my Terraform files a bit, while all the handling of talking to the REST APIs is done by Terraform.

Yeah, that makes total sense. But you could have a lightweight CLI, an MC CLI, so that people don't have to get into the guts of Terraform and Puppet, because those are two tools that may be a bit of a limiting factor, too big a hurdle for people to start using it.

Normally, people who only mean to use the cluster as it is shouldn't have to touch Puppet. Most of the time that's a correct assumption; at some point, everyone would like to modify something.
But there are also some aspects of configuration available through YAML files, which I haven't touched on, that can at some point let you tell Puppet what should be configured. But I agree, maybe a lightweight Magic Castle CLI could happen at some point. I just need to get some feedback on what the main pain points are on the command line that are not so painful that you actually need an entire user interface, like what we propose with MC Hub.

Then, since there are, I think, no other questions and we do have time: you mentioned a couple of other projects, and one of them will be presented tomorrow, Cluster in the Cloud. Can you say something about the key differences between Magic Castle and some of these? You mentioned that some of the things you want to work on here are covered in other tools already, like the scaling thing, which I think Cluster in the Cloud does pretty well. But are there projects that have good support for fast interconnects, or things like Lustre; is that out there already?

That, I don't know. I think one of the key differences between Cluster in the Cloud and Magic Castle is, and I could be wrong, that Cluster in the Cloud actually instantiates a single virtual machine first, from which it deploys the entire cluster. Magic Castle tries to deploy all of the instances up front, and then, if you'd like more instances, you go through the main text file. And Cluster in the Cloud uses Ansible to configure the different instances: the compute nodes are configured with Ansible from that first instance, instead of using Puppet, for example. Those are, I think, key differences. If you prefer Ansible, I think Cluster in the Cloud is probably the other best solution out there. Then it comes down to different features, and probably, as I said, the main idea behind Magic Castle was always to provide clusters for workshops; it then evolved into potentially bigger, more static clusters in the cloud. So automatic scaling has not been an issue so far and has not been requested that much. Also, when I mentioned SELinux, there is, in the back of my mind, the goal of making a Magic Castle cluster as secure as possible. For automatic scaling, at some point you have to provide keys that allow your cluster to interact with your cloud provider. Yes, it can save you costs, since you are only deploying instances when you need them. But if at some point some miner finds your cluster and hacks it because of some issue, and your cloud keys are sitting in the clear on your cluster, they're going to find them and it's going to cost a lot. There are probably solutions for that, and if either Cluster in the Cloud or other projects find secure solutions for those problems, it would be interesting to collaborate on those issues, because how to securely manage automatic scaling of instances is, I think, an open problem.

Well, I think, and I could be wrong, and I'm sure Matt will explain it to us tomorrow, but the automatic scaling that Cluster in the Cloud does doesn't actually need the keys or the password for your cloud instance. What it does is really set up a static Slurm cluster.
So at that point it does need your key, but I think you can just give it the key whenever it needs it; it doesn't have to save it. It sets up the whole Slurm cluster and then it uses, or abuses, the Slurm power saving features: if no jobs are going to the nodes, it just powers down the nodes. So the VMs are sitting there, but they're not running, and I think it can just auto-start them, and for that it doesn't need to talk to the AWS API or anything like that. And Cluster in the Cloud does have, I want to say, a proper CLI; that's not really true, it's scripts left and right, but it does give you a front end to talk to, so you don't actually have to touch anything Ansible. So if you don't like Ansible or Puppet or whatever, it's not really relevant; you can give it a script to install additional packages, and it's just a regular bash script that it picks up if it finds it. But yeah, I think there are some ideas there that you can probably use here as well. If you're interested in that for Magic Castle, then setting up a discussion with people like Matt, who work on very similar tools, is probably worthwhile, especially for the security thing.

That's a good point.

Okay, I don't see any other questions. Anything in Slack, Simon?

Alan had a small question, but I think Maxime is handling that in the Slack channel. So, I think it was your question about the logo: the unicorn and rainbow combination, is that the logo of Magic Castle?

No. If it's not, it should be. Yeah, I would need to buy the rights to that image before actually making it the logo. But it is a recurring question: when do we get an actual logo for Magic Castle? It used to be the Disney castle, and then we were afraid of being sued.

If this becomes the logo, I want to have stickers.

I should contact a proper graphic designer; they always come up with very nice icons and graphics and stuff. I should hire them.