Okay. Hello, everyone. My name is David Bugeau. I'm a project manager and software developer at the Canadian Centre for Computational Genomics here at McGill. Today I'll give you a short lecture on how to launch a job on an HPC, and more specifically on the Compute Canada HPCs. A lot of what I will present is quite specific to Compute Canada, but it can also be applied more generally to other resources such as Amazon Web Services and so on. So yeah, there we go.

So first of all, what is an HPC? HPC stands for high-performance computing. Basically, if you think about the traditional in-house servers that labs have had for many years for their computational needs, well, with the ever-increasing amount of data generated by high-throughput sequencing and those things, these resources can quickly get overwhelmed. For example, here at the genome centre at McGill, we purchased our own compute cluster in 2012, I believe, and with the ever-increasing amount of sequencing data that kept coming, it quickly became overwhelmed and we needed to turn to other solutions.

In Canada, we have Compute Canada that comes to the rescue. Basically, these are free compute resources offered to Canadian academia and the research field. Again, in this case I'm talking about Compute Canada, but there are also many other options which are available commercially.

So an HPC such as the ones at Compute Canada is basically a cluster of computers: many high-performance computers put together to execute a lot of jobs at the same time. Each individual computer in the cluster is called a node. And what is Compute Canada? Well, it's a national platform of many centres, each with a lot of these high-performance computers in the same place. People in research can connect to them and launch compute jobs with very high needs in storage space, very high needs in RAM for execution, and those things.
And so yeah, there are many centres across Canada, although right now Compute Canada is going through a kind of restructuring, so we're starting to see fewer sites, but with many, many more machines. At the moment, there are six provincial, or local I may say, consortia within Compute Canada, which are ACENET, Calcul Québec, SciNet, HPCVL, SHARCNET and WestGrid.

So, concepts connected to Compute Canada. First of all, these are, again, shared resources for Canadian academia. When you subscribe, you get access to free compute resources, free storage space and these kinds of things, and you get a yearly allocation. So when you apply, you get an account with, depending on the place where you apply, a couple of terabytes of storage space, and you get compute time to execute jobs. And when I talk about jobs, by the way, I mean launching a piece of software that will do a specific task that you ask it to do. All of the bioinformatics tools that we cover in this workshop can be launched as jobs on these HPCs. So when you apply, you get a yearly allocation with compute time and storage space, and when you launch a job, when you launch software on these HPCs, it counts toward the yearly allocation that you get.

These slides are available online as well, so if you're interested, you can follow the URL provided here. The way to apply is to go to the Compute Canada website. You click to apply for an account, then you give your information. A couple of days later, you should have access to the Compute Canada portal, and then you can apply to a regional consortium from the list that I gave you before. After that, you will have access to the HPCs of one of these regional sites.

So, concepts connected to Compute Canada accounts. When you log in for the first time, well, when you log in at any time on an HPC, you are on a login node.
So these are the nodes through which you get into the HPC from the outside, and that's the place from where you will launch compute jobs. Login nodes are the HPC entry points.

When you launch a job, when you launch a software execution, it is put on a scheduler. These compute resources are shared among a lot of people. So the way to make sure that not everybody launches software at the same time and makes things crash in the end, for example because the servers run out of memory, is this concept of a waiting queue, of a scheduler. When you launch a job, it goes into the waiting queue, and then you wait for your turn. Depending on how many resources you asked for to execute your job, it will take more or less time. When your turn comes, your software will start executing, and you will be notified, for example by email if you choose so, once your job is completed.

So yeah, the scheduler is a queue in which computation jobs wait for available compute nodes. Resources are limited, so jobs should always be launched on the scheduler. The HPC system administrators don't like it when people launch jobs on login nodes, because even though those nodes have a high amount of resources as well, it just makes the system slow for everyone. So when you launch a job at Compute Canada, always make sure it goes on the scheduler.

And again, the time that you will wait in the queue depends on many things, such as how many people have submitted jobs and how long you think it will take for the job to execute. You can put a limit on that. If you know you will do the alignment of a FASTQ read set and you expect it to take two days, you can specify to the scheduler that the job will probably take two days, or maybe three at most, something like that. Somebody who says their job will probably take seven days might wait a little bit more.
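To make the waiting-queue idea a bit more concrete, here is a minimal Python sketch. It is a toy, not how any real scheduler works: it assumes a strict first-come, first-served queue, a fixed number of identical nodes, and run times that are known exactly. The function name `simulate_fifo` and the job durations are made up for illustration. Real schedulers also weigh requested resources and remaining allocation, which this toy ignores.

```python
import heapq


def simulate_fifo(job_hours, n_nodes):
    """Return (start, end) times in hours for jobs served in submission order.

    Toy model only: one job per node, no priorities, exact run times.
    """
    # Each entry is the hour at which one node becomes free again.
    free_at = [0.0] * n_nodes
    heapq.heapify(free_at)

    schedule = []
    for hours in job_hours:
        start = heapq.heappop(free_at)  # the earliest node to become available
        end = start + hours
        heapq.heappush(free_at, end)    # that node is busy until the job ends
        schedule.append((start, end))
    return schedule


# Hypothetical run times, in hours, in the order the jobs were submitted.
jobs = [48, 2, 10, 1]
schedule = simulate_fifo(jobs, n_nodes=2)

for hours, (start, end) in zip(jobs, schedule):
    print(f"job needing {hours:>2} h waits {start:>4.1f} h, finishes at {end:>4.1f} h")
```

In this toy run the one-hour job waits twelve hours, simply because longer jobs were submitted ahead of it and only two nodes exist. That is the kind of waiting the queue produces when many people share the same machines.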
So you can get a bit of priority by gauging well the amount of time you think the job will take to execute. Same thing for the amount of RAM and the number of CPUs needed. The time you will wait also depends on the remaining allocation you still have at Compute Canada. And yes, when you launch a job, you can control these parameters by specifying the amount of time you want and those things.

For the workshop today and tomorrow, we will use software that's pre-loaded through CVMFS on the server that you will connect to. CVMFS is a distributed file system that was originally implemented at CERN for the Large Hadron Collider experiments, for distributed computation, where many sites each compute a part of the job that needs to be done. So CVMFS was created as a way to distribute virtual machines across different sites. The GenAP project adapted this to distribute bioinformatics code and libraries, such as reference genomes and so on. A lot of the Compute Canada HPCs, and the workshop server you will use today and tomorrow, have CVMFS available, which means you have the same catalogue of software and reference libraries available at all places.

One important concept for today and tomorrow is that we will make use of modules to select the software that we want to use. By default, not all the genomic software that's available in CVMFS is readily accessible from the command line. You have to load the modules that you want to use in order to execute your software properly. We'll cover that in the lab that we will start in a couple of minutes, but just to say, to get a list of all available modules with bioinformatics software and those things, you can run the command module avail, and that will give you the list of software with all of their versions. That gives you an advantage.
CVMFS sometimes has multiple versions of the same software loaded. If you're used to one specific version and you want to run all of your analyses always using the same version of the software, which is quite important because there are differences from one version to the other, you can. The versions of the software which are in CVMFS will always be there. Even if there's an upgrade and we add more versions, you can always load the same module, and your execution will be the same for all of your samples.

So to use one of these modules, you type module avail, you get the list of available software, and then you can choose which module to load with the command module use. No, module load, sorry. You type module load to load the software, and module use to bring in a specific list of software, but mostly the last two commands are important for this workshop.

So you need to load modules, as I said, to launch jobs on the scheduler. And okay, one last thing: for this workshop today, we will not be launching jobs on a scheduler.
We have enough resources for everyone to use all the software that we will use for this workshop, all at once. But usually, when you launch a job on a scheduler, you will use the qsub command. There are a lot of parameters for qsub. As I said before, you can specify the number of cores and the wall time, which means the amount of time you think the software will take to execute.

It's important to set these numbers properly. If an assembly is going to take three days to complete, and you say "I think it will complete in 10 hours" and it doesn't, well, after 10 hours the job gets killed and then you need to start again. So that's the trade-off: less wall time means quicker processing in the queue, but the job gets killed when that time runs out.

There's a default wall time of 48 hours, I think, in most Compute Canada HPCs. It's fine for most needs, but for example, if you launch FastQC with that default, well, FastQC will usually not take 48 hours to execute, so you'll just wait too long to get your execution done, for nothing.

Any more questions? So that's it then. That was just a short introduction to Compute Canada resources.
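As a rough illustration of how module loading, cores and wall time come together in a submitted job, here is a small Python sketch that writes out the text of a job script. It assumes a PBS/Torque-style scheduler, where qsub reads `#PBS -l nodes=1:ppn=N` and `#PBS -l walltime=HH:MM:SS` directives. Your site may use a different scheduler or different options, so check its documentation. The helper names, the module name `fastqc/0.11.5` and the file names are hypothetical placeholders, not taken from the workshop.

```python
def hours_to_walltime(hours):
    """Convert a number of hours into the HH:MM:SS form used for wall time."""
    total_seconds = int(round(hours * 3600))
    h, rem = divmod(total_seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"


def build_job_script(command, modules, cores=1, walltime="01:00:00", email=None):
    """Return the text of a job script (assumed PBS/Torque-style directives)."""
    lines = [
        "#!/bin/bash",
        f"#PBS -l nodes=1:ppn={cores}",   # one node, `cores` cores on it
        f"#PBS -l walltime={walltime}",   # job is killed once this is exceeded
    ]
    if email:
        # -M sets the address, -m ae asks for mail on abort and on end.
        lines += [f"#PBS -M {email}", "#PBS -m ae"]
    # Load every required module before running the actual command.
    lines += [f"module load {name}" for name in modules]
    lines.append(command)
    return "\n".join(lines) + "\n"


# Hypothetical example: a short FastQC run that asks for 2 hours, not 48.
script = build_job_script(
    command="fastqc sample_R1.fastq.gz",   # placeholder input file
    modules=["fastqc/0.11.5"],             # placeholder module name
    cores=1,
    walltime=hours_to_walltime(2),
)
print(script)
```

You would save the printed text to a file, say fastqc_job.sh, and submit it with qsub fastqc_job.sh from a login node. Asking for two hours instead of the default reflects the point above: a realistic wall time can shorten the wait in the queue, as long as it still leaves enough margin for the job to finish.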