Has anyone done analysis on the cloud before? OK, so that's going to be new for most of you, and that's what we're going to talk about in this intro lecture. Basically everything we're going to be doing, except for a few small things, will be done on computers that we've activated on the Amazon cloud. This has a lot of advantages for us and for you: it allows us to prepare things in a consistent and stable way, so that we know everyone who successfully logs into the cloud is having essentially the same experience. In the past, we used to either provide pre-configured laptops or have people bring their own laptops and muddle through the myriad OS issues that come with people running Windows, Mac, and Linux and trying to get software installed. You were asked to install a couple of things like IGV, and some of you may have experienced, even just with that, the challenges that come with asking people to install software on all these different systems. But for the most part we don't have to deal with that, because we're doing everything in the cloud. This is a lecture that was originally prepared by Francis Willett, who is the commander-in-chief of these bioinformatics workshops, and it has since been modified. We call it module zero because it comes before the other modules. Sometimes we go through a tutorial where we actually show you how to use the Amazon console and set up the instance yourselves; we do that in workshops where we don't have the amazing services of Zhibin, who for these workshops basically does that for you.
So I'm going to go through that setup just so you have a sense of what he did to create these cloud instances you'll be logging into, but you won't actually have to do that part of the workshop, because it's been done for you. Then we're going to learn how to log into the instance that's been set up for you. In this module we'll go through some of the very basic concepts of cloud computing, introduce some of the providers, review the process of using the console to create an instance (which, again, has been done for you), and then show you how to log in. One of the motivations for moving to the cloud is the problem of disk capacity, and to a certain degree compute capacity, versus sequencing capacity. You've probably seen this graph before; it's a common topic for conference bingo, the Moore's Law slide about sequencing. Before next-gen sequencing, the number of base pairs you could sequence per dollar grew roughly linearly, and more slowly than the number of megabases of disk storage you could buy per dollar. In other words, storage was getting cheaper much faster than next-gen sequence data could be produced. But when next-gen sequencing came onto the scene, meaning the high-throughput instruments you're all probably using, since Illumina pretty much has 95% of the market, production capacity increased so fast that the lines crossed. Now sequencing data is being produced faster, and more cheaply, than we can effectively store it in many cases. At our institute there are times when it's cheaper for us to throw away the data and generate it again than to store it long-term. So that's definitely a challenge.
Cloud computing can be part of the solution to that. We more or less hit the $1,000 genome in 2015; Illumina claimed victory for that milestone. Of course, to get the $1,000 genome you need a HiSeq X Ten cluster, where each of the ten machines costs about a million dollars, and you need to run something like 20,000 whole genomes per year. If you do that, your price averages out to $1,000 per 30X genome, before you've done any analysis or anything else. And we can start to think about the $100 genome. The pace has slowed a little, I think, but sequencing will keep getting cheaper. The doubling time of sequencing capacity has been something like five to six months, while the doubling time of storage and network bandwidth is something like 12 months, and CPU speed improves even more slowly than that. I'd say we're almost past the point already where the cost of sequencing a base pair equals the cost of storing a base pair. So what is a general biomedical scientist to do? There's more and more data, and many labs suffer from poor IT infrastructure. Where do you go? You can write more grants to try to buy and build bigger hardware, but I think a lot of people are realizing that the solution is the cloud. This slide lists a number of cloud providers; it's a little out of date, so there are probably half a dozen others, but these are still the major players. We're going to use Amazon AWS in this workshop, though Google Cloud, DigitalOcean, Microsoft Azure, and others would be equally good and offer a very comparable experience. AWS stands for Amazon Web Services. We're listing just a few of their services here; if you go to AWS you'll see a huge, almost overwhelming list of compute, storage, and other services.
The ones we're going to use are S3, one of their scalable storage services, and EC2, their Elastic Compute Cloud. These let us buy compute and storage on an hourly and per-gigabyte basis, just for the duration of this course, and then turn it all off. Amazon actually provides a grant that covers the cost. The idea is that it's ready when you are: they have huge facilities, multiple football fields of high-performance computing clusters, at different locations around the world, and they expand them as needed to satisfy global demand. We're basically logging in and using one small chunk of compute in one of these huge compute factories. Now, some of the challenges of Amazon cloud computing. It's not that cheap. A lot of people will tell you it's cheaper than setting up your own cluster and storage infrastructure, but of course that depends on exactly how you calculate it. Many of us work at academic institutes that heavily subsidize things like electricity, and perhaps even maintenance and support for compute. When you factor in those subsidies, in some cases it may still be cheaper to keep doing what you're doing. But cloud pricing continues to come down, so even that argument is getting harder to make. Getting files to and from the cloud can still be a challenge. It comes down to the size of the pipe your university, company, or institute has to the internet, and with next-generation sequence data we're talking about potentially huge files. All of us have internet at home, and you've probably experienced times when Netflix was slow because too many people in your neighborhood were watching and there wasn't enough bandwidth.
If you've ever done anything a little more sophisticated at home, like setting up a web server or sending someone a large file, you've noticed that upload speeds are much slower than download speeds. And you need to move files in both directions: you have to get your raw data to the cloud, do your analysis, and at some point you'll probably want a copy of the results back. So you'll face both upload and download challenges, and you may find that the overall connection your university, institute, or company has to the internet, which you need to access these cloud computing resources, is sometimes not fast enough; you may spend days getting your data up there. A lot of universities and institutes are investing in widening their pipes to prepare for the reality that we're going to be moving most of our data to the cloud and doing our compute there. So it's not the best solution for everybody; there are still a lot of pros and cons. But the temperature of the room is that it's inevitable: we're definitely moving towards working on the cloud, so it's a good idea to get introduced to it now, because it's only a matter of time. There are some issues, though, especially privacy concerns for anyone working on human data. For example, we work on human cancer genome data, and this is probably the biggest remaining barrier: the concern institutes and hospitals have about putting their data outside their control. There are challenges and limitations to keeping data in your own building, but you at least feel like you're in control. The boxes are in a closet somewhere, maybe in your hospital basement, and you can lock access to that room and try to secure it.
And you can know that you're compliant with certain health, privacy, and security regulations. In the States these include HIPAA, under which personal health information (PHI) is protected, with severe penalties for exposing anyone's PHI to the world, or to anyone apart from the patient and certain authorized people like their doctor. So that's a big issue. The cloud providers are becoming HIPAA compliant, essentially developing the security protocols to make sure your data will be safe there, and institutes are forming agreements with cloud providers. Those agreements mostly have to do with liability: everyone recognizes that something could go wrong and some data could be compromised, and if the data is on the cloud, the question is who's liable. So Google or Amazon or Microsoft and your institution have to come to an arrangement so that everyone is comfortable with what happens if things go wrong. One last point on this: a lot of people are nervous about putting data on the cloud for these reasons, but people maybe don't realize that their data isn't necessarily safer in their hospital basement. It might feel like it is, but it's probably easier to hack that than to hack a computer at Amazon. This is Amazon's business, so they are probably among the leading experts at securing their systems, and I think eventually that argument will win out. It's also interesting for you in Canada: another concern I've heard is that you're not just sending your data to an external third party, but to a third party in another country, a country that doesn't have a great record of respecting the privacy of its citizens' information. I think that's a really legitimate concern for non-U.S. actors.
When we're in the States, we're stuck with the NSA no matter what, so it's kind of a moot point. But outside the States it's definitely a legitimate question whether you want to put your data on servers where, especially as non-citizens, the NSA respects your rights even less than it respects Americans' rights. I don't know what they would do, but they could be looking through the patient data, so that's another issue that has to be dealt with. I suspect companies like Amazon will have to open Canadian and other national branches so that you're dealing with a domestic partner in some sense. If you're not working with human data, probably none of these concerns really matter to you, but it's something we think about all the time with human data. So those are some of the challenges. One of the advantages is that Amazon is very generous in supporting education grant awards. Of course, this is somewhat self-serving: you get exposed to AWS here, use it, go back to your lab wanting to try out the cloud, and since this is what you're already familiar with, you become Amazon customers. That's the trade-off. It's also getting easier to transfer large files: Amazon makes uploading many files free, and a large number of bioinformatics data sets you may want to use are already on the cloud. That solves the problem of needing the bandwidth to get your data onto the cloud in the first place. This applies to large public data sets like the 1000 Genomes data, TCGA data, or ICGC data; a lot of that is already on the cloud. There are also many useful bioinformatics machine images. We're going to use an Amazon Machine Image (AMI) that Zhibin has set up.
There are many others, like CloudBioLinux and CloudMan, and there's actually one specifically for this course as well. With these, you can start a computer on Amazon that already has all of the software you want installed, which is a big advantage: you don't have to spend time setting up a computer from scratch, and you can pick one geared towards your application. In this workshop, some of the tools and data are on your own computer (not much; you'll probably use IGV and maybe FastQC locally), some things are on the web (the wiki and all the resources there), and then there's the data and compute you'll be using on the cloud. Hopefully by the end of this workshop you'll be comfortable moving between these different spaces and finding the resources you need. There are different ways of interfacing with the cloud. We're going to use the command line: we'll log into a computer using a protocol called SSH, and then it will be just as if you were at your own very powerful Unix box, no different from logging into a server in your lab. But there are also web-based ways of interacting with the cloud. For example, you can spin up Galaxy instances on the cloud and then access them through a web browser; we're not going to do that in this workshop. In fact, pretty much any time you go on the web you're interacting with the cloud, since most web services now run there, so you may not have realized that you've already been operating on the cloud for some time. So, what we've set up: we've provided data files on an FTP server, we set up these Linux instances and loaded them with a bunch of next-gen sequence analysis software, and we then cloned this computer to make separate instances for everybody in the class.
So this is what Zhibin did to make your lives easier. We're going to give you a key and tell you how to log into this computer, and then you're good to go. For these instances we've set up very simple security, so you definitely do not want to put any PHI or human sequence data on them. They are basically wide open: they have the same login and file access for everyone, so you could all trivially access each other's instances, and they're open to the web, which is convenient for us because we can download files through a browser, but would clearly not be secure if you had sensitive data there. Keep that in mind: if you're setting up a cloud instance for yourself, this is not a security model you would want to follow. There's a lot of documentation. Malachi, with some help from me (but mostly Malachi), developed a really nice intro to AWS cloud computing. The best thing about it is that it walks you through some of the same steps I'm about to show using the Amazon console, but it also goes through a lot of the terminology. There are a lot of acronyms and terms thrown around when you start working with AWS: S3 buckets, S3FS, instance types like m3.xlarge, and so on. He did a really good job of explaining all of that and answering some of the common questions a naive user would ask but that are not actually easy to answer from Amazon's documentation, which is just too massive. He summarized it all in one place, including questions about billing, like how do I know how much this is going to cost me? Those are really good resources; the link is there, and it's also at the RNA-seq wiki, and then we have a link to the console there.
So, logging into the Amazon AWS console: this is the part you don't have to do, but would have to do if Zhibin weren't here. It's actually not much different from logging into Amazon to go shopping; if you have an Amazon account, you can use the same credentials. You have to agree to the AWS agreements (I'm sure there's a box to check) and provide permission for your credit card to be charged. So be cautious if you're playing around with this: when you set up your AWS account, there's a lot of free stuff you can do, but beyond that you get billed on an essentially per-minute basis for everything. They want your credit card pretty much right away; even to do the free stuff, they want access to a card in case you stumble into the non-free stuff. You would go to the AWS console, create an account if you don't have one already, and sign in. When we do this, we provide a kind of sub-account on our account; that's why this login page looks a little different. These are the services I was telling you about before. Amazon has a huge number of different services, and I don't know what half of them do, but right at the top left is EC2, the one we're using. That's the elastic compute we've created instances of for you to log into. So if you were doing this yourself, you would select the EC2 option. You'll also notice in the console that there's a region; I don't know if you can see it in the small print, but this one says Oregon. Those football-field-sized data centers exist in physical locations, you connect to a specific one, and it matters a little which one you connect to.
If you look through the pricing, you can see that sometimes, for whatever reason, storage is cheaper in the Oregon region or in the California region. I think Oregon is often the cheapest, so that's what we usually use; most of the pricing is much the same. The region also matters because whatever you set up there will be closer, in network terms, to people in that region, which affects the speed at which they can access, for example, a web server. Say we create a website and set it up in the Oregon region: people in that geographic area, because they have fewer network hops to get there, will have slightly better speed and performance accessing it than someone on the other side of the planet. Big companies providing worldwide services set up their web services in all of these locations so that everyone has access at optimal speed, though of course that means paying for it multiple times, so there's a trade-off. For our purposes it really doesn't matter; we just pick a region and stay there. Once you're at EC2, you see your console, and there's a big blue button that says launch instance. Launching takes you through a wizard, not so different from many other things you've done online or in other software, where you select options about what kind of instance you want. In this case we choose an existing AMI, the cloned image that was set up for you. That's a whole other aspect of this: even if we had you set up your own instances, we wouldn't expect you to build the instance from scratch. We logged into Amazon, started a very plain Linux instance, installed a bunch of tools, and then froze it and saved it in what they call the community AMI section.
If we had you do this, we would have you go to the community AMIs and search for the instance we pre-set up for you; once you find it, you select it and configure it. You can run that image on all kinds of different compute options: you might want lots of memory, lots of storage, or many CPUs, and all of these configurations have different pricing. We chose something intermediate. I'm not sure what Zhibin chose for this workshop, but we've typically used one called m4.2xlarge, which, if I can read the tiny print, has 8 CPUs, 32 GB of memory, and no pre-specified storage (we configure storage separately). There are around 50 different options, and I think you can customize even further than that. There's also an option to protect against accidental termination. This is something you have to be careful about, especially with a large number of student instances on one account; checking it gives you a little extra protection against accidentally killing the computer you set up. If you start an instance, do a bunch of analysis, and then someone accidentally terminates it, you stop paying for it the moment it's terminated, but the resources, including the data on that virtual hard drive, go away and become available to someone else, so you'd lose your work. You have to be really careful about stopping or pausing your instance versus terminating it. If you activate this option, it adds an extra warning step: are you sure you want to do this, because you're going to lose your stuff? Next we would add some storage. Just as we created an operating system image with pre-installed software, we also created virtual hard drives with pre-loaded data, and we let the students find and attach those volumes.
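Everything the console wizard does can also be driven from the AWS command-line interface. The sketch below shows what an equivalent CLI launch might look like; the AMI id, key pair name, and other values are placeholders, not the course's real identifiers, and the command is guarded so it does nothing if the AWS CLI isn't installed or configured.

```shell
# Sketch of launching an instance from the AWS CLI instead of the console.
# ami-12345678 and CBWCG are placeholder values, not real identifiers.
if command -v aws >/dev/null 2>&1; then
    # --disable-api-termination is the CLI equivalent of the console's
    # "protect against accidental termination" checkbox.
    aws ec2 run-instances \
        --image-id ami-12345678 \
        --instance-type m4.2xlarge \
        --count 1 \
        --key-name CBWCG \
        --disable-api-termination \
        || echo "Launch failed (expected without real credentials and a real AMI id)."
else
    echo "AWS CLI not installed; command shown for illustration only."
fi
```

The same CLI also has `aws ec2 stop-instances` and `aws ec2 terminate-instances` subcommands, mirroring the stop-versus-terminate distinction above.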
It's basically like handing you an external hard drive with all this data on it and saying, plug it into your computer, except it's all done virtually. You can give your instance a name. We do this in workshops where students set up their own instances, so they can tell which instance is theirs; otherwise instances have long, hard-to-remember names made of random letters and numbers. Then we set the very permissive security group we pre-configured, a simple configuration that allows SSH and HTTP access, in and out, to anyone. Those are all the options; then you review and launch. It warns you that this is not a free instance and you're going to have to pay for it, it tells you that your instance is not secure, and then you can hit launch. When you launch the instance, it asks you to pair it with a key. This is the one piece of security we do require: a file you download, which you must possess to access the instance. This is how it's going to work for this workshop: we'll tell you where to find the key, you'll download it, and you'll use it to log into the instance we've set up for you. Then it says your instances are now launching, and there's a button called view instances that takes you to the main console view, where you can see a list of the instances running in your account. This is your dashboard, the thing linked to your account that tells you what's going on in Amazon. You can browse through the options on the left and see which EC2 instances you have running, what storage you have spinning, and so on; this is also where you would stop or terminate your instance. There's also a place where you can take note of your instance's IP address.
That IP address is one of the ways you can log into your instance. As soon as the instance is created, an IP address is assigned to it, just as your computer at home was assigned an IP address when you got Shaw cable or Charter or whatever, and you can use that address to connect to the instance. Again, to make your lives easier, Zhibin usually maps the IP addresses to more human-friendly names, like cbw1 or something like that, so I think you're going to use a domain name to log into your instances instead of the IP address. So let's try to connect to our instances now. OK, I think everybody is connected, so let's quickly go back through what you just did to make it hopefully make a bit more sense. You just downloaded a key file. We're doing this a different way this year, so the way you did it looks a little different from the slides. Before, we had a private wiki that only you could access with a password, and you could just download the key from there. Now the wiki is public, so instead you have to enter a password when you try to download the file. The reason is that this key file is what gives you access to these instances, and even though we talked about how insecure they are, we don't want it to be that easy for other people to hack in. There would probably be bots roaming around looking for unsecured computers, and they would interfere with the course if we didn't have a key file. The key file you downloaded is called CBWCG.pem. If you go to the command line, find it, and use cat to print its contents, you'll see it's just a huge long string of random-looking numbers and letters. It's like a really strong password that tells the computer you're connecting to that you have permission to access it.
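If you're curious what such a file looks like without opening the course key, you can generate a disposable key pair in the same PEM format yourself. The demo key below is a throwaway with no relation to the course key, and `demo_key` is just a made-up filename.

```shell
# Generate a disposable RSA key pair in PEM format (the same format as
# the .pem file Amazon gives you; this one grants access to nothing).
ssh-keygen -t rsa -b 2048 -m PEM -f demo_key -N "" -q

# The private key is a text header plus a long block of base64 data,
# which is the "huge string of random numbers and letters" you see.
head -2 demo_key

# Clean up the throwaway pair.
rm demo_key demo_key.pub
```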
That works because, when we created the instance, there was that option to associate a key with it. We associated this key with the instance, and we used the same key for all of you so that we didn't have to have every person download a different key file; it's secure for the group, more or less. Then there was also a step where we changed permissions, because we want the key file itself to be protected: we don't want other people to be able to read or change it, since if it changes it stops working. One of the requirements of the way Amazon is set up is that this key file have certain restrictive permissions. In Linux, permissions are broken into read, write, and execute, represented by r, w, and x, and through a formula from the early days of Unix you can specify these permissions with numbers: 4 gives read access; 4 plus 2, which is 6, gives read and write access; and 0 gives neither read, write, nor execute access. You apply these permissions at three levels: to you, the user or owner; to a group you belong to; and to the whole world. So when we say chmod 400 or chmod 600 (chmod is the command to change a file's permissions), we're granting certain permissions to you, the owner, and no access to everyone else. I don't know whether you created a separate folder for your key file or just left it in your downloads folder, but you should have run this chmod command, and that then allows you to use SSH, a secure connection protocol, to connect to the instance using that key file. The ssh command is very particular about setup: ssh -i specifies the location of your key file, and ubuntu is the name of the user.
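As a concrete example of the numeric permission scheme, here is the same chmod step on a stand-in file (substitute the path to your actual key file):

```shell
# Make a stand-in for the key file (use your real .pem in practice).
touch CBWCG.pem

# 600 = read (4) + write (2) for the owner, nothing for group or world.
chmod 600 CBWCG.pem
ls -l CBWCG.pem    # permission string starts with -rw-------

# 400 = read-only for the owner, nothing for anyone else; the strictest
# setting that still lets ssh read the key.
chmod 400 CBWCG.pem
ls -l CBWCG.pem    # permission string starts with -r--------

rm CBWCG.pem
```

If the key is left world-readable, ssh will refuse to use it with an "unprotected private key file" warning, which is why this step matters.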
It just happens that the instances we've set up have a user called ubuntu, and that's who you're logging in as. Then you give an IP address, or in this case, as I said before, a domain name that's mapped to that IP address, to log in. During the course you'll sometimes be asked to copy files from the cloud instance to your computer. You can do this with a web browser by simply browsing to the IP address or domain name of your instance; you'll see all of the files already in the workspace of that instance. We haven't really done anything yet, so there won't be much there. At the end of the day, if you want to get out of your instance, you can simply type exit. So, just to finish up: if you were setting up these instances yourself, as opposed to just logging into ones Zhibin set up, we would ask you to stop your instances at the end of the day. This just saves money: when an instance is stopped you still pay a little to store the data, but you stop paying for the compute, because it's essentially shut down. Then the next morning we would ask you to start them again, remembering that stopping is different from terminating. This is where having distinct instance names comes into play: you would search the console, find your name, and look up your instance's IP so you could log in. The last time we ran this course, in New York, one clever student realized that since you were all given access to the same console, he could go in and change the names of the instances so that people wouldn't be able to find theirs the next day. Fortunately he used nicknames people recognized, and they were mostly politically correct, so it wasn't a total disaster. So at this point your machine is ready for the workshop.
If not, or if you're having trouble remembering how to log into your instance, remember that you can go back to the course wiki at bioinformatics-ca.github.io, find the RNA-seq workshop, and scroll down to the instructions for logging into AWS.