All right, so you're going to see these splash slides a lot: Creative Commons. In the first module we're just going to give you an introduction to the cloud, because almost everything we do in this workshop, I would say about 95% of it, is going to happen on compute instances that we've set up in the Amazon cloud. So we want to give you a little bit of background on the cloud, and we're also going to go through the practical exercise of getting you connected to the cloud during this lecture. These slides were produced primarily by Francis, and we've made some minor modifications. And obviously Sieben is really the guy who has helped set up the cloud in the past. This is an overview of the learning objectives for the whole course. We're going to follow a pattern where we do a lecture and a lab for each of these, I guess, four or five modules, and we're on module zero right now. The idea of the tutorials is that they provide a complete working example of an RNA-seq pipeline, up to expression and differential expression analysis, and alternative expression analysis as well. As Michelle mentioned, in order to get these pipelines to run in a reasonable amount of time with modest compute resources, we've created these dummy data sets, but the pipelines as developed should work on your own data with some minor additional setup. We hope the material is fairly self-contained, self-explanatory, and portable; we really strive for that. So in this module I'm going to introduce the idea of cloud computing, then we're going to go to the wikis (there are actually two wikis for this workshop, and we'll explain that), and then go into how to log into the cloud. As background, how many of you have seen this plot, or something like it, before? This is actually a common conference bingo thing now.
At genomics conferences you get the Moore's law slide, showing that the number of base pairs you can sequence per dollar with next-gen sequencing technologies has increased exponentially. Unfortunately, the message of this slide is that other necessary components, such as disk storage and compute power, have not scaled at the same rate as our ability to produce the data, and this creates analysis challenges. What we have is basically a doubling time for next-gen sequencing of around four months, so every four months you can produce twice the data for the same price, whereas it takes something like 14 months to double your storage. We have pretty much hit the thousand-dollar genome, late last year or this year depending on who you ask and how they charge. At the Genome Institute, where I work, it's about $1,100 to do a 30x whole genome. So we're already starting to think about the hundred-dollar genome, after many years of waiting for the thousand-dollar genome. We already talked about the doubling times, but the point is that at some point it's actually going to cost less to sequence a base pair again than to store it long-term. We're not quite there yet, but storage is a major, major cost. So what's a general biomedical scientist to do? We have tons and tons of data coming from experiments like RNA-seq, which we're going to be talking about in this workshop, and a lot of labs don't really have the IT infrastructure to support that kind of effort. You need significant compute resources, lots of storage, and also the expertise to use it, of course. Expertise can be trained, but only to some degree.
It's just not practical for every lab to set up its own storage and compute cluster. So where do they go? They can write more grants and try to get bigger hardware, but we would argue that an alternative is to go into the cloud. The particular flavor of the cloud that we're using is Amazon Web Services. They've provided a grant to support this workshop and to give you all access to the cloud during it. But they're not the only option; there are options from Google and others. The main components of AWS that we're going to use are called S3, which stands for Simple Storage Service, and EC2, for Elastic Compute Cloud. S3 is a large storage service where you can keep your data, and EC2 is the computer that you're running your analysis on. The idea is that it's ready-when-you-are high-performance computing: you spin up computers when you need them and shut them off or deactivate them when you don't. There are basically these huge, football-field-sized high-performance compute clusters at various centers throughout the world, and that's very expandable for Amazon. For our intents and purposes, it's infinite compute resources. Of course it's not free, but whatever you need to analyze, Amazon for sure has enough compute power to do it. As for the challenges of cloud computing, for me cost really remains one of the bigger ones. Pricing is getting better and better. It's quite a competitive sphere: as I said, there are Amazon and Google and other players competing in this space, and the prices continue to come down.
But if you're in an academic setting, at a university that gives you a sweetheart deal where maybe they pay for the electricity or for certain aspects of the facilities, you may find that it's still cheaper to do your analysis at home in your own institute, or through access to a shared compute resource like Compute Canada. I think people who work in the cloud industry would say that when you actually tally up all of the hidden costs, Amazon is very competitive and perhaps cheaper. There's no doubt that they have the power of scale and efficiencies that we can't really compete with at individual centers. I think it's very likely that we're going to be moving a lot of our analysis to the cloud, and there are going to be fewer of these homegrown solutions. It's not the best solution for everybody. There are potentially issues with standardization, and I think one of the biggest issues is personal health information. For those of you who work in the human health domain, that is a concern, right? You're putting your data up in this black box called the cloud, and you have to be concerned about security, about who can access that data, and about what happens if, I don't know, the NSA decides to go through all of Amazon's records. Is that a concern for you? That sort of thing. In the US we have several acts, like HIPAA, that pertain to this kind of information, and it's a very complicated sphere. It's not at all clear to me, from the people I've talked to, whether this has been resolved, but there are certainly hospital systems
I know of that are moving onto the cloud. Amazon is obviously working very hard to make sure that they're compliant with these laws and regulations, and ultimately it's up to you, or the people you hire, to make sure that your data is as secure or more secure on the cloud as it would be in a locked room in the basement of your hospital or institute. Do you think that, in some ways, with the genomic data it's still a question of policy being behind? You're right that there's nothing more personal, but at the same time I know that in the States we still haven't officially declared that genome sequence data is the same as telephone numbers and birth dates and other kinds of PHI. So I wonder, are you aware of anyone in Canada, for example, that allows their EHRs or EMRs to be put in the cloud? Yeah. Policies don't always make sense; I would agree with you, though. Okay. So, advantages of cloud computing; we talked about some of the challenges. One, of course, is that we received a grant from Amazon, so that's great for us to be able to provide this workshop without compute cost. It's extremely convenient. We don't have to lug computers here that we've set up in advance.
We do it all virtually. There are getting to be better ways of transferring large files, which in the past has been a challenge, and AWS now makes it free to upload files. There are a number of data sets that already exist in the cloud, so if you're using things like the 1000 Genomes data, you don't need to worry about how to get that uploaded; you can probably just find it being hosted somewhere in the cloud already. And there are many useful bioinformatics Amazon Machine Images. An AMI is kind of like a clone of a computer that you've set up for some specific purpose. There are examples like CloudBioLinux, which has been set up specifically for bioinformatics, so all the common bioinformatics tools that you would want to use have been pre-installed and set up for you. When you connect to an instance of that type, you basically have a powerful bioinformatics computer already set up for you. We actually have one for this course now, with the same kind of idea, so the specific instance type that you're going to be connecting to on this cloud is something you could later use in your own lab as your starting point if you want to do your analysis on the cloud. In this workshop, if you're not very comfortable with computers, if you're still new to informatics, it's going to take a little while to adjust, because you're using data and tools some of which are on your computer, some on the web, and some on the cloud. We're going to try to help you become efficient at moving between these different spaces and using what's best. The way we're going to interact with the cloud in this workshop is through the command line, so through a terminal application, which is basically just like connecting to your own Unix box if you had one in a closet or a server room somewhere. The only difference is that it lives in Oregon or Virginia, somewhere where Amazon sets up their
computers. We're also going to use the web browser, in a very limited way, to browse through data that's served on the web from your instances. There are more sophisticated ways of accessing the cloud through the web, such as Galaxy, but we don't do that in this workshop. Things that we've set up in advance for this course: all of the necessary data files, for the most part, have been loaded on an FTP server, and you're going to download them at various times through the lab components. Then we set up a Linux instance, which happens to be an Ubuntu instance, and loaded a whole bunch of software for RNA-seq analysis on it. Then we made a copy of that and created separate instances for everyone in the class, so each of you is going to be logging into your own instance of this RNA-seq-specialized Unix box. We're going to go through this, but on your name tag is a number, and that tells you the number of your computer. It's really important to use your number, because you're connecting to your own instance. If you connect to someone else's, you could both be trying to write the same file at the same time, or running commands that compete with each other and clobber each other, and believe me, it will cause a lot of problems. So remember your number. We've done all this setup for you, so it's convenient for the workshop to just be able to start and get into the RNA-seq analysis rather than spending a bunch of time with you having to learn the cloud. But I did want to very quickly introduce you to what that's like, so you have a general idea, if you want to use Amazon back home, of what the setup process is like. So I thought we would quickly go and look, if I can get this to work. Can you see that okay? This is the AWS Management Console, and that's where we set up these cloud instances. So if I sign into this console... You would have to create an account.
It's not really any different from setting up a Google account, and in fact, if you already have an Amazon account (and who doesn't have an Amazon account, right, for ordering stuff or watching videos?), you can use that. You just have to connect a credit card to it if you haven't already. Be careful with that, though, because you can start up a very powerful computer and forget about it, and then you get this crazy bill. It's happened to more than one of us, so buyer beware. You can see there's a huge number of resources and different services that Amazon provides. If you look through the list, you'll see the two that I talked about; these are really the only two that you need to worry about as a beginner. S3 is where you can set up storage, so you can buy terabytes, or even petabytes, of storage that you need for your data. You're not really buying it so much as renting it. And you can also set up these EC2 instances. If we go to EC2, this is the console that shows what I have running. At the moment I actually have an instance running in my account for another purpose. If we click on that, we can look in the console and see that there's this computer that I call GMS review. It's for a paper, where we set up a computer that runs some software that we want reviewers to be able to test out.
This is another really useful way of using the cloud. So that's running right now, and it's an instance of this type, which is actually quite a good instance, so this is probably costing us a fair bit of money. Again, we had a grant for that as well. You can get education grants to do workshops like this, and you can get research grants for proof-of-principle work. They're quite interested in genomics research, so I think they're pretty open to research grants in this area. But say we want to make a new instance. You would use this Launch Instance button, and here you can choose an existing AMI. Setting up an AMI is a little bit more advanced, so we would probably choose an existing one. You can search through the community AMIs, where people have created one of these Amazon Machine Images and made it public, and that's what we've actually done for this course. I don't know if I can remember the name of it, though. Maybe Cold Spring Harbor has one. Yeah, so here's one we set up for another course. Okay, so if you have the name, or part of the name, to search for, you can find a previously set-up instance and select it, and then it takes you through this wizard where you choose different options, like how much compute you need. There are all these different instance types, and they tell you how many CPUs they have, how much memory they have, and how much storage they come pre-configured with; you can always connect more storage to them. There are actually free instance types, so if you just want to play around, you can fire up an instance with this micro instance.
It's not very powerful, maybe about as good as your laptop, but for testing and demonstration purposes it's really useful, and it doesn't cost you anything. There are maybe five or six steps. You click through and choose some options, and for the most part you can go with the default settings. When you're done you launch, and it creates an instance. It kind of clicks and whirs; it takes a few minutes for that computer to get set up at Amazon. Then it'll say it's running, and at that point you can connect to it and use it. That's exactly what we're going to do in this course, but we've done this work for you in advance. So that was a very brief intro to the idea of setting up your own instance. At the end of this workshop there's a link to a very detailed tutorial on how you would do this, which is basically documentation of how we did it for this course, so if you want to set one up, there are instructions. There are prices for the instance types; it says right up front how much per hour each instance type costs. For storage, it tells you how much it costs per gigabyte. So you can do some back-of-the-envelope calculations, and there are also tools that will help you estimate. I find it a little bit challenging to accurately estimate how much it's going to cost. Right, you can set that up, and I have that set up now, thankfully. You can set up alerts and say, if my bill is over a hundred dollars, tell me; or just give me a summary every day of how much this is costing me.
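A back-of-the-envelope estimate along those lines might look like the following sketch. Every number in it is an illustrative placeholder rather than a real AWS price, and the variable names are made up for this example; plug in the hourly and per-gigabyte rates from the current pricing pages.

```shell
# All values below are hypothetical placeholders, NOT real AWS prices.
HOURS=24                  # how long the instance stays switched on
PRICE_PER_HOUR=0.50       # assumed hourly rate for the instance type (USD)
STORAGE_GB=100            # assumed amount of storage kept
PRICE_PER_GB_MONTH=0.10   # assumed storage rate (USD per GB per month)
MONTHS=1                  # how long the storage is kept

# The shell only does integer arithmetic, so awk handles the decimals.
COMPUTE_COST=$(awk -v h="$HOURS" -v p="$PRICE_PER_HOUR" \
  'BEGIN { printf "%.2f", h * p }')
STORAGE_COST=$(awk -v g="$STORAGE_GB" -v p="$PRICE_PER_GB_MONTH" -v m="$MONTHS" \
  'BEGIN { printf "%.2f", g * p * m }')
TOTAL_COST=$(awk -v c="$COMPUTE_COST" -v s="$STORAGE_COST" \
  'BEGIN { printf "%.2f", c + s }')

echo "Compute: \$${COMPUTE_COST}  Storage: \$${STORAGE_COST}  Total: \$${TOTAL_COST}"
```

Remember that the compute part is billed for every hour the instance is on, whether or not anything is running, so the HOURS value should cover idle time too.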
So they do have some pretty sophisticated tools to help you track what things are costing. But you need to be pretty proactive if you want to keep your costs down, like turning things off when you don't need them. Yeah, for normal-sized data sets. There you go, pricing. AWS could almost be its own course; there's a lot of material there. Malachi's is almost like an FAQ, and that's one of the questions: how much is it going to cost me? It goes through some examples. But if you go to, let's see... they do have it. So for example here... no, you can't see this. What happened? If you go to Amazon.com/ec2/pricing, you can see the cost per hour. So we could think, okay, let's say you have a tumor and a normal that you want to run through RNA-seq, each with one lane of HiSeq data, and you think it's going to take 12 hours for alignment and another 12 hours for expression and differential expression analysis. So maybe it's 24 hours times the per-hour price of one of these instance types. We'd probably want, let's say, at least eight CPUs. We could go through the exercise, but you can see roughly what the prices are there: from tens of cents per hour up to a dollar or so per hour for the really good instances. So think maybe ten to thirty dollars. But the devil's really in the details: sometimes it's very hard to predict how long an alignment will take, and especially some of the downstream steps that we're going to show you can be unpredictable, and sometimes you have to do it twice. Yeah, there is something like what you're talking about, where you can set up a job, and when some compute
becomes available, your process will run. But what we're talking about here is basically renting a computer. You start it, it's your computer, and no one else can use it. You pay for it while it's turned on, whether you're using it or not. So if you're not ready to run your jobs... No. Once you spin up that instance, that compute is yours to use as efficiently or inefficiently as you can, but you're going to get billed the same for it either way. Right, so that's the other way of going that I'm talking about. Okay, Michelle is giving me the frantic 'we're running out of time' signal. So for this workshop we're going to get you on the wiki. Has everyone already gone to the bioinformatics.ca wiki? If not, this would be a good time. Everyone has? I guess Michelle set that up before, so you should have your username and password. If for some reason you've had problems getting onto this wiki, let us know with your red sticker. On the main page you want to find Informatics for RNA-seq. Sorry, the wrong one is highlighted, but the RNA-seq workshop wiki is the one we're using. As I mentioned, there's a second wiki, so you may not have been here yet. If you go to www.rnaseq.wiki, that's where all of the actual lab materials are laid out. The course wiki provides you with the high-level information about the bioinformatics workshops in general, and links to this wiki and other resources, but when we're doing the labs we're going to be using rnaseq.wiki. This is something that Malachi and I set up, basically because we give a variation of this workshop at more than one venue, and we can't use the Canadian Bioinformatics Workshops wiki at some of those other venues, so we have a kind of third-party location for it. So, logging into Amazon. This you probably have not done yet, right? Okay. Is everyone on either Mac or Windows, or do we have some Linux users? Oh, we have some Linux users.
Okay. If you're on Linux, I'm guessing you're probably more familiar with this, unless you're borrowing the Linux computer, but it's going to be more similar to the Mac instructions than the Windows instructions. You're basically going to have to find your terminal application. If you're on the Mac, you want to go into Applications, then Utilities, find the app that's called Terminal, and start that. If you're on Windows, you want to start... what do we recommend these guys use? PuTTY, I guess. Yeah, so that was something that was provided in the pre-workshop instructions, I believe: that you install PuTTY if you're bringing your own Windows laptop. So at this time, can you either start up Terminal on the Mac or Linux, or PuTTY on Windows? And if you have problems, use your stickers. The terminal application is basically an application that allows you to interact with your computer, with the file system on your computer, through a command-line mode, where everything is text-based. It's the alternative to what we're more commonly using, which is a graphical user interface, right?
Like Windows or Mac, where it's the Start button or the Apple button, using the file browsers and things like that. Once you start doing bioinformatics, you realize you pretty much need to do things on the command line; there just aren't enough applications that have been made into graphical software. So in your terminal application, wherever you are, on the Mac or Linux at least, you may want to make a new directory, and you can do that with the mkdir command. This should have been something you reviewed if you did your homework before the course, your intro to Linux. We're just going to make a folder to put a file in. You can do that with mkdir, and in this example we're calling it the CBW folder. Then you can use cd to change into that folder, and ls to list the contents of that folder, which is going to be empty since you just created it. If you're on Windows, you can just create a directory on your desktop or in your home directory using the usual right-click, New Folder. Once you have this folder, we're going to go and save what's called a key file.
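On Mac or Linux, the folder-making steps described a moment ago might be typed roughly as follows. The lowercase folder name cbw is an assumption about how the CBW folder is spelled on the slide, and the -p flag is an addition here so the command does not complain if the folder already exists.

```shell
mkdir -p cbw   # make the folder that will hold the key file (name assumed)
cd cbw         # change into the new folder
ls             # list its contents; a freshly made folder shows nothing
```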
This is basically something that allows you to connect securely to these instances on the cloud. You have to have this key to connect, which is what prevents someone else out in the world from noticing that we have started these instances up and just grabbing onto them and using them for their own nefarious purposes, and believe me, people would do that. Move on to the actual RNA-seq job. So you're not going to do this right now, but just to give you an idea, you can copy files from AWS to your computer. As we're doing this, if for some reason you create a file that you want to keep and take home with you, you will need to save it to your own computer, because the instances that you're working in are going to be deleted after we're done with the workshop; we don't have money to just keep them running forever. There are a few different ways you can do that. One of the most convenient is that we've set up these instances to serve the contents of their file system to the web. So you can go in your browser to that same URL, cbw and your number, dot dyndns.info, and it will show you the home folder of your system. I don't think you'll see much there right now, because we haven't done much yet, but you could bookmark it. Yeah, it's really hard for me to do this without the mirroring mode, but if I go to cbw, and you said I was 49... there you can see what I have in my home directory. It's probably more than what you have, because other instructors have been using this instance. You can just browse through that and download files. Another option besides browsing is using the scp command.
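As a rough sketch, assuming a key file called cbw.pem (a placeholder name, since the real key file name is not given here) and student number 49, the download command could be assembled like this. The remote path workspace/nice_alignments.bam is a guess at the example file and folder, and the command is only printed, not run, so nothing is contacted over the network.

```shell
# Placeholders (assumptions): key file name, student number, remote path.
KEY_FILE="cbw.pem"
STUDENT_NUMBER="49"
HOST="cbw${STUDENT_NUMBER}.dyndns.info"       # instance host name pattern
REMOTE_FILE="workspace/nice_alignments.bam"   # example file; may not exist

# Pieces, in order: scp, -i key file, user@host:remote path, destination.
# The final "." means "the current directory on your own computer".
SCP_CMD="scp -i ${KEY_FILE} ubuntu@${HOST}:${REMOTE_FILE} ."

# Print the command so it can be checked before it is ever run.
echo "${SCP_CMD}"
```

Once the key file, number, and path are right, the printed line can be pasted into the terminal to do the actual copy.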
From your terminal application, at the command line, you can do something like this. It's a lot like the ssh command. You would say scp, then -i and your key file (everything is exactly like the ssh command), then your user, which is ubuntu, then your host name, which is cbw, your number, dot dyndns.info, then a colon and the path to the file you want to download, and then the destination, which in this case is just a single dot to indicate the current location on your computer. So this would go to Amazon, to the workspace folder, grab the nice_alignments.bam file, and download it to the current directory on your own computer. This file doesn't exist yet, so it'll just give an error; it's just an example. I should really learn to use those animations. Okay, so at this point your laptop should be set up for the workshop. If it's not, you know how to go to the main CBW wiki to get information on connecting to the cloud, and you know how to use that wiki for the workshop. All the lectures are there as well; I believe they're on both the main wiki and rnaseq.wiki, so you can find them in either location. You've read the pre-lecture material, right, guys? And you've done the pre-course homework, and you know how to log into AWS. This is the tutorial I mentioned. If you want to know a lot more detail about using the cloud, connecting to the cloud, and setting up an Amazon machine instance like what we set up for you, there are very detailed instructions there on rnaseq.wiki, along with all kinds of general questions, like understanding the different regions and how billing works, that we had questions about. So there's a lot of useful information there if you really want to understand more about the cloud. So I think we're on coffee break now. Michelle, is that right?