So, this slide was originally created by Francis Ouellette, and you're free to use it and modify it, but if you use it, you have to keep it open. This is the CBW email and the announcement mailing list. We encourage you to subscribe to this mailing list. It's very, very low traffic. I think Michelle only makes announcements several times a year, so it's an easy way to keep track of our workshops.
And again, my name is Zhibin Lu. I'm from the Princess Margaret Genomics Centre. Now I'm part of HPC4Health, and HPC4Health is a satellite site of Compute Canada. I also work part-time under Francis Ouellette here to maintain the bioinformatics core facility. Today I'm going to talk about cloud computing. In this talk, I'm going to give you a brief introduction to cloud computing. After this talk, you will be familiar with our wiki; all materials for the workshop will be in the wiki. I will also guide you through logging into the cloud. And if we have time, I'll show you the Amazon AWS Management Console, so after you go back home, if you have an Amazon account, you can launch your own instance.
So the first question is: why cloud computing? I think the main reason is that we are dealing with big data. Our data sizes are now reaching the petabyte scale. One petabyte equals 1,000 terabytes, and one terabyte equals 1,000 gigabytes. For comparison, a DVD movie is usually about four gigabytes. In our center, we bought one petabyte of storage last year, and we almost used it up in one year. So the data size is getting bigger and bigger, and it's easier for us to move our software to the data set instead of the other way around. When I learned programming, I was taught to read the data into memory, either into an array or a hash table, then process it with some algorithm, and finally write the results to a file. But nowadays, the data is always somewhere on the net, not local.
So you have to think about processing streaming data and also processing data in parallel.
This image was created by Lincoln Stein, who is the director of the Informatics and Bio-computing Platform here at OICR. This line shows the number of base pairs $1 could generate before next-gen sequencing. The doubling time for this number is about 19 months. And this line shows the amount of hard disk storage, in megabytes, that $1 can purchase. Comparing these two lines, storage was getting cheaper faster than sequencing, so we could store our data on hard drives. But when next-gen sequencing came out, the picture changed. This line shows how many base pairs $1 can generate with next-gen sequencing, and it is steeper than the storage line. So in the near future, we'll be out of storage.
So we are now talking about the $1,000 genome. That is basically just the reagent and wet lab cost. When we think about next-gen sequencing cost, we also need to think about storage and about the cost of analysis, like the salaries of bioinformaticians. The doubling time for the reduction in sequencing cost is in the range of months. The doubling time of storage and network bandwidth is in the range of a small number of years. And the doubling time of CPU speed is about 18 months. In the very near future, the cost of sequencing a base pair will equal the cost of storing a base pair.
So what can the general biomedical scientist do? We are facing a lot of data. And in many labs, even in institutes, even in some hospitals, they have very, very bad IT infrastructure. In the old days, the IT infrastructure was built for mathematicians and physicists, not for bioinformaticians. They have really powerful CPUs, but less storage. So what can they do? They can write more grants to buy more powerful hardware and more storage, but they can also look to the sky, and there's something interesting over there: the cloud. And some genomics companies are already there.
The typical sequencing company pipeline with cloud computing is this: they get the sample from the wet lab, they sequence the sample on some sequencer, they generate raw data, and then they ship the raw data to some cloud computing provider like Amazon. And they do all the hard work over there.
Also, most people are familiar with the cloud even if they don't know the term. If you use Google Docs or Dropbox, you're using the cloud. For example, you save your document on your laptop, and you can access it from your desktop or from your cell phone. So the storage is in the cloud. Netflix and Twitter also use cloud computing.
One example of cloud computing is from Amazon. It's called Amazon Web Services, and it's a collection of remote computing platforms. It's marketed as a service that provides large computing capacity more quickly and more cheaply than a client company building an actual physical server farm. The most famous services are the storage service, called Simple Storage Service, or S3, and the Elastic Compute Cloud, also called EC2. If you have an Amazon account, you can launch an EC2 instance within minutes. And if you have money, you can buy storage from S3. To the end user, the storage is almost unlimited, if you have unlimited money.
Amazon has servers across the world, in different regions: North America, Asia, Europe. Within each region, they have several zones. And each zone has compute resources the size of multiple football fields. Basically, they have these big containers; you just plug in the power, and they are ready to expand capacity.
But we have some challenges when we use cloud computing. It is not cheap. You pay for how much storage you use, and you also pay by the hour for your instance. For example, our instance type is called m3.xlarge, and we pay 28 cents per hour for each instance.
It doesn't seem so expensive, but multiply that by 24 hours a day, by how many students are in the class, by how many days. So, at the end of the workshop, we end up with a $1,000 bill. Another problem is, if after class you forget to turn off the instance, at the end of the month you can imagine your bill.
Do you store some of the cancer data on this public storage capacity? I'm sorry, what was it? Do you have here for the cancer people? So, do you store the data there? It's things that cancer data is being stored. It's a project that's being stored in the cancer data file. Is that what you're asking? Yes, yes.
Are there any competitors for this service? Yeah, I'm going to mention that. Amazon Cloud is just one of the cloud computing providers. Google has cloud computing services, Microsoft has them, and there are other providers. You can choose your own flavor.
Another problem is getting files to the cloud and getting the results back from the cloud. I haven't mentioned that yet. For a typical Illumina HiSeq run, if you do paired-end 100-base sequencing, within maybe two weeks you can generate about four terabytes of raw data. If you transfer four terabytes of raw data to the Amazon cloud over the internet, how long will that take? You can do the calculation.
Cloud computing might not be the best solution for everybody. For example, if you want to host a website on the Amazon cloud, the more people access your website, the more you need to pay. You have to pay when people download files from your Amazon instance. And also, as I mentioned, there are several cloud computing providers, and there are no standards. So if you have an instance on Amazon, you cannot just transfer it to another provider like Google. It doesn't work over there. You have to either start from scratch or do some conversion yourself. Another problem is that if you're dealing with personal health information, security is one of the concerns.
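The two back-of-envelope calculations mentioned above, the workshop bill and the upload time, can be sketched in a few lines of Python. This is only an illustration: the 28 cents per hour and the four terabytes come from the talk, while the class size, the number of days, and the 100 Mbit/s link speed are assumed values picked for the example.

```python
def workshop_cost(rate_per_hour, students, days, hours_per_day=24):
    """Total bill in dollars if every student's instance runs around the clock."""
    return rate_per_hour * hours_per_day * students * days


def transfer_days(size_terabytes, megabits_per_second):
    """Days needed to move the data over a link of the given sustained speed."""
    bits = size_terabytes * 1e12 * 8          # 1 TB = 10**12 bytes, 8 bits per byte
    seconds = bits / (megabits_per_second * 1e6)
    return seconds / 86400                    # 86400 seconds in a day


# Rate from the talk; 30 students and 5 days are assumptions.
print(f"Workshop bill: ${workshop_cost(0.28, students=30, days=5):.2f}")

# Data size from the talk; the 100 Mbit/s sustained upload speed is an assumption.
print(f"Upload time: {transfer_days(4, 100):.1f} days")
```

With these assumed numbers the bill comes out at roughly the $1,000 mentioned, and the upload takes several days even if the link is fully used the whole time.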
If you store your data in the cloud, it's just like a black box. You don't know where it is. Maybe in North America, maybe in Europe. So if you are asked where the data is, you cannot give a simple answer. And in the United States, they have the Patriot Act. So if you store your data in the US, you want to make sure you are comfortable with the fact that the US government can touch it without giving you notice. I don't think we are allowed to store our personal health information outside the country. I don't think we can do that. But I think Francis just mentioned that NCBI started allowing their users to use Amazon Cloud, maybe a couple of months ago. They started to use cloud computing. In fact, Amazon is more secure than most universities in the States. But still, you will have to go through all the paperwork to get permission to use cloud computing with personal health information.
We have been using Amazon for several years, and I think the service is getting better and better. I think the benefit of cloud computing is that it is scalable. You can launch your own instance within minutes. If you want to expand your local compute resources, you have to purchase your own hardware, you have to go through all the paperwork, and you have to have a system admin to manage your system.
So we do have some advantages with cloud computing. As I said, at CBW we have been using Amazon's cloud, AWS, for several years. It suits our needs very well. We give everybody in the room a separate instance. If you mess up your own instance, it's okay. Nobody else will notice. And after the workshop, we also make an AMI, which stands for Amazon Machine Image. So if you have your own Amazon account, you can log into your account, launch an instance based on our AMI, and you will have the same working environment within minutes. We also get a grant from Amazon.
So if we manage our expenses very well, we basically use the service for free. And Amazon is working hard to make it easier for users to transfer files to their service. In fact, if you contact them, you can even send a hard drive to Amazon. They just plug your hard drive into their server, and then your data is there and ready. Also, Amazon makes it free to upload files to their servers, but you have to pay for downloads.
And a lot of data sets already exist on Amazon, like 1000 Genomes. If you want to use the 1000 Genomes data, it is already there. There are also many useful bioinformatics AMIs, Amazon Machine Images, on AWS, like CloudBioLinux and CloudMan for Galaxy. And after our workshop, we create a consolidated AMI from our workshop. If you open an Amazon account, you can launch your own instance based on these very useful bioinformatics AMIs.
Is it free to access the 1000 Genomes data? Yeah. You don't need to pay. If you download that data from Amazon Cloud, you don't need to pay. I think they have a deal with Amazon. Amazon hosts the data for free for the 1000 Genomes project, or they pay less. But whether you want to download it to your local machine or to your Amazon instance, it's free for you.
Just like I said, there are many flavors of cloud available. You can choose your own flavor. In this workshop, we have some tools on your local computer, we also have some tools on the web, and we have a lot of tools on the cloud. We will help you become efficient at traversing these various spaces, finding the resources you need, and using what is best for you.
There are different ways to use the cloud. The most common way is to use the command line. You just log into the Amazon cloud like you log into your HPC cluster or UNIX box. Some AMIs have a GUI, so you can use a web browser to use their tools, like Galaxy. We're going to use Galaxy in this workshop.
When talking about big data, big data is really a relative term.
This image shows what a 5-megabyte hard drive looked like in 1956. Who can imagine what it will be like in 2056? I think Bill Gates predicted once that no personal computer would need more than 640 KB of memory.
So here is what we have set up. We loaded the data files to AWS, and then we brought up an Ubuntu instance. Ubuntu is one flavor of Linux. On that Ubuntu instance, we loaded a whole bunch of software for NGS analysis. Then we cloned it and made a separate instance for everybody in the class. We also simplified the security, so everybody will have the same login credentials to log into their instance. But when you go back home, you don't want to do this. You don't want to share your private key. You don't want to share your username and password with others. We also installed a web server on every instance, so your workspace is accessible to the world. But at home you don't want to ask your system admin to open your home directory to the world.
We are going to use SSH to connect to our instance on the cloud. SSH is an encrypted network protocol that lets users connect to a remote machine or server from their local machine, laptop, or desktop. The messages between client and server are encrypted. That means it's quite secure. But if a hacker sits between you and the server, there's a potential that he can get your data and decrypt it if he has the time and compute power.
So when the SSH client successfully connects to the server, the client keeps a copy of the server's fingerprint. When you try to connect to the server the next time, the SSH client will first ask for the server's fingerprint and compare it to your local copy. If they don't match, the SSH client will refuse to connect. So every time, you can be sure you are connecting to the real server. The only problem is the first time. When you first connect to the server, there's no local copy. So the SSH client will give you a warning that asks: do you want to accept this fingerprint from the server? Usually you say yes because it's the first time.
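The fingerprint check just described is often called trust on first use. The following Python sketch only simulates the logic with a dictionary standing in for the client's local store of known hosts; it is not how any real SSH client is implemented, and the host name and key bytes are made-up placeholders.

```python
import base64
import hashlib


def fingerprint(host_key: bytes) -> str:
    """SHA-256 digest of the key, base64-encoded without padding."""
    digest = hashlib.sha256(host_key).digest()
    return "SHA256:" + base64.b64encode(digest).decode().rstrip("=")


def check_host(known_hosts: dict, host: str, host_key: bytes,
               accept_new: bool = False) -> str:
    """Compare the server's fingerprint with the locally saved copy."""
    presented = fingerprint(host_key)
    saved = known_hosts.get(host)
    if saved is None:
        # First connection: there is no local copy, so the user must decide.
        if accept_new:
            known_hosts[host] = presented
            return "accepted"
        return "unknown"
    # Later connections: refuse if the fingerprint has changed.
    return "match" if saved == presented else "mismatch"


known = {}                                    # stand-in for the client's saved fingerprints
host = "server.example.org"                   # hypothetical host name
print(check_host(known, host, b"real-key"))                   # first time: unknown
print(check_host(known, host, b"real-key", accept_new=True))  # user says yes
print(check_host(known, host, b"real-key"))                   # same server again
print(check_host(known, host, b"other-key"))                  # someone else answers
```

The last call is the case where the client would refuse to connect, because the fingerprint no longer matches the saved copy.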
And to make it more secure, the SSH protocol implements public key authentication besides username and password. Username and password is not so secure. Public key authentication requires a key pair. One key is stored on the server side. It's called the public key. And one key is stored on your local machine. It's called the private key. The server and the client never exchange these keys, so no key is transferred through the network. A message is encrypted with the public key on the server side, and only the client with the private key can decrypt this message. And it's almost impossible to recreate the private key from the public key. So if your server gets hacked, your private key is still safe. You can also protect your private key with a passphrase. So if you lose this key, nobody else can use it without the passphrase. But the problem is that if you lose your private key, you cannot access the server. So everything depends on the key.
So in a few seconds, I will show you how to log into the Amazon cloud. But first I want you to open the wiki page. If you have a problem logging into the wiki, you can put up your red sticker and Michelle will solve the problem. Our wiki is the high-throughput sequencing one. So we can open the wiki, and this is our wiki page. And I'm going to guide you through logging into the cloud. So are we ready? Okay.
We need an SSH client to connect to our instance on the cloud. For Linux users, I suppose you know how to open a terminal. For Linux and Mac users, we need a program called Terminal to connect to a remote server. For Mac users, if you don't know how to open Terminal, you can go to Applications and then click Utilities. There's a black screen icon called Terminal. You can open that. For Windows users, I think we asked you to install a program called PuTTY. This is the SSH client for Windows users. And I just got a warning from Compute Canada last week: there's a Trojan version of PuTTY available on the net.
So make sure you download PuTTY from the link we sent you. Don't just go to some search engine and choose the top hit. In fact, for most software, you want to make sure you download it from the original source. Don't just go to a search engine and download software.
For Mac and Linux users, when you open the terminal, you are located in your home directory. If you do a pwd, which means print working directory, you will see something like /Users/your-username on a Mac, or /home/your-username for Linux users. If you do ls, you see all the contents of your home directory. I want you to make a directory for this workshop. You make a directory called cbw with mkdir cbw. Then we can go inside this directory with cd cbw. And then we do ls -la to see the contents of this directory. It's supposed to be empty. There are two special entries: the dot means the current directory and dot dot means the parent directory. These two entries are always there in every directory on Mac and Linux.
For Windows users, you can just open Windows Explorer, go to the desktop, right-click some white space, and you will see this pop-up menu. Choose New Folder and give the folder the name cbw. If you have a problem, just use the red sticker.
In this workshop, we are going to use public key authentication, so you need your private key. The private key is on the wiki page. Mac and Linux users download the key for Mac and Linux. You right-click this certificate link. If you are on a Mac and you don't have a right mouse button, you can just use ctrl-click. Then choose Save Link As and save this file into the directory we just created, cbw under your home directory. This file is supposed to be called cbwny.pem. For Windows users, we want you to download the Windows certificate. You right-click this certificate link, choose Save Link As, and save it to the cbw directory we created on the desktop.
This file will be named cbwny.ppk. If you attended a CBW workshop before, you have to re-download this key; the key is different for every workshop.
For Mac and Linux users, if you do ls -la, you will see the key file you just downloaded, called cbwny.pem. This is the long listing from ls. The first part is the permissions. You can see this file is readable by the world. For a private key, we need to change the permissions. So we do chmod 600 cbwny.pem. After you do that, you do ls -la again. You can see that now the file is only readable and writable by yourself, the owner. Windows users don't need to do this part.
A few words about Linux permissions. When you do a long listing, the first position is usually a d or a dash. The d means this file is a directory. By the way, a directory is a special file on Linux. The rest are the permissions. The first three letters stand for the owner's permissions. The next three stand for the group's permissions. And the last three stand for the world's permissions. So if we change the permissions to 600, that means read and write for the owner only. So we make this key private to yourself.
Now we can use this key to connect to our server. On Mac and Linux, we use a program called ssh. This is the SSH client for Mac and Linux. We just type ssh and -i cbwny.pem. The -i tells ssh to use this cbwny file as the private key. For Windows users, we click the plus sign beside Connection, click the plus sign beside SSH, and click Auth. Then click the Browse button and find the file you just downloaded. The file name is cbwny.ppk. And then click OK. Okay, if you have a problem, use your red sticker.
Next, we tell the program the username. We use the same username, ubuntu, for everybody in class. So for Mac and Linux users, you just type a space and then ubuntu. For Windows users, you click Data under Connection, and then in the box beside Auto-login username, you type in ubuntu. And then we tell it our server name.
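As a small illustration of the permission string explained above, Python's standard stat module can render a numeric mode the same way a long listing does. The 644 mode for the freshly downloaded key is an assumption (the talk only says the file is readable by the world), and the helper names are made up for this sketch.

```python
import stat


def listing_string(mode: int, is_directory: bool = False) -> str:
    """Render a numeric mode like the first column of a long listing."""
    file_type = stat.S_IFDIR if is_directory else stat.S_IFREG
    return stat.filemode(file_type | mode)


def split_permissions(listing: str) -> dict:
    """Split the ten-character string into type, owner, group, and world."""
    return {
        "type": listing[0],       # 'd' for a directory, '-' for a regular file
        "owner": listing[1:4],
        "group": listing[4:7],
        "world": listing[7:10],
    }


before = listing_string(0o644)   # assumed mode right after download
after = listing_string(0o600)    # mode after chmod 600

print(before, split_permissions(before))
print(after, split_permissions(after))
print(listing_string(0o755, is_directory=True))
```

After the change to 600, the group and world fields are all dashes, which is what makes the key private to its owner.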
Each student in this class will have a separate instance on the cloud. So right after ubuntu, Mac and Linux users type @cbw, then replace the number sign with the number on your badge. Everybody will have a different number. Make sure you use your own number. So it is cbw, your number, and .dyndns.info. For Windows users, you go to Session, and on the right side there's a Host Name (or IP address) box. In this box, you type in cbw, your number, and .dyndns.info.
And then you want to save this session. For Windows users, you can type cbw under Saved Sessions, and then click Save. For Mac users, we cannot save the command, but we can keep the terminal in the Dock. You right-click the Terminal icon, choose Options, and then Keep in Dock. Next time, you can just click this icon and you will find the terminal easily. So for Windows users, you can just double-click cbw from now on, and you will be logged into your instance. For Mac users, you just hit enter.
Remember, I talked about the first time you connect to a server. You will be warned that you don't have the server's fingerprint on your local machine. You need to accept the server's fingerprint the first time. Okay, if you are logged into your instance on the cloud, you can take your coffee break. If you have a problem logging into the cloud, just use the red sticker.
As I mentioned, we have a web server installed on each instance, and the workspace in the home directory is accessible through the HTTP protocol. If you open your browser and type in cbw, your number, .dyndns.info, you will see the contents of your workspace. Can I try that? You will not see all these directories, but you will see maybe module 3, module 7. If you can see the contents over HTTP, can you put up your green sticker? Okay, thank you. So, you should have a red sticker to sign it in. Oh, you are done. Yes. I think, did you hear me on me?
Because I was like, it's everything, because no, because I had a different version yesterday. I overwrite everything, so I won't let you in until, because I'm paying for you. So, if you started yesterday, then I'm definitely paying for you for every hour that you are open. Okay. So, I'm not paying for you. This is 50,000 stars. And I will be paying for you tomorrow.
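For Mac and Linux users, the login command built up piece by piece earlier in this session can be sketched as a short Python helper. The key file name, the username ubuntu, and the cbw-number.dyndns.info host pattern are taken from the talk; the helper function itself and the badge number 7 are only illustrative, and the code prints the command without running it.

```python
import shlex


def ssh_command(key_file: str, user: str, badge_number: int) -> list:
    """Assemble the ssh command line from the pieces described in the talk."""
    host = f"cbw{badge_number}.dyndns.info"   # each student has their own number
    return ["ssh", "-i", key_file, f"{user}@{host}"]


# Badge number 7 is a made-up example; use the number on your own badge.
command = ssh_command("cbwny.pem", "ubuntu", 7)
print(shlex.join(command))
```

Building the command as a list keeps each argument separate, so it could be handed to something like subprocess.run as is; shlex.join (Python 3.8 or later) is only used here to show it the way it would be typed in a terminal.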