My name is Xibin Lu, and I'm a system administrator at HPC4Health. I also do data analysis for next-gen sequencing data, and I work part-time under Francis at OICR to maintain AWS for the CBW workshops. Again, these slides were created by Francis Ouellette. Feel free to use any of the slides, but since we believe in open source, if you use any of them, we ask that you make your slides open as well. Today, I'm going to briefly talk about cloud computing and guide you through logging into our Amazon instance.

So the first question is, why cloud computing? I think the main reason is that we are dealing with huge amounts of data. Our data sets are reaching petabyte scale. For example, a typical Illumina HiSeq paired-end run gives you about 500 gigabytes of raw data. So it's a challenge to transfer, store, and process this much data. The data will not be on your local laptop or desktop. It will be somewhere on the network. Compared to the raw data, the software or tools you use to handle it are small. So it's easier to move your software or tools to the data than the other way around.

When I learned how to program, I was told the general procedure is this: you have your data, you load it into memory, and you store it in an array or a hash table. You process your data, and after you finish, you write your result to a file or print it on the screen. But now we are dealing with huge amounts of data, so maybe we need to think about processing streaming data. A very simple example: if you want to sum a list of numbers, you can load all the numbers into memory and then add them up one by one. Or you can think about it another way: load one number at a time and add it to a running total in a variable. So when dealing with a huge amount of data, remember that memory is always an issue.

This graph was created by Dr. Lincoln Stein, who is a scientific director of OICR.
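Going back to the summation example for a moment, the two approaches might be sketched in Python as below. This is only an illustration, not code from the workshop: the function names and the one-number-per-line input format are assumptions made for the example.

```python
import io
from typing import Iterator, TextIO


def read_numbers(handle: TextIO) -> Iterator[float]:
    """Yield one number per non-empty line; earlier lines are not kept."""
    for line in handle:
        line = line.strip()
        if line:
            yield float(line)


def sum_in_memory(handle: TextIO) -> float:
    """Load every number into a list first, then add them up."""
    values = list(read_numbers(handle))  # the whole data set sits in memory
    return sum(values)


def sum_streaming(handle: TextIO) -> float:
    """Keep only a running total; memory use does not grow with the input."""
    total = 0.0
    for value in read_numbers(handle):
        total += value
    return total


if __name__ == "__main__":
    # A tiny in-memory "file" stands in for a large file on disk.
    data = "1\n2\n3.5\n"
    print(sum_in_memory(io.StringIO(data)))
    print(sum_streaming(io.StringIO(data)))
```

Both functions return the same total. The difference is that the streaming version holds one number at a time, so it would behave the same way on an input far larger than the available memory.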
In the graph, the yellow line shows how many base pairs we could sequence for $1 before next-gen sequencing. The blue line shows how much storage we can purchase for $1. So storage was cheaper than sequencing, and we had no problem storing our sequencing results. But that was before next-gen sequencing. When next-generation sequencing came along, things changed. Sequencing became cheaper and cheaper. The red line shows how many base pairs we can get for $1 with next-gen sequencing. Sequencing is becoming the cheaper of the two, so some day we will run out of storage: we will not be able to store all the sequencing results.

We now have about 1,000 genomes. But when we talk about cost, we also need to think about storage and about how much it costs to process the data. The doubling time for the reduction in sequencing cost is in the range of many months. The doubling time for storage and network bandwidth is in the range of a very small number of years. And the doubling time of CPU speed is about 18 months. So the cost of sequencing a base pair will equal the cost of storing a base pair within the next very small number of years. So there's always the question: do we want to store our sequencing results, or do we just re-sequence our sample when necessary?

So we are dealing with a lot of data, and usually the IT infrastructure in an institute or hospital is quite poor. What can we do? We can certainly ask for more money to buy bigger hardware. Or you can look to the sky, and there might be a solution up there: cloud computing.

This is a typical pipeline for using cloud computing to process your sequencing data. You prepare your sample in your lab, and then you sequence it at a genomics center. You get your sequencing results, and then you ship or transfer your data to a cloud computing platform like Amazon. Then you do all the heavy-duty work over there.

Cloud computing is not new to us.
If you have ever used Google Docs or Dropbox, or watched a movie on Netflix, you have used cloud computing. And recently, Illumina joined the cloud computing family.

Amazon AWS, which stands for Amazon Web Services, is a cloud computing platform that provides computing on demand. They have a lot of services, but the two most important are storage and high-performance computing. The storage at Amazon is called Simple Storage Service, also known as S3, and it is object storage. If you have an Amazon account and enough money, the storage is effectively infinite. AWS also provides the Elastic Compute Cloud, called EC2. This computing is charged per hour. So the high-performance computing is already there if you have an Amazon account. Amazon has multiple football-field-sized HPC facilities throughout the world. I think they have 12 regions currently. And this infrastructure is very easy to extend. They have these huge containers: they just plug in the power, and it's ready to expand.

When we do cloud computing, there are some challenges. First, it is not cheap. We are using m3.xlarge instances in this workshop, which cost us about $0.30 per hour each. That doesn't seem so expensive, but you also need to consider the memory, the storage, and the network transfer. And if you multiply by 24 hours per day, several days, and 30 students, it ends up being a big number. In fact, the first time we used Amazon cloud computing, we forgot to shut down all the instances after the workshop, and at the end of the month we received a very big bill. That was our first mistake with the Amazon cloud.

Because we are dealing with huge amounts of data, transferring data is a challenge. How do you transfer your data to the Amazon cloud and transfer the results back? You need to think about that. Amazon makes it free to transfer data into the Amazon cloud, but when you download data, you need to pay. So this may not be the best solution for everybody.
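To put rough numbers on the instance cost mentioned above, here is a small back-of-the-envelope sketch in Python. The $0.30 per hour rate and the 30 students come from the talk. The five-day workshop length and the 30-day "forgot to shut down" scenario are assumptions for illustration only, and storage and data-transfer charges are left out.

```python
def instance_cost(hourly_rate: float, instances: int, days: float,
                  hours_per_day: float = 24.0) -> float:
    """Compute-only cost in dollars for instances left running all day."""
    return hourly_rate * hours_per_day * days * instances


HOURLY_RATE = 0.30   # dollars per instance-hour, as quoted in the talk
STUDENTS = 30        # one instance per student, as in the talk
WORKSHOP_DAYS = 5    # assumption: "several days" taken as five
FORGOTTEN_DAYS = 30  # assumption: instances left running for a month

if __name__ == "__main__":
    workshop = instance_cost(HOURLY_RATE, STUDENTS, WORKSHOP_DAYS)
    forgotten = instance_cost(HOURLY_RATE, STUDENTS, FORGOTTEN_DAYS)
    print(f"Workshop, {WORKSHOP_DAYS} days: ${workshop:,.2f}")
    print(f"Left running, {FORGOTTEN_DAYS} days: ${forgotten:,.2f}")
```

Under these assumptions the workshop itself comes to roughly $1,080 in compute time, and a month of forgotten instances to roughly $6,480, which is why shutting everything down afterwards matters.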
If you host a website on the Amazon cloud, for example, the more people access your website, the more you need to pay. Another problem is that cloud computing has no standard. If you have an instance on Amazon, it's not easy to move it to another cloud computing provider.

A big challenge for us is patient data. When you're dealing with patient data, you need to think about security, because the data is in Amazon's cloud, and you want to make sure your hospital allows you to upload data to Amazon. Amazon is also a US company. If you want to use the Amazon cloud, make sure you are comfortable with the US government having the ability to look into your data.

But there are advantages to cloud computing, and that's why we use the Amazon cloud for this workshop every year. We received a grant from Amazon, so this workshop is supported by an AWS Research Grant award. In this class, we give everybody a separate instance. So if you mess up your own instance, nobody will know, and we can just give you another one. As I said, for transferring large files, Amazon makes it free to upload data. If you have a huge amount of data, you can contact them directly. Sometimes you can even ship your hard drive to Amazon with FedEx or something. The next day, they plug in your hard drive and your data is ready to use.

A lot of data sets are already on AWS. Amazon cloud computing is becoming more and more popular, so a lot of people are using and sharing data on AWS. For example, the 1000 Genomes data is already there, and some of the ICGC data is already there. There are also many useful bioinformatics AMIs, or Amazon Machine Images. You can launch your instance based on these images, and all the bioinformatics tools are already installed for you, so you can just start using the Amazon cloud. For example, there is CloudBioLinux as an AMI, and there is CloudMan, which is a Galaxy Amazon Machine Image: you can just launch it and Galaxy is ready to use on AWS.
We have talked about AWS, but Amazon is not the only one that provides cloud computing. There are other providers to choose from. For example, Google and Microsoft both provide cloud computing.

In this workshop, you will have some tools on your computer, some tools on the web, and some tools in the cloud. When you work with your data, you need to think about which working environment is better for you. We will help you move among these various spaces. At the end of the workshop, you can make your own decision about which platform you are going to use.

There are different ways to use the Amazon cloud. The general way is to use the command line, just as you would on a powerful Unix box or an HPC cluster. Some AMIs also provide a web interface. With Galaxy, for example, you can use a web browser to access the cloud computing platform.

When we talk about big data, "big" is really a relative term. In 1956, this is what a five-megabyte hard drive looked like. And now we have an external hard drive with five terabytes of storage capacity, which is one million times more, in just this size.