Hello everyone, my name is George Mihaiescu. I'm a cloud architect with the Ontario Institute for Cancer Research (OICR), and today I'm going to cover how you can work in open source cloud environments like OpenStack, and especially in the Cancer Genome Collaboratory, which is a cloud computing environment built especially for cancer research, so it's customized for this purpose. We will cover a little bit about the evolution of computing technology and some differences between classic high-performance computing environments and the newer cloud technologies. I'm going to cover OpenStack and its components, and its similarities with public cloud solutions like Amazon Web Services, Google Cloud, and Microsoft Azure. Then I'm going to cover the Cancer Genome Collaboratory project, its goals and history, and I'll also cover Docker containers and how they differ from virtual machines.

Computing evolution: a few years ago it used to take months or weeks for a developer, or any computer user, to have a new server ready for a project. Servers had to be ordered, delivered, racked, and installed; the IT team had to install the operating system, the networking team had to connect the cables, the security team had to secure it. It was a very time-consuming process. Over time this was reduced to maybe weeks or days, depending on how fast the IT team was moving, but it still was not as agile as developers and users wanted. Then cloud came, and somebody with a credit card could just create an account on Amazon and have a server up and running in mere minutes, so very, very fast. But still it was not good enough. Container technology was then made popular, which means that starting a Docker container with an application running takes only seconds, so even faster provisioning as well as lower overhead than a virtual machine. And as the technology progresses, there are now Lambda functions, or serverless, where you don't even have a container or a virtual machine up and running, waiting for requests. You just upload your functions and the cloud provider executes them when a user sends a request. You pay just for the CPU cycles used by your function and the time it took to execute. So for example, instead of having a web server running 24/7 to serve maybe five requests a day, you can have the requests served only when they come in, and you only pay for the time it takes to execute those queries.

Characteristics of high-performance computing, which is probably what most of you are used to in your organization's environment: they are shared, and they are very fast most of the time, depending on when they were deployed. They have a fixed configuration, meaning the same operating system, the same CPU type, usually the same networking, and usually shared storage, so if files are uploaded to the head node they are available on all the compute nodes where your code gets executed. But you do not have root access on the worker nodes. Usually no containers are allowed, because it's not considered a very secure technology. And usually there is no access from the internet without going to the IT team and asking them to open up the firewall to allow the traffic, which they usually don't want to do.
You also don't have a choice of kernel version, library versions, or operating system, and most of the time there is no cost recovery from the end users. So the end users sometimes leave their files sitting around because they don't pay for the storage, or leave jobs running for a long time because they don't pay for the CPU, so it can be very inefficient.

Cloud computing, on the other hand, is shared in the sense that there is a pool of hardware that is shared between the users of the cloud environment, different projects, different users. It is flexible; the cloud provider can reconfigure the servers. And there is usually no shared storage, because shared storage means that if the shared storage breaks, everything breaks. In cloud computing, the idea is to have a very scalable architecture with no single points of failure: if you have a hardware failure, it impacts a very limited part of the cloud instead of taking everything down. The users have root access in the virtual machines that they launch. They can run Docker containers inside those virtual machines. They can choose whatever kernel version and operating system version they want, so if they want to use CentOS or Debian or Ubuntu or anything else, they can do it. They can also control access to their virtual machines, so if they want to serve a web application on port 9000, they open up the firewall themselves. It's a shared responsibility model: the cloud provider takes care of the APIs, protects users from attacking each other, and protects the infrastructure, but the users are responsible for protecting the virtual machines that they launch. It has a small failure domain, as I said. It's built with commodity hardware, which is cheap, and you can get a lot of it, so when you have a failure, which is unavoidable at scale, that failure impacts just the workloads that were running there instead of impacting everybody. And there is cost recovery, so users feel the impact of their own actions and have to be conscious about the resources they use: clean up after themselves, shut down or terminate instances that are no longer in use, delete unused files.

You have probably heard of the very well-known public cloud providers, like Amazon Web Services, which is the largest in the world right now; they started eight or nine years ago. Then Google, probably in second place together with Microsoft Azure. But on the private side of cloud technology, there are a few projects that offer open source versions of cloud computing: CloudStack, OpenNebula, OpenStack. OpenStack seems to be the main leader now in open source public and private cloud computing. So if you want to deploy cloud computing yourself and you don't want to go and pay Amazon or Microsoft, OpenStack is the main choice you have without having to pay anything. Some of the very large users that publicly run OpenStack environments are AT&T, Verizon, Comcast, and companies like Walmart and Best Buy. They decided to build their own cloud computing environments because they didn't want to pay their competitor: Amazon basically competes with Best Buy and with Walmart, so they didn't want to hand Amazon their IT dollars and help it compete with them.
On top of that, they have much more flexibility in building their own environments, they have the freedom to innovate, and they cut their costs with external IT providers. And yes, the software itself has no license cost. Of course, you can buy OpenStack support from companies that offer it, like Red Hat, Canonical, or IBM, but, as we did, you can also deploy it yourself: download the software, install it, and if you want to change it you have access to the source code, it's written in Python, you can customize it. So it's a very viable alternative to paying an external provider.

And OpenStack is used heavily in research. CERN, the physics institute in Switzerland, has 190,000 cores running on OpenStack, and in the next six months they are deploying another 100,000 cores, so very, very large environments use OpenStack. DKFZ, the cancer research center in Germany, is an OpenStack user with a large cluster. EBI in London is an OpenStack user, ETRI in Korea as well, and OICR has two large OpenStack clusters. Many others have decided to use OpenStack for their research environments, most of them in parallel with HPC. So HPC is still a viable alternative, but as users demand more freedom, root access, and other operating systems, these organizations decided to also offer a cloud environment with self-service and sometimes cost recovery, depending on how it is deployed. So in your careers, as you get to work in different places, the chances that you'll find an OpenStack cloud environment are pretty high, so it's a good technology to learn. And as you will see in the lab, it's not very different from Amazon or Google: the concepts are the same, what differs is the user interface and the command line.

I'm going to cover a little bit about OpenStack. Again, as I said, it's a free and open source software platform. It was developed initially by NASA and Rackspace. In 2010, NASA was developing a solution for storing their large image library, and Rackspace, which is a cloud and data center provider in the US, was developing a technology to start virtual machines, orchestrate them, and offer self-service to users. So they met and said: we have this project and have already made progress on this part, and you have the other one; we want what you have, you want what we have, so let's make an open source project and put the code together. And because it's written in Python, it was very easy for a lot of other developers to join the project and start committing code. It has a release cycle of six months, which means that every six months there will be changes. That is a bit hard on operators, because once you have it in production, upgrading can be difficult; but if you don't upgrade, you will be six months, a year, a year and a half behind the current version, which makes it even harder. So it's not easy for IT teams to deploy it and keep it running, but the alternative is to basically send the IT budget directly to Amazon and not have any IT in-house, and if Amazon decides to increase its prices or do things differently, you are basically at their mercy. So it's a very interesting technology for an IT department to work with. It's challenging, but you can look behind the scenes and see how it is built, you can see its inner workings and make changes. It's fun to work with.
OpenStack started with just a few projects: virtual machines, which is compute basically; networking, like a virtual private cloud in Amazon, where you can have your own networks and IP addresses and everything else; block storage; and so on. Now there are more than 19 projects in OpenStack, trying to mimic what Amazon and the other public cloud providers offer. Amazon has probably 100 services, everything as a service basically; OpenStack is not that advanced, but it has the core services that allow other developers and other companies to come and add new platform-as-a-service offerings on top. As I said, it's a completely open system. This is a diagram, a simple one with just a few services, showing how the user can access the cloud environment over the internet and can either hit the dashboard, which is a portal similar to the AWS console, and use the browser to launch virtual machines, create networks, and do everything they have to do; or they can talk directly to the APIs that the portal itself talks to when it executes user requests. So you can have an OpenStack client with credentials set and start VMs without ever having to use a browser.

[Question: can you define API?] An API, in the case of OpenStack, is a web service that receives RESTful calls, like GET and POST requests, and translates those HTTP requests into actions that happen in the back end. [Like asking another server to look something up and send it back?] Yes. And once you have this API, end users can build other applications: you can have your own dashboard, a web page with different pictures and different buttons, and when you click on them they make a GET request to this address with this content. That's why all the companies now say, hey, we provide open APIs, or we give you access to our APIs: it means they let the end user innovate even more. Twitter, for example, has an API that allows users to get tweets in bulk without having to scroll through the whole feed; you can make an API call, give me 100,000 tweets that contain this keyword, and get back a very large chunk of JSON that you can then filter to create a report and so on. So that's the benefit of APIs: they allow more innovation than the company that owns the API could do alone. They want to create an ecosystem, they want to get more features by involving the end users in the development of new features. OpenStack services have a similar structure.
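To make the idea of an API concrete, here is a minimal sketch of what such a RESTful call can look like in Python. The endpoint address and token value are hypothetical placeholders, not real Collaboratory addresses; the general pattern of sending a GET request with an authentication token and getting JSON back is how the OpenStack Compute API is typically used.

```python
import requests

# Hypothetical OpenStack-style compute endpoint and token (placeholders, not real values)
COMPUTE_API = "https://cloud.example.org:8774/v2.1"
TOKEN = "gAAAAAB-example-token"

# List the servers in my project: a simple GET request with the auth token in a header
resp = requests.get(
    f"{COMPUTE_API}/servers",
    headers={"X-Auth-Token": TOKEN, "Accept": "application/json"},
)
resp.raise_for_status()

for server in resp.json()["servers"]:
    print(server["id"], server["name"])
```

The dashboard does essentially the same thing behind the scenes when you click a button in the browser.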
[Question about OpenStack's business model.] I'm not sure I heard the question exactly. OpenStack is not a business, it's an open source foundation, so it doesn't have a business model as a company. There are companies who offer OpenStack-based services: Red Hat, for example, has an OpenStack distribution where you can buy support from them; they will install it for you, you use it, and if you have a problem you call them and they fix it. [But then why do they go to the trouble of hiring people and paying them to create all this, if everything is free? I guess you pay afterwards, when you create your own instances and have a lot of live data, you pay someone.] Yes, that's where the end users pay, exactly. But the software itself is free: a research institute that wants to deploy OpenStack doesn't pay anything. They go and buy the hardware, install Ubuntu, install OpenStack, and start using it. If they want to charge their local users for the resources that they use, they can; if they don't, because they have another model of financing the IT department, through grants or otherwise, then they don't. But OpenStack is not software that you have to pay for; it's not like Microsoft Office or Salesforce or other software that you pay per seat.

This is the OpenStack dashboard. It's a fairly simple reference implementation of a portal that comes with OpenStack and is used by most OpenStack deployments. But as I said, companies that build on it customize it: Rackspace, for example, has a cloud environment based on OpenStack where they offer virtual machines and other services like Amazon does, and their portal looks different because they customized it. If you install OpenStack, you can install the base implementation and you have a portal ready to go. If you want it to do more, you have to build that yourself, but it comes with a ready-to-use dashboard.

OpenStack Cinder is the project that offers end users block volumes. If you are familiar with Elastic Block Store in Amazon, where you can create a volume, attach it to an instance to give that virtual machine more space, and if the virtual machine dies the data stored on the volume survives, it is the same concept in OpenStack. Your virtual machine needs more space, so you create a volume of, say, 100 gigabytes. You attach it to your virtual machine, you log in, and you see a raw disk attached as /dev/sdb. You can partition it, format it, mount it, and start using it; you put your data there. Tomorrow something happens to the server where the virtual machine was running and you lose your virtual machine: the data on the volume survives. You create another virtual machine and attach the same volume; this time you don't have to format or partition it, because it already has a file system with your data on it, and you are good to go.

OpenStack Neutron is the project that offers software-defined networking, allowing you to create complex networking scenarios. There are plugins for many commercial networking vendors like Cisco and Juniper; basically all the vendors decided to provide a plugin that allows their devices to be controlled by OpenStack. Why? Because Amazon and Google and Microsoft are not buying hardware from Dell and IBM and HP; they go directly to factories in Asia and build custom servers. In the same way, they don't go to Cisco and Juniper and Brocade and the other networking vendors to buy switches and routers and firewalls; they build custom gear in Asia. All these manufacturers decided to get involved in OpenStack in order to give companies an option to build their own cloud environments while still buying hardware from them. Best Buy can still go to Dell and buy servers, install OpenStack, and control those servers easily, and still go to their vendor Cisco, buy switches, and control them with OpenStack. The entire industry basically rallied to offer a free software solution for building cloud environments, so that not all the IT budgets go to the three big public cloud providers, and this also gives end users a negotiating tool. They still go to AWS, but they also have an in-house environment, so they can go to AWS only when they have to burst: Best Buy has a sales day like Black Friday when their in-house capacity is not enough to serve all the traffic.
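Going back to the Cinder discussion above, here is a rough sketch of what creating and attaching a block volume looks like with the openstacksdk Python library. The cloud name, server name, and volume name are placeholders; the partitioning, formatting, and mounting still happen afterwards inside the VM.

```python
import openstack

# Connect using credentials from clouds.yaml; "collaboratory" is a placeholder cloud name
conn = openstack.connect(cloud="collaboratory")

# Create a 100 GB block volume (the Cinder part) and wait until it is available
volume = conn.create_volume(size=100, name="analysis-data", wait=True)

# Attach it to an existing instance; inside the VM it shows up as a raw disk such as /dev/sdb
server = conn.get_server("my-analysis-vm")
conn.attach_volume(server, volume, wait=True)
```

The same steps can be done from the dashboard or the command-line client; the API underneath is the same.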
They have servers in Amazon, but only for a few days; most of the year they serve the traffic from the in-house environment, and the same goes for Walmart and everybody else. If they didn't have this, they would have to pay the public cloud providers all year long. It's the same with research organizations: if they have a request to do an analysis and don't have enough capacity, sure, they can go to Amazon and pay as they go, and that's a very good solution if you have the money and it's a one-time thing. But what if you have to do this more and more often? At some point you say, it's better to do this ourselves so we don't have to pay so much every time, because it looks cheap if you look at one hour, but if you multiply by 24 hours and 365 days, the cost of a server in the cloud is many times what it costs to run it yourself.

So, as I said, cloud has extended functionality like software-defined networking, block storage, and object storage, which is like Dropbox or Google Drive, where you can upload your files and download them over HTTPS.

Cloud-init is an invention, I believe, of AWS, because they started the whole cloud computing revolution, and it's a software package that comes pre-installed on most Linux cloud images. What it does, when the virtual machine starts for the first time, is resize the file system of the virtual machine to use the entire disk that the virtual machine was given. The VM could have a disk of 20 gigabytes, but the image it starts from could be very small, maybe 200 megabytes; when the user logs in for the first time, he is going to see a disk of 20 gigabytes, even though the image used to start that virtual machine was very small. This happens before the user logs in for the first time. The cloud-init package also downloads the SSH key that the user chose when starting the virtual machine and injects it into the authorized_keys file of that user in the file system, so when the user SSHes in, he is allowed to log in. It can also download a file that the user provides and execute it. We'll see this later in the lab, where we can create a script with instructions of what should happen; when you launch the instance, you give it this script and you don't even have to log into the virtual machine, because the virtual machine starts, downloads that script, and executes it, which could be a number of steps, like download a file, run an analysis, upload the results. So it lets you do a hands-off execution without having to SSH in and do this manually on each server. That link provides more information about how you can use it and what it can do. It's a very powerful tool and it's available in Amazon and Google and OpenStack, of course.
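To illustrate the cloud-init idea, here is a minimal sketch, using the openstacksdk Python library, of launching an instance with a user-data script that runs on first boot. The image, flavor, and key pair names are placeholders, and the embedded script is just an illustrative download/analyze/upload example, not an actual Collaboratory workflow.

```python
import openstack

conn = openstack.connect(cloud="collaboratory")  # placeholder cloud name

# This script runs once at first boot, before anyone SSHes in; its contents are illustrative
user_data = """#!/bin/bash
set -e
wget -q https://example.org/inputs/sample.bam -O /data/sample.bam    # download input
/opt/workflow/run_analysis.sh /data/sample.bam > /data/results.vcf   # run the analysis
curl -T /data/results.vcf https://example.org/results/               # upload results
"""

server = conn.create_server(
    name="hands-off-worker",
    image="Ubuntu 16.04",     # placeholder image name
    flavor="c2.large",        # placeholder flavor name
    key_name="my-keypair",    # public key that cloud-init injects into authorized_keys
    userdata=user_data,       # cloud-init executes this script on first boot
    wait=True,
)
print("Started:", server.name)
```

With a loop around that call you can launch a whole fleet of workers, each one picking up its own piece of work without anyone logging in.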
Cancer Genome Collaboratory: I'm going to move now into the part where we discuss this environment, which I designed and currently manage at OICR: the project's goal, its history, its progress and current capacity. It's a cloud computing environment built for biomedical research, for cancer research. It was based on a four-year grant awarded to build a large-scale cloud infrastructure in order to store the data sets generated by the International Cancer Genome Consortium. It had multiple cores: core one was building the infrastructure, core two was research, core three was outreach. I'm working on core one, which is building the infrastructure. The goal was to have 3,000 cores of compute capacity and 15 petabytes of storage, and we decided to build it with open source software, so the money from the grant goes to capacity and not to licenses, software, and support. We built it with OpenStack for cloud orchestration, and with Ceph, which is another open source project, for storing very large data sets.

Before I started working on the Collaboratory, I spent about six months as part of a group that was doing BAM alignment and later VCF calling. So even though my training is in IT, not in bioinformatics, I managed a few cloud environments, some of them public cloud and some based on OpenStack, and I saw during the analysis how the resources were used and what types of workflows and workloads are used to do this research. So when I started the design, I knew how to design for this type of analysis. In whole genome sequencing, users download very large files; the largest we had was 800 gigabytes, a tumour file with very high coverage, but most of them are between 150 and 300 gigabytes, and you need at least two files, a normal and a tumour, or multiple tumours depending on the case. The workflows, for a BWA-MEM alignment, could take seven days depending on how many cores the VM had, how much memory, and the disk access. And the data generated after alignment is about as big as the input data, usually a little bit larger, but let's consider it the same, so you generate very large data sets as well. For VCF calling, the data sets generated are much smaller, five or ten gigabytes, just the mutations.

Based on our experience working in public cloud providers like Amazon and in OpenStack-based research environments, it's better to design the workflows to be small and modular, so if you have a failure, it only impacts one analysis. We had a case in Amazon where we had four VMs running in a cluster, and after four days, 80% done, the workflow crashed: you lose everything, four days of very large VMs of compute, all gone. If we had split this into individual analyses, it would take maybe ten days for each to finish, but if you lose one after so many days, you only lost that one, so smaller failure domains. Also, we decided in the end to package the workflows as Docker containers, and this allowed us to have reproducible workflows: you run a Docker container at DKFZ, or we run it in Amazon, or we run it at ETRI, and it runs the same, even if the VMs underneath were Ubuntu versus Red Hat, or different versions. So packaging applications in containers helps a lot with portability and reproducibility.

This is the CPU usage of, I think, a VCF calling workflow; maybe you cannot see it there, but it's about 12 days of usage. The blue line shows the CPU usage. At the beginning of the workflow you see a little bit of CPU, then the first phase runs at 100%, then probably some merging happens and it drops again, but out of the 12 days, probably 80% of the time the CPU was saturated. That also means that you cannot oversubscribe resources in a cloud designed for cancer research, because of the workloads. If we were running web servers that use 10% of the CPU, then on one physical core of the server you can run 10 one-core VMs, because each VM uses just 10% most of the time, so you run 10 of them and you get 100% utilization.
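Here is the oversubscription arithmetic written out as a small back-of-envelope sketch. The numbers (10% average CPU for a web workload, a 40-core host keeping two cores for itself) are the figures used in this talk, nothing more.

```python
# Back-of-envelope CPU packing, using the figures from the talk

host_cores = 40
reserved_for_host = 2
usable_cores = host_cores - reserved_for_host       # 38 cores left for VMs

# Web-style workload: each 1-vCPU VM averages ~10% CPU, so ~10 VMs can share one core
web_avg_utilization = 0.10
web_vms_per_core = int(1 / web_avg_utilization)      # 10:1 oversubscription is reasonable
print("Web VMs per physical core:", web_vms_per_core)

# Genomics workflow: the CPU is saturated for most of a 12-day run,
# so each vCPU needs a full physical core: no oversubscription.
genomics_vcpus = usable_cores                         # 1:1 mapping, 38 vCPUs on this host
print("Genomics vCPUs on a 40-core host:", genomics_vcpus)
```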
In a cloud built for cancer research, if you have a server with 40 cores, you can only provision VMs that use maybe 38 of those cores: you leave two cores for the server itself and give out 38. If those VMs run something like this, they will saturate the CPUs provided to them, so oversubscription is out of the question for cancer research. Other workloads, like most web workloads, are fine with oversubscription; people do it at 16 to 1, one physical core offered as 16 virtual cores, because those VMs use very little of the CPU most of the time. But not in cancer research.

Memory usage: this VM had 48 gigs of RAM. As you can see, most of the time it used maybe 16 gigs, but there are steps in the workflow where it used up to half. That means that, again, you cannot oversubscribe memory, because there are steps in the workflow that will fail if you don't give them enough memory. Another good lesson: do not oversubscribe memory either.

Disk usage shows that at the beginning of the workflow the disk is saturated, because it has to download the data and there is a lot of write activity to the disks, and also in the merging phases there is a lot of disk activity; but outside of that, there is very little I/O activity, reading and writing from the disk. That means that if you schedule your workflows with a little bit of delay in between, they will not all hit the step where they download and write at the same time. So you can have, say, a half-hour spread, and the physical disks that have to execute the read and write requests from the virtual machines will do a much better job if they don't have to do it at the same time for everybody. And the disk space usage: this is the available free space, so it drops at the beginning when the data is ingested from the repository, and after that there's another drop when some temporary files are created, which are then deleted, and the free space goes up again. So monitoring what happens to the virtual machines during the workflows allowed us to decide, when we were going to build the environment, what to focus on and how to do it.

Collaboratory data: if you are familiar with the ICGC data portal, this is the DCC portal at dcc.icgc.org. It is built by the bioinformatics team, of which I'm a part; my colleagues did a great job. It also indexes nightly the data stored in Amazon, in the Collaboratory, and in other repositories. So you can go to the DCC portal, filter and query the data you are interested in, generate a manifest file containing the object IDs of the files you want to work with, and then, as we'll see later, feed that manifest file to the client, download all the files, and use them. We built this custom object storage system, developed at OICR. It uses the S3 API. Because we store most of the ICGC data sets both in Amazon and in the Collaboratory, we wanted to do the work only once, so we built one client: the object storage in the Collaboratory is S3-compliant, so it behaves like the Amazon object store, and we are able to use the same client against both providers, the Collaboratory and Amazon. It has some very interesting features: it can resume downloads and uploads, so if you are downloading a 100-gig file and halfway through your VM crashes or the networking drops, you can restart or resume the download without having to download 50 gigs again; it knows where it left off and continues from there. It also does BAM slicing.
You can tell the client: I don't want to download this entire whole genome, just give me a slice of chromosome 20 from this position to this position. And because the client talks to the metadata server behind the scenes, the metadata server knows which object that slice is in, and you download just the object that contains that slice; you don't download all the objects for the entire 100-gig file. It also has support for a file system in user space, so you can give it a manifest file and mount all those files into the file system very fast. You don't actually have the files: you see them in the directory where you mounted them, you can list them, but only when you actually read from them does it go and pull the data. So if you just want to navigate the structure of the files, you can do it much faster than downloading 100 files and sorting through them just to find out that maybe you don't need all of them. And I would say the client is very fast: in the Collaboratory it takes about 15 minutes to download a 100-gig file, because the Ceph cluster and the OpenStack cluster are in the same physical space, connected by 10-gig networking. The download speed is very fast, and that saves you time for the analysis; for a workflow that takes a week to execute, 15 minutes is not much wasted time to prepare the data.

In the last two years, we had almost 22,000 instances launched. We have more than 80 users; this slide is a little bit old. 34 research labs, from Australia, Europe, Israel, the US, and Canada. We have more than 650 terabytes of protected data sets. I'm talking about whole genomes, and mutation calls from the three pipelines of the PCAWG pan-cancer whole genome analysis project: one pipeline from the Broad, one from Sanger, and one from EMBL/DKFZ, which analyzed the data generated by ICGC. So all this data, which is protected data, is stored in the Collaboratory. A PI can open an account with the Collaboratory and enroll their students, and they have access to the OpenStack environment over the internet: they create virtual machines, download the data, and do the analysis without having to deploy the infrastructure themselves or download the data sets over the internet to their own environment and wait and wait and wait.

What we have right now in the Collaboratory is almost 2,600 CPU cores, and we still have money left in the budget to acquire more hardware. The compute nodes have a total of 18 terabytes of RAM and roughly 2 petabytes of local disk, plus 7.6 petabytes of object storage, which is separate from the compute local storage, with 10-gigabit networking in between. So it's very modern hardware, fast and efficient. The bioinformatics department also developed an in-house reporting app, and this allows the PIs who have accounts here to log in and see usage and cost per day, per user, per type of service. So they can see if a student left a VM running, maybe a student who already left the lab, or ask why this VM with 16 cores and so much RAM has been costing about $3 a day for the last two weeks; oh, I forgot about it. It's a daily updated usage report.

Do you have any questions so far? I'm going pretty fast, a lot of slides. No questions? Okay. [Question about how this compares to a compute cluster the student has used.] I'm not familiar with that particular compute node; it's probably an HPC cluster, and it's totally different. In an HPC cluster, you SSH into a head node and you have a job scheduler; you push a job into the queue and it gets pulled by the workers and executed.
There is usually a shared file system, and you don't have root access; you cannot change the libraries. In OpenStack, you provision your VMs and there is no job scheduler for your work: you are the job scheduler. You can build a job scheduler, but it's not there by default. OpenStack has its own scheduler, but that is used by the cloud to decide where your VMs will run; it knows the capacity available in the cloud, so if you want a VM with 24 cores, it finds a server that has 24 cores available and tells that node: hey, start this virtual machine. That's the internal OpenStack scheduler. If you just want to send a script to be executed, in OpenStack and in Amazon that part is up to you. Sure, you can start 20 virtual machines, install Sun Grid Engine, and then you have a job scheduler with a master node and 19 worker nodes, and you can send your script to that queue. But that's a layer on top of the infrastructure provided by OpenStack: you basically have to deploy your own HPC cluster on top of virtual machines, and there are users who have done that. Unfortunately it's not very efficient, because again you have single points of failure: if the VM where your master node is running has a problem, your entire virtualized HPC cluster is impacted, and if all 19 workers are reading from that one VM, you again have a bottleneck. It's not very scalable. Maybe it works for a cluster with 20 nodes; does it work for a cluster with 50 nodes? Probably not. The more you scale it out, the slower it becomes and the more vulnerable to single failures.

Docker is an open source project that automates the deployment of applications inside software containers. It wraps all the libraries and operating system userland, so a container can be based on Ubuntu and you run it on top of a Red Hat operating system; you can even run it on top of Windows. It's a fully containerized application, with the libraries and everything it needs to run, in one package. And you can pull the Docker container and run it the same way on different hosts. It's very useful for developers, because in the past there was the problem of "I developed this software and it works on my computer, so why is it not working when we deploy it in production?" If you give the IT team that does the provisioning a Docker container, and it works for you, it's going to work in production as well; it's fully containerized, so all the dependencies of the software are in there.

[Question about dependency conflicts.] Yes, that's one of the biggest selling points of Docker, and everybody is moving towards it. Instead of installing Debian packages or RPMs or tarballs, where you get into conflicts with other dependencies on the system, here you start with a base container, you go inside the container and install what you need, and then you create a new container image, and it's very easy to trace the file system layers, the changes that happened. So you can have multiple versions of containers, one container that diverged from another version just because you upgraded one package inside.

Docker is also the company that made container technology popular. It's not a new technology; the underlying ideas go back to IBM mainframes decades ago, but Docker came up with easy-to-use APIs and tool sets, which is what makes it so popular right now. There are other container technologies besides Docker, but so far Docker seems to be the leading solution. There is no cost to use it; the company makes money providing other offerings like enterprise versions, but the open source version works very well for everybody.
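As a minimal sketch of how a containerized workflow step might be launched programmatically, here is an example using the Docker SDK for Python. The image name, command, and paths are hypothetical placeholders; the point is that the same container image runs identically on any host with Docker, whether the VM underneath is Ubuntu or Red Hat.

```python
import docker

# Talk to the local Docker daemon (for example, inside the VM you launched)
client = docker.from_env()

# Pull and run a hypothetical workflow image; everything it needs is packaged inside it
logs = client.containers.run(
    image="quay.io/example/variant-caller:1.0",   # placeholder image name
    command=["--input", "/data/sample.bam", "--output", "/data/results.vcf"],
    volumes={"/mnt/volume1": {"bind": "/data", "mode": "rw"}},  # host directory with the data
    remove=True,   # clean up the container when it finishes
)
print(logs.decode())
```

Because the image pins every library version inside it, rerunning the same image on another cloud gives you the reproducibility discussed earlier.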
Some of the benefits: first of all, containers are much smaller than full VMs. A full VM image is hundreds of megabytes or gigabytes; a Docker container can be very small. It also starts faster: the start-up time for a container is seconds, while a virtual machine still has to boot, so it has to detect the hardware, even if it's virtualized CPU, disk, and memory, and go through the boot process. A Docker container doesn't have to do that; it runs on top of the kernel of the operating system where Docker is running, so it starts much faster. It also runs closer to bare metal, because there is one layer less of overhead: if you run Docker containers on bare metal, on a server where you installed Ubuntu, you'll get better performance than if you had a physical server running Ubuntu, then a VM running Ubuntu, and Docker inside that.

The downside is security. Because Docker containers run as root, if the application or the user who controls the container escapes the jail, it's easier to attack the host as well as other Docker containers. The security story for containers is not very good so far, and that's why you cannot run Docker containers in HPC environments: the IT admin doesn't want user A running containers and potentially impacting user B, or even the host itself, the server managed by IT. What Docker users usually do is start virtual machines and run their Docker containers inside their own VM, so if something goes wrong, it's their own containers attacking each other; you don't share a server or a virtual machine running Docker containers from multiple users. You still get the benefits of containerization, portability and everything else, while minimizing the risk of running containers from untrusted users.

This diagram shows the difference. On a physical server you have the bottom layer, which is the hardware itself, the CPU, memory, everything else; then the operating system; then a hypervisor; then the virtual machines, which have their own operating systems, possibly different from the host, plus their binaries and libraries; and on top of that the application. So, a lot of overhead. With containers, you still start with the physical layer, then the operating system, which could be Ubuntu or Red Hat, and on top of that you have binaries and libraries that can be different from those on the host, and then the application. So the overhead is much smaller, it's faster to boot, and it takes less space. The only problem is security: unless those containers are all yours and you trust them, it's not good to run containers from other people on the same host.

So I split the lecture into two parts, part 1 and part 2, because I wasn't sure how fast I would get through part 1. Do you think I went too fast? Any questions? Can I continue with part 2? Good. So I'll talk a little bit about the ICGC data sets, how you search the portal for the Collaboratory data and how to access it; then how you customize a virtual machine with your software and snapshot it, basically freezing those changes into an image that you can later use to start more virtual machines pre-configured with your changes; and then we'll go to a scale-out planning exercise, which will be pretty interesting if we get there.
So, ICGC. I'm not sure if most of you know what ICGC is. The consortium started about ten years ago, in 2008, so it's a ten-year project. The goal was to collect 25,000 tumour and matched normal DNA samples for common types of cancer, so 25,000 donors and 50,000 genomes. Each of the roughly 50 projects cost about 20 million dollars to fund, and about 17 countries committed funds. For each type of cancer they wanted at least two countries, or two projects, covering that cancer type, because cancers differ: they wanted to see, for example, whether liver cancer in Germany is similar to liver cancer in the US, and skin cancer in Australia is different from the skin cancer covered by Japan. They also wanted to see, once they put all this data together, what the common mutations are between types of cancer; maybe brain cancer in the UK has mutations that also show up in breast cancer in Korea. The more data they have, the easier it is to solve this gigantic puzzle that is cancer.

From these 25,000 donors, it took ten years for the data to be collected and analyzed. About 2,834 donors had whole genome sequencing, which means much more data was generated for those samples. After sequencing, the genomes were aligned using the same algorithm against the same reference genome, because initially each lab aligned the data with its own tools and against different reference genomes, and the results were not uniform, so you couldn't compare them or draw conclusions. So they took all the data and re-aligned it with one algorithm, BWA-MEM; they froze the version that they used and they used the same reference genome. After that, three research groups, the Broad Institute, DKFZ partnered with EMBL, and the Sanger Institute in the UK, developed their own workflows. They also took 63 whole genomes from this set, sent them to the wet lab, found the mutations experimentally, and then compared that with what the three pipelines found. Basically, the goal is to develop a gold standard in mutation calling that is as accurate as possible, as fast as possible, and as economical as possible in terms of the resources needed. Of the three pipelines, of course, one is better at part of the mutation calling and another one is faster, so the goal in the end, after the papers are published, is maybe to merge parts of the pipelines into one gold-standard mutation caller, or to decide which one of them is recommended going forward.

A typical workflow was: you take the tumour sequence, you align it, and then you run the different callers, and you do the same for the normal. [Question about the samples.] Where we had patients with multiple cancers, we had tumour samples for more than one. [Question: and this benchmark is then used downstream for variant quality, the validation in the lab, you mean?] Yes, so they used these three workflows, and the results from the wet lab were considered the gold standard, the correct one; then you look at the bioinformatics workflows, what they found and which one is more accurate. You cannot do the wet lab validation for very large data sets, because it costs a lot and is very time-consuming, but they selected these 63 tumours that have very good coverage and said: okay, we'll do these in the lab to find exactly where the mutations are. [Question: but presumably one pipeline calls mutations that another one doesn't; do they report the intersection, or the union?]
I don't know exactly what they decided in the end. There are working groups in ICGC that focus on different parts of the analysis, and there are papers that will be published, hopefully soon; we'll see what the conclusions are when they come out. One workflow might be better at detecting one class of variants, another one might be better at detecting SNVs, I don't know. But the idea was to have something that the wet lab confirms exists, to see which workflow is better at confirming those results, and to see how many resources each workflow used. If you have a workflow that is very accurate but takes three months to run, is it feasible, or do you want a workflow that takes a week? Is the three-month one good for the clinic? The patient might be dead by then, so maybe 85% accuracy in a shorter time is better than 92% accuracy in three months.

[Question: do they compare different tools, to see which gives more accurate results, or the whole workflow? Is a workflow just the general steps or the actual tools?] It's the general steps but also the tools in the workflow. There are well-known steps that use tools, binaries that already exist, but there are also steps that they developed for this purpose. You have the first step where you download the data, then you do sorting, then you do this and that, and step five runs a binary that does something; that binary didn't exist before, it's not published, it's not open-sourced, the Broad created it. They did the same for a few steps in the workflow. Also, if you change the order, you do splitting and then merging and then sorting, you might get a different result; maybe it runs faster, maybe it's more efficient, maybe the end result is different. If you have 63 results to compare against, you can see: okay, if we do it differently we still get the correct result, but we do it faster, with less memory or less CPU. So it's very important, when you plan to do this at scale, to fine-tune the algorithms and the order and everything else. It's like a contest: the best tool wins.

The BWA alignment, for example, was taking about five days, times two because you have two specimens, tumour and normal, and you needed about 140 gigabytes of storage per specimen, so roughly 120 to 140 gigabytes times two for each donor. And this was with a VM that had 8 cores and 16 gigs of RAM. We did this across about 2,400 donors in 14 cloud environments, with VMs of different shapes: some of them had more memory per CPU, some had less, more disk or less disk. We basically used the compute capacity that was made available by the different organizations that were part of ICGC: DKFZ donated so many cores from their cluster, ETRI did the same, we had an HPC cluster in Barcelona, and we had a budget for Amazon. So we had to adapt to what we had available to work with, and we had to do data management, job management, and quality control; it was probably a year of work to do it end to end.
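As a rough check of the scale involved, here is the arithmetic using the approximate figures just mentioned: two roughly 140 GB specimens per donor, about 2,400 donors, aligned output about as large as the input, and around five days of an 8-core VM per specimen. These are the talk's approximate numbers, not exact project totals.

```python
# Rough scale estimate using the approximate figures from the talk
donors = 2400
gb_per_specimen = 140          # one input BAM, roughly
specimens_per_donor = 2        # tumour + matched normal

input_tb = donors * specimens_per_donor * gb_per_specimen / 1000
print(f"Input data:  ~{input_tb:.0f} TB")                 # ~670 TB, same order as the ~800 TB mentioned below
print(f"With aligned output: ~{2 * input_tb:.0f} TB")     # output is roughly as large as the input

# Compute: ~5 days of an 8-core VM per specimen
vm_days = donors * specimens_per_donor * 5
print(f"Alignment compute: ~{vm_days:,} VM-days on 8-core VMs")
```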
Again, this is a slide from PCAWG. We had access to 14 academic and commercial clouds, with orchestration built around a metadata server that kept track of the progress and of the samples that were analyzed. We were downloading the data from multiple data repositories around the world; there were a few GNOS sites, repositories running software that allowed sequencing centers to upload files. So maybe we had the data in the GNOS site at EBI in London, but no compute capacity available in London; we did have compute available at DKFZ, so we started VMs at DKFZ, and they would download the data over the internet from EBI in London, analyze it, upload the aligned sequences to DKFZ, and then we had to synchronize the repositories, DKFZ back to London. As you can imagine, we are talking about 800 terabytes of data, 14 cloud environments, an international project, US, Canada, Germany; everybody was exchanging data. This is the distribution of the compute resources: the C sites are compute centers, the G ones are the GNOS repositories where the data was stored, and S is where we had S3-compatible data storage. Sometimes we had compute and a GNOS repository in the same place, so you could download pretty fast; if you have GNOS at EBI and you have compute at EBI, it's quick. But sometimes the compute at EBI is already running at 100% and the data is only there while compute is free at DKFZ, so you have to move data around, which is not very efficient. And that's why, when everything was done, we uploaded all the data into the Collaboratory, where we have compute and storage in the same place, so if we have to do something like that again, we have a place where we can do it much, much faster.

The data, as I said, stored in the Collaboratory, in AWS, and in the GNOS repositories that are being retired now, is indexed in the portal. For the Collaboratory we also partnered with the University of Chicago: a subset of the data related to ICGC, the TCGA data, the US data, cannot leave US soil, so we couldn't store it in the Collaboratory, but it is stored at the University of Chicago. It's a protected data set, but if you are a researcher who is approved to access it, you can download it and use it; it's just that, as a research organization that makes the data available, you cannot store it outside of US soil. OICR and the Collaboratory have a very fast connection to Chicago, so a researcher running compute in the Collaboratory can download TCGA data sets stored in Chicago over this very fast network connection, faster than the regular internet, because Toronto and Chicago have fiber optic links dedicated to research.

How can you search the data sets in ICGC? You can do it by sample ID, by type of disease, by project, by type of data; there are many ways you can filter and search for the data you are interested in. At the end, after you search and select the data you want to look at, you can download either a manifest file or, if it's just a single file, its object ID, and then you can use that with the download client to get the data. The custom client, as I said, allows you to request just a slice of a whole genome, to mount a single BAM or multiple BAMs into a directory, and to resume interrupted downloads. It has a lot of features designed to help you work with large data sets: you don't want to restart a download that is midway through and have to re-download 300 GB, and for a large download, even if it's fast, it's still going to take minutes or hours, so the chances of an interruption are quite high. That's why you need a client that can deal with these kinds of interruptions.

Protected data stored in the Collaboratory: the ICGC data set is stored in the Collaboratory and in Amazon, and Amazon has just a subset of the Collaboratory data, because some projects didn't allow their data to be stored in a commercial cloud like Amazon, so we couldn't store it there, but they allowed us to store it in Canada, in a research cloud environment that's open only to cancer researchers.
So we don't accept account requests in the Collaboratory from users who are not DACO-approved. The workflow to access the data is the same in the Collaboratory or in Amazon. You start your virtual machine and install the ICGC storage client, this custom client that we developed. Then you go to the DCC portal; there is a login button, and if you have a DACO-approved account you log in and you can generate a token. The token is good for Amazon or for the Collaboratory. You go back to the virtual machine where you installed the client, you put that token in your configuration file, and then you tell the client to download this file or this list of files. The way it works is that the client first talks to the authentication server and gives it the token; the authentication server verifies that the token is valid, finds the object that matches the ID you gave it, and generates temporary URLs; and after that the client talks to the S3 API and downloads the files directly. So the authorization service is not in the data path; it's only involved at the beginning, when it validates the token and issues the temporary URLs. The temporary URLs are good only for 24 hours, so even if you forward them to somebody who is not approved, the access to the files is limited to 24 hours. And this way we don't give access to the entire bucket where all the files are; we give access just to the files that the token should give access to. It allows more granular, more temporary, more limited access than if we just kept the data in S3 and gave somebody access to the entire bucket.

[Question: so in order to download the data, if you had a cluster, would you just install the client on each node, or connect them somehow?] If I get this right, I'll tell you how it works and maybe this answers your question. First of all, Amazon stores the data for free, with the condition that it is accessible only from EC2 instances running in the region where it is stored. So it's a win-win situation: we get to have this data stored for free, we don't have to pay Amazon a lot of money to store it, and Amazon receives the payment for the EC2 usage by the researchers who need access to the data. So you cannot download the data from Amazon directly to your lab, and the same with the Collaboratory: you go to the Collaboratory, start virtual machines, and download the data onto those virtual machines; you don't download it directly outside of the Collaboratory. If you don't want to use the client for the transfer itself, you can tell it just to generate the temporary URLs; it has a flag for that, and you get back a list of very long temporary URLs, and you can use wget or curl or any other tool you want to download them. The difference is that if you do it like that, you are usually going to do serialized wgets. Instead, on a VM that has 8 cores, if you tell the ICGC client to use 7 processes it will do parallel downloads, fetching parts of the large file at the same time and then putting them together, and if it breaks halfway through it knows how to continue from there. With wget it's not that efficient; yes, you can do it, and maybe you have other tools. In the end, when the download happens, there is a long list of temporary URLs, and the client basically takes 7 at a time, then another 7, if you tell it to use 7 processes.
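To make the temporary-URL mechanism more concrete, here is a small Python sketch of what downloading from such pre-signed URLs can look like, with naive resume support via HTTP Range requests and a few parallel workers. This is not the actual ICGC storage client, just an illustration of the pattern it implements; the URLs and file names are placeholders.

```python
import os
from concurrent.futures import ThreadPoolExecutor
import requests

def download_with_resume(url, dest):
    """Download one pre-signed URL, resuming from a partial file if one is present."""
    start = os.path.getsize(dest) if os.path.exists(dest) else 0
    headers = {"Range": f"bytes={start}-"} if start else {}
    with requests.get(url, headers=headers, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        mode = "ab" if start and resp.status_code == 206 else "wb"  # 206 = partial content
        with open(dest, mode) as out:
            for chunk in resp.iter_content(chunk_size=1 << 20):     # 1 MiB chunks
                out.write(chunk)
    return dest

# Placeholder pre-signed URLs (in reality generated by the auth/metadata service, valid 24 h)
urls = {
    "part-0000": "https://object-store.example.org/bucket/obj0?X-Amz-Signature=abc",
    "part-0001": "https://object-store.example.org/bucket/obj1?X-Amz-Signature=def",
}

# A few parallel workers, similar in spirit to the client's "use 7 processes" option
with ThreadPoolExecutor(max_workers=7) as pool:
    for name in pool.map(lambda kv: download_with_resume(kv[1], kv[0]), urls.items()):
        print("done:", name)
```

The real client adds the token handling, object metadata lookups, and reassembly of the parts into the final BAM.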
You can also work with non-protected data in the Collaboratory: maybe you have your own data sets, data that you generated, that are not stored in the Collaboratory; you can upload that data and use it. You have space to upload it and compute to analyze it, you don't have to wait however long it takes your IT department to provision the resources needed for your analysis, and you have access to the very large protected data sets already stored in the Collaboratory, accessible very fast.

Some of the recommendations we make to our users, when they are working in the Collaboratory or Amazon or any other cloud environment, are these. Use snapshots to freeze and reuse a good working version of your virtual machine. This is less important now with Docker containers, because the container has everything, but usually you start with a base image, an Ubuntu image, and maybe you need to install Docker before you can use it; so you install Docker, maybe you make other changes, and then you take a snapshot. Next time you start your work, you start 20 virtual machines that already have Docker and your changes; you just SSH in, pull your containers, your application, your workflow, pull your data, and start the execution. Because at scale failure is a given, you cannot avoid it; that's why Amazon has multiple regions and tells you to scale your application horizontally. A failure is going to happen: it will probably never happen to you with a single virtual machine in Amazon, because they have so many that the chances are low, but it happens to people who run enough VMs. It happened to us; I had VMs crash in Amazon, and you lose everything if you didn't design your application to deal with a failure. So you have to treat your workloads as ephemeral. Use automation and configuration management to have a consistent application; you don't want each workflow to be a little bit different from the others. Use good monitoring tools to know what your VMs are doing; are they stuck at a step? If you don't know, two weeks later you find out it's stuck there and you wasted two weeks waiting for it to complete, whereas if you had a monitoring solution with a graph of CPU usage, you would see it's flat at 0% or 1%, so it's doing nothing. Installing an agent by hand is easy when you have one VM, but if you have 100 you need a monitoring solution whose agent is installed when the VM starts, so you don't have to do the work again; it's already there. There are tools like Sensu, which is a newer type of monitoring solution where the servers connect back to a monitoring server and auto-register, so you don't have to go to each of them to configure it; as they come up, they call back home and register themselves. You want a solution that is as hands-free as possible.

I'm going to cover the snapshot a little bit, because it might be a new term for you. Snapshotting is freezing the hard disk of a virtual machine with all its changes, like taking an image; it stays like that. When you take the snapshot, it is uploaded to the image repository in the cloud, so next time you start a virtual machine, if you choose the snapshot as the base image, you'll get the same virtual machine as when you took the snapshot. So you start with a plain image, let's say; you go into the virtual machine, you install your application, you install Docker, you install some repositories, some certificates, whatever you have to do, and then, when you are happy with that virtual machine, you take the snapshot.
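As a minimal sketch of that step, here is roughly how you could create a snapshot of a configured instance with the openstacksdk Python library, and then boot new instances from it. The cloud, server, flavor, and image names are placeholders; the same thing can be done from the dashboard or the command line.

```python
import openstack

conn = openstack.connect(cloud="collaboratory")  # placeholder cloud name

# Find the VM you configured by hand (placeholder name)
server = conn.get_server("configured-worker")

# Freeze its current disk state as a new image in the image repository
image = conn.create_image_snapshot("worker-gold-image", server, wait=True)
print("Snapshot ready:", image.id)

# Later, new instances can boot from this "gold image"
clone = conn.create_server(
    name="worker-01",
    image=image.id,
    flavor="c2.large",        # placeholder flavor name
    key_name="my-keypair",
    wait=True,
)
```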
That snapshot basically becomes your new gold image. Tomorrow, if you want 30 VMs like that, you start 30 based on that one and you have 30 exactly the same, like twins.

There are some considerations when taking a snapshot. You have to clean up confidential data, because if that virtual machine contains some private keys, some tokens, some credentials, when you take the snapshot that image is available to anybody in your project; it's not public, but it is available to any other user of the project that you are part of. Say John and Mary are two students: John creates a snapshot but forgets his credentials in there, Mary starts an instance from the snapshot, and she has access to John's credentials. Or maybe you want to share the snapshot with Joe, who is part of another project; you want him to take your snapshot with your application, give it a try, validate it, work on it. So you clean it up, before taking the snapshot, of any confidential data that shouldn't be part of it.

Also very important: how do you do that? You remove the authorized_keys file. That matters because if you don't remove this file, you can theoretically SSH into all the virtual machines started from that snapshot. It's like giving somebody your house but keeping a set of keys: when they start VMs from that snapshot, their key is appended to this file, but your key is still there, and if the security groups, the firewall, allow SSH, you will be able to get in. So it's basic politeness not to keep your keys in the image. But it also means that once you delete this file, you cannot SSH back in yourself, so you do this as the very last step before taking the snapshot; once you delete the file and take the snapshot, if you want to SSH in again you won't be able to, you're locked out of your own house.

Keep the image size small. What happens is basically this: the virtual machine starts from a base image, which is small; you install your application, which grows the VM disk; then maybe you download 50 gigs of reference data, you run a test analysis, you are happy with the result, you delete the 50 gigs of data, and then you take a snapshot, and it takes forever. Why?
Keep the image size small. What happens is this: the virtual machine starts from a base image, which is small; you install your application, which grows the VM disk; then maybe you download 50 gigs of reference data, run a test analysis, you're happy with the result, you delete the 50 gigs of data, and then you take a snapshot and it takes forever. Why? It was a small image. No: the disk on the hypervisor, on the server where this VM is running, ballooned when you put in the 50 gigs of data, and when you delete the data, the host file system doesn't know that you deleted it inside the VM. So when you take the snapshot, the file that has to be uploaded to the image repository contains 50 gigs of empty space, but it's still 50 gigs. Next time, when you start one image or 50 images from that base image, it's going to take longer, because you have to wait for 50 gigs to be downloaded, and you pay the cost of moving around a snapshot that is 50 gigs.

So what you do is: you start an instance, install your software, download your reference data, run an analysis, and you're happy. Then you start a second instance, install just the application without the reference data, and that's the one you snapshot. You don't snapshot an instance that you inflated and then deflated, now that you know what happens on the back end. So the idea is: I did these steps and it works, I just have to do them again without downloading the data. Can you take the snapshot before you download? You can, but without running the applications on the downloaded data you might not be sure they work the way you think they do. And if you run history in bash, it will show you all the commands you ran, so you can easily see what you did and run them again on a fresh instance. Why doesn't the system recognize when you delete something, why doesn't it check? Because the operating system the virtual machine runs on doesn't see inside the instance: the disk of the virtual machine is just a file on the hypervisor's file system, and that file grows as you make changes but doesn't shrink when you delete files inside. That's one of the things to keep in mind if you work in Collaboratory, another OpenStack cloud, or Amazon. It also takes time to take the snapshot itself if the instance is large, because the snapshot process shuts the instance down, takes the snapshot, and then has to upload this very large file to the image repository; the larger the snapshot, the longer it takes, not only to take the snapshot but also later when you provision new virtual machines based on it.

Mistakes can be costly. If you have a bug in your provisioning process, you start 10 instances, you are charged by the hour, and you shut them down 10 minutes later, you're going to pay for 10 hours. That's not as valid anymore, because Amazon used to charge by the hour and now I believe they charge by the minute, but before, when people had CI/CD pipelines, continuous integration and continuous delivery, they would start a fleet of servers, shut them down, and do it again, many times in an hour, and pay many hours: if you start 10 instances every 10 minutes, in one hour you accumulate 60 hours of usage. Now Amazon charges by the minute, and Google has also reduced its charging interval. In the Collaboratory we charge by the minute too: we add up the time and only round up the total to the next hour. So if you have a VM that runs for 30 minutes, one that runs for 20 minutes, one that runs for 10 minutes, and one that runs for 5 minutes, you have 65 minutes and you'll pay for 2 hours; you don't round up every VM, you round up only the total.
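To make the difference concrete, here is the arithmetic for those four short-lived VMs under the two models; the runtimes are just the example above.

```python
import math

# Runtimes in minutes for four short-lived VMs (the example above)
runtimes_min = [30, 20, 10, 5]

# Old per-instance hourly rounding: every VM is billed at least one full hour
per_instance_hours = sum(math.ceil(m / 60) for m in runtimes_min)   # 4 hours

# Collaboratory-style: add the minutes first, round up only the total
total_hours = math.ceil(sum(runtimes_min) / 60)                     # 65 minutes -> 2 hours

print(per_instance_hours, total_hours)
```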
web server, you upload your reference data there, and then you start the 30 virtual machines that need that reference data and they download it from the local web server. It's much faster than having each of them fetch the reference data from outside, and much cheaper. It's also good to run all the security updates when the virtual machine starts, because the operating system image might not be up to date and security updates come out almost daily; good practice is that the first step after you SSH in is to update all the packages.

I think I have time for the scale-out planning. Let's say you created a workflow that performs mutation calling and you want to run it on 100 samples. Based on your previous tests, it takes about 24 hours to complete the workflow on a VM with 2 cores and 16 gigs of RAM. You have an account with a cloud provider, the Collaboratory or somewhere else, where you have a budget of 100 cores for 72 hours continuously, which in the Collaboratory is about 100 dollars. But you also have a CPU quota, which means you cannot just grab 1,000 cores for half a day and be done; they give you a maximum of 100 cores. So you have to organize your workflow and your data to fit this time and budget equation. Then there are your data samples: you have 100 samples, but 85 of them are around 180 gigs and 15 of them are much larger, around 310 gigs. And your cloud provider, wherever you run it, has a number of predefined hardware templates to choose from. So the cloud is flexible, but it's not "I want half a core and 7 gigs of RAM" or "I want 3 cores and 15": you cannot just make up your own, because the physical servers have a set number of cores and amount of memory, and the provider splits them into equal chunks so that, like a puzzle, like Lego, the whole server gets used. If someone took 3 cores and 100 gigs of RAM, that server would have a lot of cores left but no memory, and that's not good. So they have predefined templates, and it's the same in the Collaboratory: 1 core and 8 gigs of RAM, and so on. These are your predefined hardware options.

So how do you organize your VMs to fit in this budget? You can have a VM with a Python-based web server into which you load your 100 sample UUIDs, just the IDs, and the web server accepts requests on different paths: small sample, large sample, and done. When a VM starts, it does a POST request to this address, just an HTTP POST, and what it gets back is the UUID of a sample. The web server marks that UUID as in progress, because now it knows this VM is working on that sample. The VM takes the UUID, downloads the data, runs the analysis and everything it has to do, and at the end of the workflow the last step does a POST request against the same server saying "hey, I'm done", so the server moves that UUID from in progress to finished. You, as a bioinformatician, can just monitor this web server and see how many samples are in progress, when they went in progress, and whether they are staying there too long, which means there is a problem with the VM working on that sample.
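Here is a rough sketch of such a dispatch server using only the Python standard library; the paths, port, and sample UUIDs are illustrative, and a real setup would want persistence and some authentication. Workers POST to /small-sample or /large-sample to claim a UUID and POST to /done when they finish, and a GET shows the current progress.

```python
# Minimal job-dispatch web server sketch (standard library only).
# Paths, port, and sample UUIDs are placeholders for illustration.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

queues = {
    "/small-sample": ["uuid-001", "uuid-002", "uuid-003"],  # ~180 GB samples
    "/large-sample": ["uuid-101", "uuid-102"],              # ~310 GB samples
}
in_progress, finished = set(), set()

class Dispatcher(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path == "/done":
            # Worker reports completion with a JSON body like {"uuid": "uuid-001"}
            length = int(self.headers.get("Content-Length", 0))
            uuid = json.loads(self.rfile.read(length))["uuid"]
            in_progress.discard(uuid)
            finished.add(uuid)
            self.send_response(200)
            self.end_headers()
        elif self.path in queues and queues[self.path]:
            # Hand out the next sample and mark it as in progress
            uuid = queues[self.path].pop(0)
            in_progress.add(uuid)
            self.send_response(200)
            self.end_headers()
            self.wfile.write(uuid.encode())
        else:
            self.send_response(404)
            self.end_headers()

    def do_GET(self):
        # Progress view for whoever is monitoring the run
        body = json.dumps({
            "queued": {path: len(ids) for path, ids in queues.items()},
            "in_progress": sorted(in_progress),
            "finished": sorted(finished),
        }).encode()
        self.send_response(200)
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("0.0.0.0", 9000), Dispatcher).serve_forever()
```

On the worker side, the first and last steps of the workflow are then just two POST requests against this server, for example with urllib.request or curl.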
On day one you start 49 VMs with two cores, the c1.small flavor, using maybe Ansible, which is a configuration management tool. The VMs come up, do a POST request, get their samples, and run the analysis. At the end of day one you have used 49 times 2, so 98 cores of your 100-core quota, and completed 49 of the small samples. On the second day you start 34 small VMs, using 68 cores, but you also start four c1.jumbo, because that flavor has enough space to accommodate the large samples and also has more cores. At the end of day two you have 83 of the small samples done and 4 large ones. On day three you do the same, a few VMs of the small flavor and a few jumbo ones, using 88 cores, and you still have a little bit of budget left, so if any VMs failed on day one or day two you can rerun them.

What this tells us is that it's important to match the flavor to the job; do not just start one type of flavor for every kind of job. Look at the data sample: if the sample is small, start a small VM; if the sample is large, start a large one. Don't use the same size for everything; that is going to save you a lot of money, especially at scale. If you are talking about one, two, three samples it maybe doesn't make a difference, but in the future it's going to be about numbers: more and more analyses are very computationally intensive, and you cannot draw conclusions if you find something just once, you have to find it many times to confirm it's not a coincidence, so it has to be large-scale analysis.

Monitor the progress of your analysis and be ready to rerun stuck or failed workflows. You can't just start the work, go do something else, and then come back to find out that half of them died because you had a problem with your web server; you have to keep an eye on the workflows. Start small, so you don't launch 100 VMs from the beginning; that's why I said do the test, see how long it takes, and confirm everything is working as expected in your workflow. And try to limit external dependencies. If you download data from an FTP server in another part of the world, and that data is needed in step 3 of the workflow, how about downloading the data once and putting it on a server that is local? Then if that server in Germany is down, you are not impacted, and the same if the Ubuntu repository has a problem with a package: if you take a snapshot with everything already installed and working, you know you don't have any external dependencies that can break your analysis.
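As a concrete example of removing such a dependency, the reference file can be mirrored once onto a VM in the same environment (even Python's built-in http.server module, run in the directory that holds the data, is enough to serve it) and every worker then fetches it over the local network. The mirror address and file name in the sketch below are placeholders.

```python
# Sketch: fetch reference data from a mirror VM inside the same cloud,
# instead of an FTP server on the other side of the world.
# The mirror IP, port, and file name are placeholders.
import shutil
import urllib.request

LOCAL_MIRROR = "http://10.0.0.5:8000"   # VM in your own project serving the data
REFERENCE_FILE = "reference.fa.gz"      # illustrative reference data file

with urllib.request.urlopen(f"{LOCAL_MIRROR}/{REFERENCE_FILE}") as resp, \
        open(REFERENCE_FILE, "wb") as out:
    shutil.copyfileobj(resp, out)       # stream to disk without loading it all in RAM
```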