Hi everybody and welcome to the bioinformatics workshop. My name is George Mihaiescu. I'm a cloud architect in the bioinformatics department, and I architected the Collaboratory environment, which is a cloud environment designed for bioinformatics and especially cancer genomics. It has special tunings and design decisions that cater to this particular kind of workload: very large files and very long, CPU-intensive jobs.

In this module, module number three, we'll look at the current state of the cloud, because cloud is a relatively new domain and it's still changing very actively. Who are the main providers? What is available in the research space? We'll talk about Docker, its benefits, and the difference between a Docker container and a virtual machine, for those of you who haven't used Docker yet and may be more used to virtual machines or HPC clusters, and how Docker is becoming a very important tool in bioinformatics. Then we'll look at Dockstore, which is an OICR project to create a Docker Hub-like registry of Docker containers for bioinformatics.

If we look at the history of how IT was delivered, thirty or forty years ago there was a mainframe, everybody had a terminal, and you had to book time slots on the mainframe. Then we moved to individual servers and more distributed, smaller workloads. The point is that it's a very dynamic, fast-changing field. Maybe two or three years ago, a software department would order a server for a project, and the IT department had to buy the server, order it through a vendor, wait for the shipment, rack it in the data center, connect the cables, and so on. It could take months of planning and waiting before you could actually do something useful with that server. With the advent of virtualization the process became much more streamlined, but the software developer or project manager still had to fill out forms, ask the internal IT department to provision a virtual machine, wait for the networking team to configure the network and maybe the security team to configure the security, so it still took days. Then cloud came along: you sign up for an account with one of the public cloud providers, put in your credit card, and start a virtual machine that is available within a few minutes. You use it for as long as you wish, then terminate it, and you are only charged for that time.

Then, more recently, Docker appeared, a company that took an older Linux kernel capability, LXC containers, and added an API and management interface that lets you start Docker containers, which we'll talk about in a minute. The idea is that you can start a Docker container in a matter of seconds, so much, much faster. And because technology never stops, AWS has now come up with an even newer concept, Lambda functions, or serverless code, where you don't actually have access to a server. You don't deal with file systems, disk space, CPU cores, or monitoring; you just upload your Lambda functions, which can be written in Python, Java, or a few other supported languages, and AWS executes that code for you, charging only for the number of requests your functions handle and the time your code runs. This is the latest innovation, and we'll probably see it used more and more as it matures.
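Just to make the "upload your code and let AWS run it" idea concrete, here is a minimal sketch of a Python Lambda handler. The payload field is made up for illustration, but the `lambda_handler(event, context)` entry point and the pay-per-request billing model are how Lambda actually works.

```python
# Minimal AWS Lambda handler sketch (Python). AWS invokes this entry point
# for every request and bills only for requests and execution time.
import json

def lambda_handler(event, context):
    # 'event' carries the request payload; 'context' carries runtime metadata.
    name = event.get("name", "world")   # hypothetical input field, for illustration only
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}"}),
    }
```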
Right now you can't do everything with Lambda functions, but there are workflows and applications that can be converted from a regular application into Lambda functions, and it can be very cost-efficient.

If we look at the cloud technology providers, there are basically three large public providers: AWS, which probably has 80% of the market, Google Cloud, and Microsoft Azure, which are also investing a lot of money in this field. On the private side there are a lot of smaller cloud providers, especially in the US but also in Europe: Rackspace is one of them, SoftLayer, which was bought by IBM, DreamHost. In Europe alone there are something like nineteen or twenty providers running on OpenStack.

What's the difference between public and private? A private cloud is an environment that is built for, and accessible to, just one organization. OICR has its own private cloud; it's available only to OICR researchers and accessible only from inside the OICR network. It may or may not have a cost-recovery model, but it's that organization's environment. A public cloud, by contrast, is multi-tenant and you charge for the service; it's a business, basically. And what about academic versus commercial? The Collaboratory, for example, would be a public academic cloud. Yes. There are others which are specific to one field: they're public in the sense that, if you're interested in, say, microbiome research, there are bacterial DNA sequence data sets that are open to students, and once the students leave they can still go back and use the resource; it's at a university called Radle or something, so it's a cloud-like infrastructure in a university setting. It's public, but it's specific to that kind of activity, supporting people working on microbiomes. Yeah, so the definition is a bit blurry, because you can have a mix: you can have an academic cloud that still charges money, so is that commercial in a sense? Maybe the distinction doesn't matter that much.

OpenStack basically started in 2010. Initially it was a joint project between NASA and Rackspace, which is a large data center provider in the US. NASA was working on cloud software to store its images of space, and Rackspace was working on cloud software to manage virtual machines in its data centers. They found out about each other's projects and said, okay, let's open source our code, and that's how OpenStack started. After a few years Rackspace released control of the code, the OpenStack Foundation was created, and the Foundation has gold members and platinum members, but also regular people who can be individual OpenStack Foundation members, which I am one of. Anybody can sign up, and you get to vote twice a year. There is a board of directors and technical chairs, so there is a governance model, and there are two conferences a year, each coinciding with a new OpenStack release. The other interesting thing about OpenStack is that early on they decided to write the software in Python, because it's a higher-level language, very easy to troubleshoot and also very easy to learn, so the developer base grew fast. In the last release I think something like 2,000 people around the world committed code into OpenStack.
Most of them work for companies, but there are also individuals who sign the contributor agreement to donate their code changes, and then they can commit changes. It's developed in Python, on GitHub. Very importantly, OpenStack innovates but also tries to offer the same services that are available in Amazon, so OpenStack has the same building blocks. I'll discuss the main parts of OpenStack in the next few slides, but in terms of concepts and features it's very similar to Amazon and Google Cloud, and even Microsoft follows the same model. So the market, even though it's still new, is starting to converge on some standards for what a cloud environment offers.

We talked about OpenStack and its six-month release cycle. Initially it focused on the core services: compute, meaning it starts virtual machines but can also manage physical servers, so OpenStack can deploy an operating system directly onto a server. When you are done with that server you terminate it, the server is reformatted and put back into a pool, and somebody else can say, hey, give me a bare metal server instead of a virtual machine, and they get a dedicated bare metal server with the operating system of their choice already installed. It just takes a bit longer than spinning up a virtual machine, but OpenStack can do that as well, depending on the environment where it is installed. Then networking, block storage, and we'll talk about all of these pieces in the next few slides. Currently there are more than nineteen projects offering all kinds of services: telemetry, meaning metrics for the cloud used for billing; orchestration, if you are familiar with CloudFormation from Amazon; database as a service; messaging as a service; logging as a service; everything as a service, basically.

This is a diagram of what OpenStack looks like, and it's actually only a small piece of its architecture. I'm not sure you can see the diagram very well, but basically users usually come in over the internet and either connect to the dashboard, which is a web user interface, or talk to the APIs of the services. The Nova API is the API service that takes requests from the users and executes them; it talks to the compute nodes to tell them to start virtual machines or terminate them. Neutron is the API for the networking service; it takes user requests to create virtual networks, start DHCP servers, allocate IP addresses, all the work that your network admin would normally have to do. Through a simple API call or a click on the dashboard, OpenStack does it for you, so you don't have to talk to your networking team, your sysadmin team, or your security team. The developer gets the power to do all of this very easily. You still have to know some networking, some security, some system administration, but you don't have to wait for anybody, and you can script it, so it's much, much easier and faster than how it was done before.

Some of the design tenets of OpenStack are that it has to be scalable, with no single points of failure, and highly available. All of the services provide a RESTful API, they talk to each other either through their own API services or through a messaging queue, and they use a database for keeping state only where it's needed.
The database of choice in OpenStack right now is MySQL, which has an active-active option, so usually you run three MySQL servers, all of them active. If any of the servers holding the database fails, your cloud is not affected, and you can do maintenance by taking one server down at a time, so it is designed to allow live maintenance with minimal user impact.

This is a screenshot of the dashboard. OpenStack has a project name for each of its pieces: Nova is the project name for compute, Neutron is the project name for networking, and Horizon is the project name for the dashboard, so if you hear about Horizon or the dashboard, it's the same thing. The dashboard is a web application built on Django, a Python framework, and it's highly customizable: you can add new panels and more functionality, the source code is public on GitHub, and you have the freedom to innovate. Many private cloud environments and academic institutions take OpenStack and customize it for their needs, without paying any licensing fees.

Nova is basically a cloud computing fabric controller; it controls the compute side. As I said, it can deal with bare metal servers, it can control the Docker engine and start Docker containers, and it can deal with the Xen hypervisor, QEMU and UML (which are not widely used), Hyper-V (the hypervisor from Microsoft), KVM (the open source Linux kernel hypervisor), and VMware vCenter. So all the major hypervisors, open source and commercial, are supported. The companies behind the commercial hypervisors, like VMware and Microsoft, actually wanted to provide drivers for OpenStack so people could keep using their licensed hypervisors with it; otherwise they would have been left out of this market, so they became OpenStack Foundation members and support and improve the code.

OpenStack Cinder is the code name for block storage. If you are familiar with Amazon EBS, Elastic Block Store, Cinder is the same thing. Imagine that you have a virtual machine running on a physical server and that physical server crashes; when you run at scale, failure is a fact of life, and the more scale you have, the more failures you see. So usually you start your virtual machine but you also attach a volume, which is space on a physical server other than the one where the VM runs, and you keep your data on that volume. If the virtual machine becomes unavailable because the physical node where it was running crashed, your volume is fine: you create another VM, attach the volume to it, and you are back in business. It's like an external hard drive that you attach to your laptop: if your laptop catches fire, the hard drive is fine, you move it to another machine. You can also grow it, so it allows a virtual machine to expand its block storage without having to resize the entire VM. Say you start your workload with a four-core VM that has 200 gigabytes of disk, and a month later you need more space. What do you do? You could start a new, larger instance and move all your applications over, or you can just create a Cinder volume, attach it to your instance, and now you have three extra terabytes available. To use it, you create one or more partitions if you want, format it with the file system of your choice, mount it, copy your data over, and it's good to go. And when you don't need the instance anymore, you just detach the volume, terminate the instance, and your volume with your data is still there.
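To show what that looks like in practice, here is a minimal sketch using the openstacksdk Python client; the cloud name, server name, and volume size are hypothetical, and the exact keyword arguments can vary slightly between SDK releases, so treat this as an illustration rather than the definitive recipe.

```python
# Sketch: create a Cinder volume and attach it to a running instance with openstacksdk.
# Formatting, mounting, and copying data still happen inside the VM afterwards.
import openstack

conn = openstack.connect(cloud="collaboratory")      # hypothetical clouds.yaml entry

server = conn.get_server("alignment-worker-01")      # hypothetical instance name
volume = conn.create_volume(size=3000, name="extra-data", wait=True)  # ~3 TB volume
conn.attach_volume(server, volume, wait=True)

# Later, when the instance is no longer needed, the volume (and the data) survives:
# conn.detach_volume(server, volume, wait=True)
```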
OpenStack Neutron is the software-defined networking solution for OpenStack. It allows users to create complex networking scenarios. A very common design pattern in enterprise applications is a three-tier architecture: a web server at the front, a middle application layer, and a database backend, and these three layers are usually separated onto different networks for security reasons, so only your middle layer can talk to your database and only your front end can talk to your middle layer. You can do the same with OpenStack: you create three networks and attach your virtual machines to different networks when you start them. You can even choose the same IP addresses you have in your local environment, so you can replicate exactly the same setup as in your local data center, in a public cloud or in a public cloud provider based on OpenStack. Neutron is the project that deals with creating virtual networks, subnets, virtual routers, and all the networking plumbing needed for a functional environment; there's a small scripted sketch of this three-tier pattern a bit further down.

Some of the very large OpenStack users: of course the telecom companies in the US and Europe, AT&T, Verizon, Deutsche Telekom. Walmart served about two billion pages on Black Friday a couple of years ago from web servers running in their OpenStack-based environment. Best Buy runs their web apps on OpenStack; eBay, PayPal, Sony; the companies are too many to list. Most of them have a mix of technologies, and OpenStack is one of them, becoming a larger and larger portion as they migrate more and more of their workloads onto it. CERN, the physics research institute in Switzerland, has about 15,000 physical servers running OpenStack; imagine just the cost savings of not having to pay licensing fees to VMware or Microsoft.

In research, at OICR we have two OpenStack-based environments. One is internal, right here in the building, used by OICR staff, and the other is the Collaboratory. Let me talk a little bit about the Collaboratory: it started as a grant-funded project to build a cloud environment that holds the very large cancer genomics data sets collected by the ICGC project, together with about 3,000 compute cores, and this cloud environment is meant to be used by researchers in Canada and all over the world. Not Ontario? I believe it's funded by the federal government. We are in the beta stage right now, which means we are up and functional, and have been for about a year, but we only accept a limited number of researchers with accounts. We have about 500 terabytes of data, indexed and available on the DCC portal, and people can apply for an account in the Collaboratory. They get a quota of resources they can use, CPU cores, disk, and memory, and the data lives in the same environment, so it's very easy for them to pull the data onto their virtual machines and do their analysis, instead of having to copy the data over the internet, which might take them months. It's basically a model where the compute capacity and the data needed for analysis live in the same place.
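Coming back to the three-tier networking example from a moment ago, here is the small sketch I mentioned, using openstacksdk; the names and CIDR ranges are made up, and this only illustrates the pattern, not the Collaboratory's actual configuration.

```python
# Sketch: one network/subnet per tier (web, app, db), all wired to a single router.
import openstack

conn = openstack.connect(cloud="collaboratory")        # hypothetical clouds.yaml entry
router = conn.create_router(name="three-tier-router")

for tier, cidr in [("web", "10.0.1.0/24"),
                   ("app", "10.0.2.0/24"),
                   ("db",  "10.0.3.0/24")]:
    net = conn.create_network(name=f"{tier}-net")
    subnet = conn.create_subnet(net.id, cidr=cidr, enable_dhcp=True,
                                subnet_name=f"{tier}-subnet")
    conn.add_router_interface(router, subnet_id=subnet.id)
```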
DKFZ, the German Cancer Research Center, has a large OpenStack-based cloud environment, and so do ETRI in Korea and the European Bioinformatics Institute. Pretty much every research institute, not necessarily cancer, has an OpenStack-based cloud, so as you go on to work in different research institutes you will probably use the same tools and the same skill set you learn here; it's going to be useful.

Some of the extended functionality that the cloud offers: as I said, software-defined networking lets users create virtual networks; block storage extends storage beyond the virtual machine's own characteristics; and object storage is another very useful feature. If you are familiar with Amazon S3, object storage basically works like a Dropbox: it lets you upload files over HTTP and access them over the same kind of interface, and it supports very large files. This is how we store the files in the Collaboratory: we keep them in an S3-compatible object store, using the same software that was used to upload some of the data sets into Amazon. So without having to rewrite the software we use to upload and access the data in a controlled manner, we can use the same software for both OpenStack and Amazon.

cloud-init is a package which is pre-installed in the cloud images of most recent Linux distributions, and it runs when the virtual machine starts. Say you boot a VM running Ubuntu 14.04 and you choose a flavor that has a 100-gigabyte disk: cloud-init resizes the file system at boot so it uses the entire 100-gigabyte disk; if you start with a medium flavor it grows it to that flavor's size, say 300 gigabytes, and so on. So when you SSH into the virtual machine you see all the available space in your file system; otherwise, because it's the root disk and already mounted, you might not be able to extend it yourself. That's just one of the things it does for you, growing the file system. Another thing: it takes the SSH key that you chose to inject when you started the VM, downloads that key from the cloud, and puts it in the Ubuntu user's home directory, so when you SSH with your private key you are allowed access into the server. You can also pass in a file when you start the instance that will be made available inside the VM; it's not copied over SCP, it's not attached as a disk, it's just placed at the location where you want it to be. You can run a script, and you'll see how you can start an instance and give it a script that gets executed on boot, which allows for a great deal of automation. That's very useful when you have to scale your workload and do many more things with less effort and less time spent configuring and repeating your steps. Another thing you can do is a callback to a URL when the VM has finished booting, a kind of self-registration: the VM starts, you don't want to sit there waiting for it to be ready, you want to be notified when it is done, so you run an API server that listens and you tell the VM, hey, when you are done, do a POST request to this URL, and as your VMs start you'll see them registering themselves, and then you can take other actions based on that information. This is a link to the cloud-init project's documentation; they have a lot of examples of useful things you can do with cloud-init.
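As a rough illustration of the user-data and callback ideas, here is a sketch that boots an instance with openstacksdk and passes a small cloud-init script; the image, flavor, key name, and callback URL are all hypothetical, and the `userdata` handling is as I understand the SDK's cloud layer, so check the documentation for your release.

```python
# Sketch: boot a VM with an SSH key and a cloud-init user-data script.
# cloud-init grows the root filesystem, injects the key, then runs this script on first boot.
import openstack

USER_DATA = """#!/bin/bash
apt-get update && apt-get install -y docker.io                       # example provisioning step
curl -X POST https://example.org/register -d "host=$(hostname)"      # hypothetical boot callback
"""

conn = openstack.connect(cloud="collaboratory")   # hypothetical clouds.yaml entry
server = conn.create_server(
    name="worker-01",
    image="Ubuntu 14.04",       # hypothetical image name
    flavor="m1.medium",         # hypothetical flavor
    key_name="my-ssh-key",      # SSH key injected by cloud-init
    userdata=USER_DATA,
    wait=True,
)
```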
The nice part is that Red Hat, CentOS, Ubuntu, Debian, everybody pre-installs this package in their cloud images, so when you go to a cloud provider and use the cloud version of one of these distributions, it is practically guaranteed to have the cloud-init package installed, and you can make use of it.

Let's look at Docker. Docker is an open source project, again only a few years old, for automated deployment of Linux applications. It takes an application that you write and packages it into a container. That lets you hand the Docker container to somebody else, they run it, and it just works. You can develop your application on your Mac, create a Docker container, give it to your sysadmin who runs Debian 7, he runs it, and it works; he gives it to his colleague who runs Ubuntu 14, and it works there too. No more complaints of "it doesn't work on my system but it works on mine", "what version of the libraries do you have", "there's a conflict", "I can't update my libraries to the version you built your application against", and other dependency issues. What Docker does is take everything your application needs and put it, as I said, into a container, which is very portable and guarantees the same runtime environment wherever it runs. It also has other benefits in terms of overhead and speed of deployment.

Here's another analogy: you can run an application that requires the latest version of a library. Say your application is written in Python 3 and you want this container, this application, to run on your HPC cluster, but the cluster runs Debian 6, which only has Python 2.7, so when they try to run the application directly it doesn't work. If you give them a Docker container instead, the container has inside it, from when you built it, the Python 3 pieces the application needs, so you can easily run it on a host that only has Python 2.7 and it just runs, no problem.

Containers are much smaller than full virtual machines because they share the kernel and the main operating system; only the changes are captured in the Docker container. If you have ten Docker containers you are not going to have ten copies of the Ubuntu kernel like you would with ten virtual machines, so space is saved. And because a container doesn't have to boot an entire operating system, it doesn't have to go through the whole boot sequence, which is fast for Linux but can still take a few minutes before a shell prompt is ready for you.
With Docker it's a matter of seconds, and because it sits closer to the operating system, with no emulation layer like a hypervisor has to provide, it has much better performance, close to bare metal performance. But of course there are downsides. Your Docker container runs on the operating system like a process, so there is a risk that whatever you run in that container could escape the container and see, or even take control of, processes from other containers. For this reason, especially in public environments, people run Docker containers on top of virtual machines: virtual machines have a much better security profile, and the hypervisor can put a fence around the virtual machine so it cannot exploit the host or other virtual machines. So, at least for the moment, people who run Docker containers on bare metal do it in their own dedicated environments; you might have three servers and run a hundred Docker containers, but they're all your Docker containers, and if one of them does bad things, it does bad things to your other containers. You won't see commercial or public cloud providers running Docker containers directly on bare metal; usually they keep virtual machines up and running and schedule your Docker container inside a dedicated virtual machine. It's a technology that's still new, and hopefully security will improve to the point where the risks are controlled and Docker containers can run in a multi-tenant environment on the same physical servers, so you get the speed, the space savings, and all the other benefits without the current security risks.

Here is an example of how containers differ from VMs. You see the physical server in the black box, and then the operating system, usually Linux, on top of it. On the left side the gray area is the hypervisor, for example KVM, and on top of it are virtual machines that each have their own operating system. So you can have an Ubuntu host with a Red Hat virtual machine, and the Red Hat virtual machine has its own Red Hat operating system with its binaries and libraries, plus your application; and if you have three of these, you have three times the space used by their kernels and libraries, which is pretty inefficient. On the right side you have a physical server with the operating system, which still has to be there to manage the hardware, and then Docker containers, which share the kernel and the main parts of the operating system; each container is essentially just your application, so the overhead is much, much less than with virtual machines, and that's also why they start faster and run better.
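To make the earlier Python 3 example concrete, here is a tiny sketch using the Docker SDK for Python (assuming a Docker daemon is running and the `docker` Python package is installed); the host's own Python version doesn't matter, because everything the tool needs ships in the image.

```python
# Sketch: run a Python 3 command inside a container on a host that may only have Python 2.7.
# The container shares the host kernel, so it starts in seconds; the image layers are cached locally.
import docker

client = docker.from_env()
output = client.containers.run("python:3", ["python", "--version"], remove=True)
print(output.decode().strip())   # e.g. "Python 3.x.y", regardless of the host's Python
```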
This is another visualization of why containers are so lightweight; it's the same idea, the guest OS is replicated three times on the VM side. One more thing about layers: say you pull a Docker image which is Ubuntu 14.04, and then you pull another container, a bioinformatics container that was built on top of Ubuntu 14.04. Docker is not going to download Ubuntu 14.04 plus the changes from Docker Hub all over again; it sees that you already have Ubuntu 14.04 downloaded as an image and you want something based on it, so it downloads only the changes, the extra file system layers, from Docker Hub. If you don't have any Docker images downloaded yet, it first has to download Ubuntu plus the changes, but as you accumulate containers that are built on top of each other you save a lot of space and bandwidth, and pulling a new container from the internet gets faster because you already have most of it in the local cache.

Dockstore is a project, a research grant awarded to OICR, to build a Docker registry for bioinformatics. The Docker company provides a registry, Docker Hub, which is like a web application where you can search for Docker containers. There are thousands of them, but they are general purpose, everything from Apache to nginx; pretty much any application you can think of has a Docker container. Some of them may be malicious, some are uploaded by the companies or the software developers who maintain the projects, but as I said, they are general purpose. Dockstore is a web interface where bioinformaticians and researchers can register Docker containers that contain bioinformatics tools and workflows, and collaborate on them; it's basically a custom-made Docker registry for bioinformatics, still actively developed, and later, in module five, Solomon and Christina are going to talk more about it. Currently there are a number of bioinformatics tools registered, 27 last time I checked. In the lab module we'll use one of these Docker containers to see how it is useful, and as you get more familiar with the technology you'll probably be able to create your own Docker container with a tool that you write, register it with Dockstore, and tell another user to pull your container from there to try it out.
Another important thing about Dockstore is that it carries descriptions of how to run the containers using the Common Workflow Language, which is a standard for describing the inputs and outputs of cancer genomics workflows. In plain Docker, whoever created the container usually just gives an example of how to run it, but it doesn't tell you in a standard way what the input and the output are. With Dockstore there is a client you can use to generate a parameter file where you describe this, and then you execute the Docker container with that file, and it knows where to get the data, what to do with it, and where to put the results, all in a standard way. I think this was the last slide for the lecture part. I know I'm speaking pretty fast, so if you have any questions please feel free to interrupt me.

You mentioned Django; is that the part of Python that allows you to build interfaces? Yes, Django, as far as I know, is a web development framework written in Python. I'm not a web developer, I'm an infrastructure guy and a cloud architect, but my understanding is that because it's a newer, higher-level framework it's easier to learn than other frameworks. Of course there is a learning curve, but it's not compiled; all the files you have in Django are plain text, so it's much easier to edit them, see what's inside, make a change, and just refresh the page to see the result, compared to Java or another language where it might be more complicated to see immediately how it reacts to your changes. Yes, it's spelled with a DJ.

I can tell you more about the Collaboratory, because I know you had these questions in the previous lecture. Right now we have 3.2 petabytes of raw storage, because we store the data in Ceph, which is another open source project, and Ceph uses a replicated model. For every file uploaded to the Collaboratory, say a 100-gigabyte aligned BAM, the client does a parallel multi-part upload, which means it breaks the file into smaller one-gigabyte pieces and uploads them in parallel. The server side receives the pieces through a load balancer and a number of web application servers, and stores them on servers that have very large numbers of drives: 36 drives per server, in a 4U chassis (a U is about the height of a pizza box, so imagine four of those stacked). These are large servers with 36 drives; we started with 4-terabyte drives, then moved to 6, and now we use 8-terabyte drives, so we have about 200-something terabytes per server, and with eight servers in a rack, a fully populated rack of eight servers with 36 8-terabyte drives each holds about two petabytes.

For compute we use high-density chassis: 2U servers that contain four micro-servers each. The 2U chassis has two power supplies and fans, but inside there are four individual servers sharing the chassis; you can take them out one at a time, no problem. Each has its own motherboard, its own dual-socket CPUs, its own memory, network cards, and so on. At the front of the chassis there are 24 drives, and each of the four servers sees six of them. So in that small space you have four servers, each with two sockets. We now use 10-core CPUs, so two sockets of 10 cores gives 20 physical cores, and with hyper-threading enabled, which is
the feature in the Intel chipset that presents two logical cores per physical core to the operating system, the operating system sees 40 logical cores. Each node has 256 gigabytes of RAM, a 10-gigabit network interface, and the six drives each server has access to are 2-terabyte drives, so there are 12 terabytes of raw local storage per compute node. Of course we don't present the 12 terabytes directly to the operating system; we use RAID 10, which means striping plus mirroring, so only half of that space is actually visible to the operating system of the compute node, and after formatting and some overhead we have about 5.3 terabytes of local storage. In practice, when a researcher starts a virtual machine, an eight-core VM gets roughly a quarter or a fifth of the server, and correspondingly a quarter or a fifth of the memory and the disk. When you go to the lab you'll see that for one core you get something like 300 gigabytes of disk and eight gigabytes of RAM, which is a very good resource ratio. So if you want to do an alignment, you start an eight-core virtual machine and get about 1.3 terabytes of disk and 56 gigabytes of RAM, which lets you download, say, an unaligned BAM, have enough local space for it, and do the alignment, which will generate at least the same size again, so you basically need double the space. The largest BAM we had was 800 gigabytes, so if you have a very high coverage sample with very diverse mutations, you need a lot of local disk.

We designed it like that because we didn't want the failure of one compute node to affect other workloads. In the cloud, failure is unfortunately a given when you operate at scale; as I said, you will have failures. If one server in a hundred fails, then with 100 servers you expect about one failure, and with 500 servers you expect about five servers failing. The more you grow, the more failure you'll see; maybe you won't see it individually, but it's there. The cloud is a paradigm where Amazon, Google, nobody guarantees that there won't be failures; they guarantee that the control plane of the environment is always available, which means you can always start more VMs if some of them fail for some reason. It's the same design concept here: we are not spending money to make things unfailable, because we know that costs a lot of money, money that could instead go to growing capacity, and even if you have everything duplicated you will still have some failures, so it's cheaper to deal with failure than to try to avoid it entirely. That's basically how the Collaboratory is designed: it deals with very large files, you get very good network connectivity, lots of memory and CPU, but you have to get used to working in smaller failure domains. You start 20 virtual machines, each of them downloads the files it needs to local storage and works on them; if one of them dies, no problem, it doesn't affect the other 19, you start more and continue your work. If you work in a cluster-like environment where everything is critical and one failing part means you waste hours or days of work and have to start again, that's not a good model.
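Just as a quick sanity check, here is the arithmetic behind those numbers written out; the figures are approximate and only restate what was described above.

```python
# Back-of-the-envelope numbers for the Collaboratory hardware described above.
storage_per_server_tb = 36 * 8                     # 36 drives x 8 TB = 288 TB raw ("200-something" with mixed drive sizes)
storage_per_rack_pb = storage_per_server_tb * 8 / 1000   # 8 servers per rack ~= 2.3 PB raw ("about two petabytes")

logical_cores = 2 * 10 * 2                         # 2 sockets x 10 cores x hyper-threading = 40 logical cores
local_raw_tb = 6 * 2                               # 6 drives x 2 TB = 12 TB raw per compute node
local_after_raid_tb = local_raw_tb / 2             # RAID 10 mirroring halves it to ~6 TB (~5.3 TB after formatting)

print(storage_per_rack_pb, logical_cores, local_after_raid_tb)   # 2.304 40 6.0
```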