So, this module is called Sharing and Scaling on VMs. Yesterday we learned how to start a virtual machine, install the Docker engine, pull a Docker container with some bioinformatics tools, download data, and run a short analysis on that data. Today we are going to learn how to scale this whole process to multiple samples, and look at different options for being more productive with your time and the resources you have available. We'll focus mostly on freezing, or snapshotting, a virtual machine that you have customized; on sharing that snapshot with other cloud projects in the same environment, that is, not with colleagues who already see your VMs, but with other projects that have accounts in the Collaboratory (and if you do this kind of work in another OpenStack-based cloud, the methodology is the same); and on launching a virtual machine from a snapshot that was shared with you or that you took yourself. So you start a virtual machine, customize it, do some work, take a snapshot, and two months later you can start again from that same point in time: the work is saved, and it's easier to continue your research. And last, we'll cover how to scale your VM fleet beyond one machine and how to be more productive.

Let's discuss what a snapshot is and some of the things to consider when taking one. When you take a snapshot, you basically capture the state of the virtual machine's disk at that moment, and that image is then uploaded to a central repository. In cloud environments there is a central repository where the base images are stored. When you start a virtual machine, a scheduler decides, based on the resources available in the physical environment, where your virtual machine is going to run.
If you have 30 physical servers, each with 256 GB of RAM and 40 CPUs, and you want a VM with eight cores, the cloud scheduler looks at the compute nodes, checks how many resources each has available, filters them, sorts them, and decides: OK, I'm going to start this VM on compute node 14. Compute node 14 may not have a local copy of the image you want, either because nothing was ever scheduled there from that image, or because cached images that fall out of use get deleted after a while. So it has to download the image from the central repository before it can start the virtual machine. Knowing this helps you decide how much data to put on your instance before you snapshot it: the more data you store there, the longer it will take to start instances from the snapshot later on. You may also be charged for larger snapshots, because the cloud has to store them, and more space means more hard drives, which cost more. If you do this exercise on Amazon, for example, Amazon charges on the order of 10 cents per gigabyte per month, so if you snapshot a VM that has 300 GB of data on it and keep the snapshot for three months, it will cost you around $90. If you do some planning before you take the snapshot and say, I don't actually need this data in here, I'll keep the data separate from the binaries and from my application, then your snapshot will be smaller, your virtual machines will start faster from it, and it will cost you less. So these are important considerations.
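As a sanity check, the arithmetic behind that estimate (using the example's figures, not real pricing) is simple:

```shell
# Snapshot storage cost at the example rate of $0.10 per GB-month.
# The size and duration are the figures from the example above.
size_gb=300
months=3
rate_cents_per_gb_month=10

cost_dollars=$(( size_gb * months * rate_cents_per_gb_month / 100 ))
echo "Storing a ${size_gb} GB snapshot for ${months} months costs about \$${cost_dollars}"
```

Trimming the image down to, say, 20 GB before snapshotting drops the same calculation to about $6 over three months, which is why keeping data off the image pays for itself.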
Another important thing: remember I told you yesterday about creating volumes that let you expand your disk storage beyond what the VM provides. Say you start an instance, create a volume, and attach it (you can try this in the lab if you have time), and then you take a snapshot. The volume is not part of the snapshot. The volume is something attached to the instance, and when you capture the instance, you don't capture what is attached to it externally. It's good to know this, because if you put important data on the volume, take a snapshot, and expect that data to be there the next time you start an instance from the snapshot, you'll be surprised that it isn't. So anything you want to be in the snapshot, including any applications you install, should live on the root disk, as is normal in a Linux distribution.

Now let's talk about some considerations before hitting the snapshot button; some, like size, I already mentioned, but there are others that matter. It's important to do some cleaning up, and to know how the system works and the impact of your decisions. First, clean up the public key that exists on the instance, especially if you want to share the image with somebody else. Otherwise it's like giving somebody access to your house for the weekend but also keeping your own key; maybe not the best analogy, but the idea is that if you don't clean up your authorized keys, then when they start instances from your image, their public key gets appended to the authorized_keys file, so they have access with their private key, but you still have access too, unless they notice your key in there and remove it. So be polite: you know you'll be sharing it, so don't leave your access in there. The way to do that is to remove the authorized_keys file
where your public key is stored. It's also important to clean up any confidential data you might have saved on the instance. Yesterday, for example, we set up an application properties file that the storage client uses to download data; if you are DACO-approved, that file contains your access token, and the token is confidential, so it's very important not to leave that file in the image. Or you might download data from GNOS, another protected data repository that uses the same token-based authentication, or from Amazon S3 with a file containing your credentials. Clean all of that up. Nothing confidential should be in the snapshot, not only because you might share it later with somebody else, but also because you'll forget you put your credentials in there; better not to leave them in the first place.

It's also good practice to clean up your bash history. If somebody starts an instance from a snapshot you shared and looks at the bash history, they can see the commands you ran, and some commands take credentials as part of the command line. So you think you deleted your credentials and you're safe, but they just run `history` and say: look, I can see the password, or the token, that whoever gave me this used. Clean up your bash history by removing the `.bash_history` file in your home directory, and also run `history -c`, which clears the current bash session's history. The history is written to that file when you log out, so if you run three commands and look in the history file, you won't see them yet; they are written from memory to the file at logout. That's why you should both remove the file and clear the in-memory history before logging out.

Also, keep the image size small. As I said, when you save data on the image, the image's disk balloons.
We use the qcow2 file format for images, so if a compute node has five Ubuntu instances running, there is only one base image for Ubuntu 16.04, and the five VMs have virtual disks that use it as a read-only backing file; whatever changes each user makes to their disk is captured in a separate file that keeps growing. So if five instances are scheduled on a compute node, that node has one shared Ubuntu base disk and five smaller overlay files. If you download data from the internet, the small file that captures the changes to your virtual disk starts growing; it balloons. And even if you later delete those files, it doesn't shrink back: it stays as large as it ever grew. So you might start an instance, download data, do an analysis, and be happy with it, but that doesn't mean you should snapshot that instance. You could start a second instance and install just your application, which you now know works, without downloading the data, because even if you delete the data, the snapshot will be larger than it would have been if you had only installed the application in the first place.

Very important: plan your image usage. Think about what you're going to do. I have this instance, I customized it, I cleaned up my bash history, I removed my credentials, everything is good; how am I going to use it? Am I going to start one instance and repeat the process with that one, or am I going to start 50 at a time? You'll probably need some data in addition to the binaries and the application in the instance, so it's usually good to keep your reference data separate from the instance.
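Putting the earlier cleanup advice together, a pre-snapshot pass might look like the short script below; the properties-file path is just an example of where credentials can hide, not a fixed location.

```shell
#!/usr/bin/env bash
# Pre-snapshot cleanup sketch: run as your login user just before snapshotting.

rm -f ~/.ssh/authorized_keys   # drop your public key so you don't keep access to shared images
rm -f ~/.bash_history          # wipe saved shell history (command lines can contain tokens)
history -c                     # clear the current session's in-memory history...
unset HISTFILE                 # ...and stop bash from writing it back out at logout

# Remove credential files you created; the path below is only an example.
rm -f ~/application.properties
```

Run it as the last thing before you click the snapshot button, since any command you type afterwards goes back into the session history.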
You can start an instance in your cloud environment just to serve the reference data: install a web server or an FTP server, download the data to it, and as your instances start, they pull the data locally, within the same cloud, very fast, instead of going to the 1000 Genomes FTP server or somewhere else on the internet where you'd be sharing access with users all over the world. You'll have dedicated reference data locally in your environment without having to replicate it 50 times; it's there whenever an instance needs it. One way to use it is to have each instance execute a script at boot that downloads the data from your server, applies security updates, and does other preconfiguration steps. Proper planning will, as I said, save you time later: your instances will start faster, and you'll spend more time doing analysis instead of waiting for them to be ready. There is also a cost component, especially in public clouds, where everything is charged.

This is a snapshot; we'll go through taking one in the lab. The idea is that it's pretty simple, just one step: you click a button and give it a name. The name should be descriptive so that later you know what it is; you can say "snapshot before installing X" or "snapshot with fully deployed workflow", maybe with a data version. It's up to you how to organize your images. If the snapshot is large, it also takes longer to upload to the central repository. Your VM runs on compute node 14, as I said, and when you take the snapshot, however many gigabytes it is, it has to be uploaded to the central repository. If you had a lot of data in there, even if you deleted it, it will take a while for the entire ballooned root disk to upload. In the dashboard web UI you'll see multiple stages for the operation.
It goes through snapshotting, then queuing, then saving. While the snapshot is being taken, the VM is suspended, which means you lose access to it over SSH. If the snapshot doesn't finish fast enough, your SSH session will time out and be terminated, so you'll have to initiate a new one. And if you removed your SSH key, as recommended, the VM won't recognize you anymore, because it no longer has your public key, so you'll be locked out. But no worries: the snapshot is there, and you can start a new virtual machine from it. You might as well terminate the old instance after the snapshot finishes, because it's unusable as it is.

Sharing the new image with another tenant (project) is a very useful use case for snapshots. Not only does it let you start in the future from the same point in time, it also lets you take a virtual machine, customize it, make sure it works, and then share it with another researcher who is not in the same OpenStack project as you, but who you know has an account, say, in the Collaboratory. So you are at McGill and you collaborate with somebody at UBC, and they tell you: yes, we have a Collaboratory account; can we see the image you created, with your software, that you're happy with? You take a snapshot, and I'll show you in the lab how to share that image with them. They will see the image in their dashboard and will be able to start instances based on your snapshot. This snapshot-and-share methodology was used mostly before Docker containers came around; with Docker it's easier to share a container than a full VM snapshot. But there are use cases for both.
Sometimes, when you containerize a complex application with multiple dependencies, say a database plus a web server plus several other pieces, you end up with what's called a fat container: a container that stores a lot of things and is still big. For various reasons, you might instead choose to customize a VM and snapshot it. Also, the image repository lets you download your image out of the cloud. Say McGill had its own OpenStack cloud but you did your work in the Collaboratory: you can download the image or snapshot you created, and if your admins at McGill allow it, upload it there. Or you can go to a cloud provider in Europe that runs OpenStack and upload it there. It basically lets you move images between OpenStack clouds, and sometimes even between cloud types: the image you download is a KVM image, but it can be converted to other hypervisors' formats.

There's an example on the slide. You have tenant A with two users, John and Mary; like many people here, they are in the same project, or tenant, so they see each other's instances. It's easy for Joel to share an image with Veronica because they see each other's images: take a snapshot and say, hey, can you please start a virtual machine from that snapshot I took. But what if Joel wants to share the image with George, who is in another project? Unfortunately, this functionality isn't exposed in the dashboard yet, so you can't just click "share this with...". You have to install the Glance client; Glance is the OpenStack project in charge of the image repository. It's pretty simple to do, and we'll do it in the lab: you install the client, have your credentials ready, and you need the other project's ID and your snapshot's ID.
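With the Glance client installed and your OpenStack credentials sourced, the sharing step is a command outline like the following; both IDs are placeholders you substitute, and exact subcommand names can differ between client versions, so check `glance help` on your installation.

```shell
# Owner's side: add the other project as a member of the image.
glance member-create <image-id> <other-project-id>

# Receiver's side: accept the share so the image appears in their dashboard.
glance member-update <image-id> <other-project-id> accepted
```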
So you need to know what to share and whom to share it with. Then, in my dashboard, under the images shared with me, I'll see the image that George shared. It's pretty straightforward; once you've done it once, it's not complicated. Then I go to the images, find the snapshot, and launch from there.

Another important thing: when you start the VM that you're going to snapshot, start with the smallest flavor that lets you install your application. Why? Because if you start with, say, a C1 medium and take a snapshot, and then I try to start an instance from that snapshot with a smaller flavor, like C1 micro, it won't fit. I'll get the message you may be able to see on the slide, saying the flavor's disk is smaller than the image requires. So always start with the smallest flavor that gives you the space you need to customize your application and make your changes, and snapshot that instance.

Now let's look at a scale-out planning exercise. We'll look at a researcher who ran an analysis on one or two samples and is happy with the results, but who needs to apply the same methodology to 100 samples to get meaningful results. The example says: you run, let's say, a variant-calling workflow that produces a VCF, and you have to do it on 100 samples, but it takes 24 hours on a VM with 4 cores and 16 GB of RAM; those are your average run times. For this project you only received enough money to run, at the provider's current prices, 100 cores for 72 hours. Say it costs 10 cents per core per hour; that's your budget, so you calculate: if I need more than 100 cores for 72 hours, I'm going to run out of money. You also have a quota. You might say: I'll spend all the money on the first day and finish right away. It doesn't work like that, because you also have a quota, at least in smaller environments.
On Amazon, you can probably say, I have $10,000 and I want to spend it all today, and they have the capacity for that. But even there, if you say, I have $10 million and I want to spend it all today, give me all the CPU, they'll say: well, we don't have $10 million worth of spare CPU capacity just for you. So there are quotas everywhere. For this exercise, we assume a quota of 100 cores, so you cannot use more than that at a time. Also, your samples have different sizes. Most are around 180 GB, an average size for an aligned BAM, let's say. The problem is that 15 of the 100 have higher coverage or more complex mutations and take 210 GB of disk space, so you won't be able to use the same flavor for all the VMs, and you have to do some math. These are the flavors your cloud provider offers: we're looking at C1 medium, which has four cores and enough disk for most of your samples, and for the last 15 samples you probably need C1 jumbo, which has 8 cores and gives you 320 GB of disk. You don't want to go with too large an instance: C1 extra-large has 400 GB and 12 cores, costs more, and you don't need the extra space. What you can do, and I have this set up and will show you if there's time after you complete your lab, is run a simple Python web server that, when you make an HTTP request on a specific path like /sample/small, gives you back the UUID of an average-sized sample. It's like a scheduler: it lets your VMs receive the sample ID they have to work on without you having to sit there and hand out the work, which means you can script everything. If a VM finishes at 2 a.m. and you're not there to give it more work, it takes more by itself. It allows hands-off automation.
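The per-VM side of that pattern can be sketched as a small shell loop. The scheduler address and endpoint paths here are made up for illustration, and `download_sample`, `run_workflow`, and `upload_results` are placeholders for whatever tools you actually use (a storage client, your pipeline wrapper, an object-store upload):

```shell
# Hands-off worker loop: each VM pulls work until the scheduler runs out of samples.
SCHEDULER=${SCHEDULER:-http://192.168.100.10:8000}   # hypothetical internal scheduler

get_sample() { curl -sf "$SCHEDULER/sample/small"; }      # GET returns one sample UUID (or nothing)
mark_done()  { curl -sf -X POST "$SCHEDULER/done/$1"; }   # POST records the sample as analyzed

worker() {
  local sample
  while sample=$(get_sample) && [ -n "$sample" ]; do
    download_sample "$sample"    # fetch the BAM for this UUID
    run_workflow    "$sample"    # the actual analysis
    upload_results  "$sample"    # save output off the VM before cleaning up
    mark_done       "$sample"
    rm -rf "/data/${sample:?}"*  # free local disk for the next sample
  done
}
# Call worker from cloud-init / rc.local once the placeholder functions exist.
```

The loop exits on its own when the scheduler has no more sample IDs to hand out, which is what lets a VM finish at 2 a.m. and keep taking work unattended.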
So what you do is start 49 C1 small VMs; the slide says four cores, but no, sorry, it's two cores. 49 VMs use 98 cores, so you've almost used your quota for the day. Then you use something like Ansible, or, as we'll see, a cloud-init script that tells each VM what to do. As the VM starts, it installs its packages, or preferably it's a snapshot that already has your application installed. Then it visits the Python server, does a GET request on that URL, and receives back a sample ID. With the sample ID, it downloads the sample, runs your workflow, uploads the result somewhere, and then does a POST request saying, hey, I'm done, so the scheduler records that the sample was analyzed. After 24 hours you have 49 samples completed, and on the first day you used fewer than 100 cores, which was your quota. On the second day, you start 34 more C1 small instances, which use 68 cores, plus four C1 jumbo instances, which give you enough disk to analyze the large samples. At the end of day two you have 83 average samples analyzed and 4 of the large ones; again you used your whole quota, 32 cores plus 68 is exactly 100. On day three you only have a few samples left, so you repeat the same process with just two C1 small instances to finish the small samples, and you start 11 jumbo-flavored instances to complete the analysis. At the end of day three you have finished all your samples, and you still have some budget left in case you have to rerun any that failed for whatever reason. Your goal is achieved, but you had to go through this planning exercise,
and you need the scheduling tool that hands out UUIDs automatically. Of course, you could also manually start 40 VMs and, based on each instance's type, give it the UUIDs of the small samples, but that would be more time-consuming and error-prone. So it's important to pick the right flavor for the job. You could say, I'll just go with the largest one, but the largest one burns through your CPU quota faster, costs more, and you don't need it for the small samples.

Another important thing: start small and grow your VM fleet slowly, especially in the phase where you're still refining your methodology or procedure. If you go straight to starting 48 instances, and there's a problem in your instance and it doesn't work, you have to shut everything down and start again. If you do this on Amazon, or on an OpenStack cloud where you're charged by the hour, you'll find that in three hours of testing you've wasted a full day of budget. So start small, and as you gain confidence in the process, ramp up. Mistakes, as I said, can be costly: if you start an instance and delete it 10 times in an hour, you're billed for 10 instance-hours, and you can do this more than 10 times in an hour; you can do it every minute, and the cloud provider will happily charge you every time you start an instance.

Monitoring is also something that is very, very important, and in my opinion it shouldn't be done only after you deploy your application and workflows. Think about monitoring from the beginning. You have to monitor the process and what the instances are doing, and be ready to rerun analyses that get stuck. They can get stuck for many reasons; maybe you download data from an external site that isn't always available. That's what happened with PCAWG: you start instances, they download from Europe, and the connectivity is not always good.
Maybe the European site isn't always available, and an instance gets stuck in the download for six hours and just sits there instead of finishing the download and continuing with the analysis. If you leave it like that and don't look at it for a week, you've wasted a week of time and a week of CPU allocated to a VM that does nothing. So it's very important to have monitoring, and if you don't plan it at the beginning, it's harder to add with the VMs already running. It's also important to think about which metrics will be useful to track. Maybe you set up your workflow so that after it finishes each chromosome it updates a status file, "chromosome one done", or your application has checkpoints that record where it got to. Those metrics should probably be made available outside the VM, so you don't have to log in to each VM to collect them: the VMs can push the metrics to an external web server, and you just go there and see them aggregated. Especially with large-scale analyses, where you know the process will be long, rushing into it will actually cost you more time as you go: I didn't think I'd need this, I didn't consider that. Planning everything is very important for a longer project.

My recommendation, from experience with the ICGC PCAWG project, is to use loosely coupled monitoring systems that scale well. Don't use pull-based monitoring where a server has to contact each client to fetch data, because you don't know the clients' IP addresses; the clients are dynamic, they start and get terminated, and if you have to go to the server and make configuration changes every time a new client is provisioned, that's a lot of work for you.
What you want is a newer generation of monitoring system, like Sensu (there are others). The clients have a small agent installed, but the agent is only configured with the server's address; the clients check RabbitMQ, a message-queue application, to see what they have to do, by themselves, without you telling them. As they start, they just go there and see: I have to run this script locally to check free memory and report back. The Sensu server sees new clients showing up in its dashboard, and as clients are terminated they stop reporting, so you can clean up the idle ones whose VMs are gone. There are other solutions based on the same design pattern. They're called loosely coupled, and they're push-based, not pull-based: the clients push data to the server, not the other way around. It's like coming into class and checking yourself in, instead of me asking everybody whether they checked in.

Another good recommendation is to minimize external dependencies. Your workflow needs reference data; don't rely on an external reference data server, which may be a bottleneck or may be unavailable. You don't want your analyses slowed down or interrupted by external causes, and it's easy to set up your own server with your dependencies. Likewise, if your application needs a package from a repository at install time, that repository might not be available all the time. While we were doing the PCAWG analysis at large scale, sometimes the Java repositories were unavailable, and that failed the provisioning process: we had VMs up and running, but we couldn't continue installing the application because an Ubuntu mirror or a Java repository was down. Going back and restarting the configuration is complicated, and you don't want it. It's probably better to start one instance and configure it once.
It's working, everything is there: make a snapshot and use that one.

Failure domains. As I said, cloud environments are different from corporate IT, where a lot of money is spent making hardware very reliable. It's a different design pattern: at scale, failure is a fact of life; it's going to happen. And as you scale, spending money trying to make everything fail-proof, knowing you'll still have failures, means you just spend a lot of money trying to avoid the unavoidable. It's not a good approach. It's better to have your workflows run locally and not depend on each other; then if one of them fails, you just reschedule the ones that failed, nothing else is affected, and you can still make progress. So smaller failure domains are the way to go in environments where you can't necessarily control the physical infrastructure. You are not the only tenant; you share those resources with other projects, other companies, maybe other researchers, who sometimes run very intensive workloads that can affect yours.

And I think I've finished my slides. Any questions?

[Audience question: after finishing the analysis and logging out of all the instances, what happens to the data?]

You can create a volume and save your data on it, and that volume will survive your instance's termination. But that doesn't really scale: if you have 50 instances, you have to create 50 volumes, one for each, and you have 50 volumes left over after you terminate the 50 instances. A better way, and I'll do a demo in the lab, is for the instance, when it finishes, to copy the results of the analysis to an S3 bucket, that is, to object storage, which is then available over the internet from anywhere.
I will actually show an example in the lab where I start 10 instances; each picks up its own UUID, runs BAM stats, and then uploads the data to its own bucket. You'll see the buckets being created and populated with the results, and then I terminate the instances, but the results are saved. Another good recommendation: save the results of a workflow as soon as the workflow finishes. There is no guarantee the instance will still be there later, so why keep results locally? Don't run five workflows on an instance and say, I'll upload when all five are done. Upload each one as soon as it finishes and put it in a safe place; otherwise, if something happens, you'll lose more than one workflow's worth of in-progress results.
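An upload-as-you-go helper might look like this; the bucket name and results directory are placeholders, and it assumes an S3-compatible CLI (here the `aws` client) is already configured on the VM:

```shell
# Push one finished result to object storage, then free the local copy.
RESULTS_DIR=${RESULTS_DIR:-/data}

upload_result() {
  local sample=$1
  # Bucket name is a placeholder; one prefix per sample keeps results easy to browse.
  aws s3 cp "$RESULTS_DIR/${sample}.results.tgz" \
            "s3://my-results-bucket/${sample}/" || return 1
  rm -f "$RESULTS_DIR/${sample}.results.tgz"   # delete locally only after a successful upload
}
```

Calling this right after each workflow finishes, rather than batching five uploads at the end, limits how much a single instance failure can cost you.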