Hello everyone, my name is Jordan Haskell. Thank you for coming to this workshop; I hope you'll enjoy it. The topic is working with big cancer data in the cloud. The learning objectives of the module: in the first 45 minutes I'm going to go through a quick evolution of cloud technology, and then I'll present the difference between classic HPC, which most of you might be familiar with from your research organizations, and the newer cloud technology. I'm going to discuss OpenStack, what it is and how it is used in research. Then I will cover the Cancer Genome Collaboratory project goals, its history, progress, capacity, and so on. And then I will discuss the difference between Docker containers and virtual machines, and also the difference between Dockstore and Docker Hub. So let's start, because we have quite a bit to cover. If you have any questions during the presentation, raise your hand; if the answer is short I will try to answer right away. If not, we'll keep the answers for the end of the lecture.

So, computing evolution. A few years ago it used to take weeks or months for a developer to have a new server ready for a project. The server had to be ordered, delivered, and installed; it was a very long wait before a developer could actually start using it, developing, and being productive. Then virtualization came along, where on a physical server with a lot of cores, virtual machines were provisioned and better utilization was achieved. Instead of having one web server that uses maybe 10% of the CPU capacity of the physical server, now you have 10 virtual machines running web servers, each of them using 10%, and the total utilization of the physical CPU is close to 100%. So you save space, you save money, a lot of benefits. From a few months of wait time it went down to a few weeks, maybe a few days, because the IT team still had to provision the virtual machine, the networking team had to configure the networking for it, the security team had to bless the server, and so on. So the time went down, but it was still days if not weeks.

Then cloud computing came, where developers can now self-serve through a portal: provision a virtual machine, provision a network, provision an IP address, do everything themselves. It takes barely minutes, and they pay only for what they use. If they don't need a virtual machine at the end of their work, they stop it or terminate it and there are no more charges. So it is a very powerful tool for developers and any compute users: researchers, bioinformaticians, anyone who requires compute capacity. Then Docker and other container technologies came along, and the time to start a usable unit of compute capacity went down to seconds. Starting a virtual machine maybe takes two or three minutes; starting a container takes five seconds or even less, depending. And it can be terminated, and the charges stop, even faster.

The future: a new technology was introduced by AWS, which is the market leader in cloud computing, called AWS Lambda, or Lambda functions. You do not start a VM, you don't start a container; you just upload the code that has to be executed, and the cloud computing platform executes the code and charges you just for the number of function invocations and the time it took to execute them. Instead of having a web server that runs 24-7 waiting for somebody to visit a web page, when you visit the page, that's when the web server in Amazon basically starts, responds to your request, and then terminates. So you just pay for the time it took Amazon to respond to your request and deliver your web page, maybe fractions of a second. But you have to write your code as Lambda functions, so it's a new computing paradigm and there's going to be a learning curve; for some use cases, though, it's going to be very efficient to do it like that.
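To make that Lambda model a bit more concrete, here is a minimal sketch of a handler in Python. This is an illustrative example, not code from the Collaboratory or from AWS documentation; the shape of the incoming event (the "name" field) is an assumption for the example.

```python
import json

# A minimal AWS Lambda handler: the platform calls this function on each request
# and you are billed only for the time it runs. The "name" field is an assumed
# part of the incoming event, purely for illustration.
def lambda_handler(event, context):
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }
```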
Now, HPC, or high-performance computing. Most research organizations have in-house HPC clusters, Sun Grid Engine or other types of HPC clusters, that are shared, meaning that all the users have access to the same servers, put jobs in a queue, and the jobs are run by the compute nodes in the cluster. They are fast because they are purpose-built, and sometimes they are expensive because they use specialized hardware. They have a fixed configuration, meaning that you cannot say "I need servers that are bigger or smaller or configured differently." The IT team deploys it a certain way and that's how it stays; if they deploy it with Debian 7, it is Debian 7. You don't have root access, so you cannot install a different GCC library, you cannot change anything. You are a user in their environment; they manage it, they have to secure it, and they are responsible for it. So you are pretty much limited in what you can change. Because Docker containers share the kernel of the host, they usually don't let you use container technologies on HPC clusters. And you cannot access your servers from outside: if you want to run a web server on an HPC cluster, or something else that needs to be accessed from outside your organization, you won't be able to, because the nodes have private IP addresses, they sit behind firewalls, and the admins are not going to let you connect from outside to these workloads. Usually there is also no cost recovery, in the sense of everybody paying for exactly what they use. Of course, there are internal mechanisms for chargeback: a researcher might have a quota in the HPC cluster, and maybe some of their research budget goes to internal IT, but it's not a clearly defined cost recovery where you pay for exactly what you use.

Cloud, on the other hand, is also shared, in the sense that all the servers are shared between different projects. It is flexible, in the sense that the cloud admin can change its configuration. Usually there is no shared storage: in order to have a cloud environment that is scalable, you cannot rely on anything shared, because anything shared becomes a bottleneck when everybody starts using it at the same time. At least in the case of the Collaboratory, we try not to have anything shared that could be a single point of failure or a bottleneck for everybody. You have root access as a user, meaning that in the virtual machines you start, you are root: you can install any packages you want, you can install the Docker engine or another container technology and run Docker containers inside your VMs, update the kernel, do whatever you want. You can control the firewall rules. You can assign a floating IP, a public IP basically, to your virtual machines, and that allows them to be accessible from anywhere. So if you collaborate with somebody from another country and they have to access your application, you can just assign a public IP, change the security group to allow access to your VM from China or wherever that collaborator is, and you give them access. If you want to leave it open, you can leave it open, but you are responsible for the security of the virtual machine and for following best practices.
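As a concrete illustration of that last point, here is a minimal sketch using the OpenStack Python SDK (we'll come back to OpenStack itself shortly) that creates a security group allowing SSH from a single collaborator's address range. The cloud name "collaboratory" and the IP range are assumptions for the example, not real configuration values.

```python
import openstack

# "collaboratory" is an assumed entry in your clouds.yaml; credentials could
# equally come from OS_* environment variables.
conn = openstack.connect(cloud="collaboratory")

# Allow SSH (port 22) only from a collaborator's network, not the whole internet.
sg = conn.network.create_security_group(
    name="collab-ssh", description="SSH access for partner lab"
)
conn.network.create_security_group_rule(
    security_group_id=sg.id,
    direction="ingress",
    protocol="tcp",
    port_range_min=22,
    port_range_max=22,
    remote_ip_prefix="203.0.113.0/24",  # example range, not a real collaborator
)
```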
You can also bring your own operating system. If the catalog of operating systems available in the cloud doesn't have the operating system of your choice, maybe you are used to Scientific Linux or some other Linux-based distro and we don't have it, you can upload your own image for that operating system. So you have great freedom and flexibility in how you use the environment. Also, the environment has small failure domains, meaning that if one compute node is impacted by a very heavy load, VMs running on other compute nodes are not impacted. If one compute node dies because of a kernel panic or a hardware failure, the impact is just on the few virtual machines running there; it's not a large global impact. Small failure domains. And there is cost recovery, meaning that you are charged exactly for what you use and for the time you use it. You don't have a quota that you pay for all the time whether you use it or not. But it also means that you have to be responsible in how you use the resources, because if you use them, you pay for them. If you leave virtual machines running for three days after you've finished working with them, that's going to cost you. So, different working conditions.

There are tens of cloud technology providers. Probably most of you are familiar with the largest players in the public cloud market: Amazon Web Services has around 40% of the market right now, I think; Google and Microsoft came late, but they are gaining market share as well. On the private cloud side there are a few options, CloudStack and others, but OpenStack, which started about seven years ago, is pretty much the default choice for deploying private cloud computing, and I'll show you later why. Some of the large OpenStack users on the commercial side are AT&T, Verizon, China Telecom, eBay, PayPal; pretty much all the large companies have deployed OpenStack-based private cloud environments, because you don't pay any licensing fees, you can see the source code (it's written in Python), you can extend it, and it provides great benefits to anybody who has the need and the buying power to deploy a cloud environment. It's not for a few servers; you have to reach a certain scale to decide to do this in-house, but at the end of the day there are clear benefits to doing that.

In terms of scientific OpenStack users, most of the academic research institutions have deployed OpenStack. CERN is probably the largest one; they have 100,000 CPU cores deployed with OpenStack. OICR has the Cancer Genome Collaboratory, where we have close to 3,000 cores, plus another internal-only OpenStack-based environment of 3,000 cores. DKFZ in Germany, EMBL-EBI in the UK, ETRI in Korea, and many others are using OpenStack. So the knowledge that you get here today, using OpenStack to do research, can be reused in other places where you'll find OpenStack. Plus, even if you don't know anything about cloud computing, if tomorrow you go to Amazon, put your credit card down and open an account, you'll see that the workflow for creating an instance is pretty much the same as in OpenStack: the steps required and the questions you have to answer. Of course, the user interface looks different and the terminology might be a little bit different, but the concepts are the same. So let's talk about cloud computing with OpenStack.
OpenStack is a free and open source software platform. It was started by NASA and Rackspace in 2010. NASA had a lot of images taken from space that they wanted to store, and they were working on a compute-as-a-service platform that they started to write in Python. Rackspace, which is a data center provider in the US, had a service for storing files as objects, but they wanted what NASA was building. So they decided to open up their source code and create a project that benefits both. All the other companies joined the OpenStack Foundation later, HP and Dell and IBM and everybody else, because they were trying to give their customers an option to build their own private clouds instead of just going to Amazon and Google. Amazon and Google are not buying hardware from Dell and HP and IBM; they build their own. So all the major IT companies decided to join the OpenStack Foundation and commit resources, in money and developer time, to enhance the project. This is how OpenStack became so powerful and so well represented in the industry.

OpenStack has a release cycle of six months, which means that it's changing very rapidly. It also means that the teams that support OpenStack have to update it quite often, and it's not easy to support because it's changing so fast. But from an IT point of view it's very cool, because you learn all the time, it's very interesting, and it keeps you challenged. Initially there were just a few core services covered by OpenStack: compute, where you start the virtual machines; networking, where you can define your own networks, IP addresses, and everything else; block storage, where you can create additional storage that you can attach to your virtual machines; and identity, for authentication and authorization. Do you have a question? Sorry, did you raise your hand? And now there are more than 19 projects that cover various things like database as a service, orchestration as a service, bare-metal provisioning, and so on. So the OpenStack framework keeps growing with new features and new projects.

This is one of the simpler diagrams of how it looks internally. You basically have users who can access OpenStack from the internet through a web UI, a portal, or through the API endpoints of the different services. So you can talk directly to the volume service to create a volume, or to the compute service to create virtual machines, or to the object storage to upload a file. You can also use all these services through the portal. Each service generally has the same internal structure: it has a RESTful API, so you can build other tools that talk to the API services; it keeps its state in a SQL database; and the internal connectivity between the components of a service is done through a messaging queue, RabbitMQ in this case.

This is a screenshot of the OpenStack dashboard that you will use a little bit later in the lab. It shows that it has a lot of functionality, although it's not as polished as some commercial solutions. This is the default implementation that comes with OpenStack when you install it. Companies that build private or public clouds on top of OpenStack, like Rackspace or DigitalOcean or others, change the dashboard to look less like the default implementation, but in the back end it uses the same API services. So you can change it if you want; in our case, we just left the default version.
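Because every service exposes a RESTful API, you can script against it instead of clicking through the dashboard. Here is a minimal sketch using the openstacksdk Python library; the cloud name "collaboratory" is an assumed clouds.yaml entry, not necessarily the configuration you will use in the lab.

```python
import openstack

# Credentials come from a clouds.yaml entry or OS_* environment variables;
# "collaboratory" is an illustrative cloud name.
conn = openstack.connect(cloud="collaboratory")

# Each OpenStack service has its own API endpoint; the SDK wraps them all.
for flavor in conn.compute.flavors():        # Nova, the compute service
    print("flavor:", flavor.name, flavor.vcpus, flavor.ram)

for network in conn.network.networks():      # Neutron, the networking service
    print("network:", network.name)

for volume in conn.block_storage.volumes():  # Cinder, the block storage service
    print("volume:", volume.name, volume.size)
```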
OpenStack Cinder allows you to create on-demand, persistent block storage that you can attach to a single instance: you create a volume and attach it. It is like an external hard disk. I have my laptop and I need more storage, so I plug in a hard disk and copy the data there; then I detach the hard disk and attach it to another laptop. If this laptop dies, the data that I saved is on the volume, so no problem: I can terminate this laptop, I still have the data, and I can attach it to a new VM. If I need more space, I can grow the volume. So it provides great flexibility in how you add additional block storage.

OpenStack Neutron is the project that covers networking as a service, and it allows me to create complex networking topologies. If I have an application in my physical environment that is, say, a three-tier application, with a front tier like a web server, a middle tier for processing, and then a database, and they have different IP addressing schemes between the three tiers, I can do the same in OpenStack: I can create three networks with different IP addresses and attach virtual machines to the different networks. I can do this by myself, without having to talk to a networking engineer, wait for them to process a request, et cetera. And I can automate all of this. So it provides a lot of flexibility for self-serving the needs of the users.

Some of the extended cloud functionality, as I said: software-defined networking, block storage, object storage. If you are familiar with Dropbox or other HTTP-based cloud storage solutions, OpenStack has the same concept, where you can upload and download files, over HTTPS in this case. And they say it's infinitely scalable; it's not that it's infinite, but it scales horizontally very well, and you can add or decrease capacity on the back end without having to buy very expensive hardware appliances from vendors like EMC.

Cloud-init is another great piece of functionality that was basically invented with the cloud. It's a package that is installed on most recent cloud Linux distributions. When you start a virtual machine from, let's say, a base image that is 300 megabytes, using a flavor with a 5-gigabyte disk, the file system of the VM is resized automatically, so you see the entire space that was allocated, 5 gigs. If you start a flavor with 20, it's going to be resized as well. It's a great feature for the user: they get a larger server without having to do anything to see the additional space. Also, when they start the virtual machine, the keys that they assign to it are injected into the virtual machine, allowing them to SSH in later on; basically zero-touch provisioning. You can also provide a file or a script when the virtual machine starts and tell it to run a number of steps without you having to do anything else: when the VM starts, it pulls down the script and executes it, so you can automate even more. We are going to do this in the lab section. There are many functions built into this cloud-init package, and if you want to read more, on that link you can find examples of how to use it.
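Here is a hedged sketch of what that looks like through openstacksdk: booting an instance with an SSH key and a cloud-init boot script. The image, flavor, network, and key names are assumptions for illustration; the lab will use its own values.

```python
import base64
import openstack

conn = openstack.connect(cloud="collaboratory")  # assumed clouds.yaml entry

# A script handed to cloud-init; it runs once, on the first boot of the VM.
boot_script = """#!/bin/bash
apt-get update
apt-get install -y docker.io
"""

server = conn.compute.create_server(
    name="worker-01",
    image_id=conn.compute.find_image("Ubuntu 16.04").id,
    flavor_id=conn.compute.find_flavor("m1.large").id,
    networks=[{"uuid": conn.network.find_network("private").id}],
    key_name="my-keypair",  # public key injected by cloud-init for SSH access
    user_data=base64.b64encode(boot_script.encode()).decode(),
)
server = conn.compute.wait_for_server(server)
print(server.name, server.status)
```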
I'm now going to discuss the Cancer Genome Collaboratory, which is the main subject of this workshop: the project goals, the history of the project, and the progress and capacity available. The Cancer Genome Collaboratory was built by OICR, funded by multiple Government of Canada grants. It was built specifically for cancer research, and especially for large-scale cancer research projects. We uploaded the data from ICGC, which is a lot of data, and it is protected data. So when you start using it, you don't have to download the data from somewhere else; you don't have to wait for it to be downloaded from London or from other data repositories across the world. It was built entirely using OpenStack, Ceph, and other open source projects. It was a greenfield deployment, meaning that we didn't have anything to start from, which made it harder to build. I also worked for a few months analyzing data for the PCAWG project at ICGC, so even though I don't have a biology background, I have a pure IT background, being a user and doing analysis helped me understand the needs of the analyses and how they stress the system, and we were able to tune the system and customize it for the needs of cancer research.

The project goal was to build 3,000 cores of compute capacity and 15 petabytes of raw storage. We also had to build a system for cost recovery, because the Government of Canada grants were only four years long, which means that to make the project sustainable we had to be able to do cost recovery, replace the failing parts, and, as hardware is no longer serviceable, replace it with newer hardware.

As you are probably aware, genomics workloads usually download large files. If it's raw sequence data, it has to be aligned; it takes four or five days for BWA-MEM to run on a single sample on an 8-core virtual machine, and the resulting data can be as big as the input data, so if you have a 150-gigabyte normal file, the aligned output is going to be a little bit bigger. If you are doing mutation calling, you need the normal and the tumor, or multiple tumors, and you end up with 5, 10, 15 gigabytes of results, so much less, but you still need the two aligned samples as input. So the virtual machines need to be large enough to accommodate the data that has to be downloaded. We recommend that workloads be independent: if, as I said, a compute server that has three or four VMs running has a kernel panic or a hardware failure, it doesn't impact an analysis that you run on 100 VMs, and you don't lose the entire workflow. You lose the three or four workflows that were running there, but the other 95 or 96 are not impacted; you can just reschedule and rerun those few. So, small failure domains. In the cloud it's good not to run anything very large where everything has to work perfectly and, if you lose a piece, you lose everything. And we recommend packaging your workflows as Docker containers for portability and reproducibility; we'll discuss Docker a little later.

Here's the ICGC data portal, which lets you select the data you want to work with, filter it, and create a manifest file that contains your selection; you can then feed it to your client and download all those files. It's a great tool for finding the data that's available, and we'll go through the DCC portal later in the lab as well. Cancer Genome Collaboratory developers also developed a custom object storage client.
This client is great because it uses the AWS S3 SDK to talk to the Collaboratory object storage, which is S3 compliant, and download the data, but it does so using authorized transfers. The researcher doesn't get access to the bucket where the files are; they only get a temporary URL that's good for 24 hours, good enough to download the data, and other files in the same bucket are not visible to that researcher. They cannot upload their own files to those buckets. They can upload their own files, using an S3 client, to their own buckets, but not to the protected buckets (a minimal upload sketch follows at the end of this part). And you can do parallel downloads: if you have an eight-core VM, you can start, say, seven threads and get better download throughput. If something happens and the download stops, you can resume it; if you are 50 gigabytes into a 150-gigabyte file download, you don't start from zero. It checks where it is and starts from there. Other features added by the developers include support for BAM slicing, so you can request just a slice from a specific chromosome. If you are only looking for, let's say, a particular slice of chromosome 20, you can specify to download just that part, and with a BAM viewer you can look at just that part. So you don't have to wait for very large files if you are only interested in a section of the BAM file. Also, you can mount an entire manifest of files into your local file system if you just want to browse the files; if you actually read the files, they will be downloaded in the background, but if you just want to list the files, look at their size, and so on, you can mount a number of files in a temporary directory as a FUSE file system.

In terms of cloud usage, in the last two years more than 22,000 instances have been started, with 1,800 in the last three months. The cloud is used for batch processing and for ephemeral workloads: you start the VM, run the data analysis for three days, and terminate it; you don't want to pay for more than what you need. That's why there are so many instances, but that's good, because it gives a good load to the entire system and it stresses the API services. We now have more than 80 users and 34 research labs across several continents: Asia, Europe, the US, Canada. We have more than 600 terabytes of data. In the DCC portal you can see just 547 terabytes, which is the data indexed so far, but we also imported more than 120 terabytes from EGA, and that is currently being curated and indexed, so it's going to show up in the DCC portal early in 2018. As I said, we have 2,592 CPUs and 18 terabytes of RAM, which is about 7 gigabytes per core, so when you start a VM, that's pretty much what you get per core; if you want eight cores, you get 56 gigabytes of RAM, which is a lot, and that lets you tune your workflows to use in-memory analysis and make them more performant. We have 285 terabytes of local disk on the compute nodes, where the VMs have their local disks, and 7.6 petabytes of object storage, and we will upload even more data into the object storage, which will allow you to do more analysis locally in the Collaboratory. You download this data from the object storage, which is in the same physical environment, connected to the same networking switches, and is tuned for very fast access. Everything is connected with 10-gigabit networking. So it's a modern cloud environment built with new hardware, fast CPUs, and very good RAM-to-CPU ratios.
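Coming back to the object storage for a moment: because it is S3 compatible, your own buckets can be used with standard S3 tooling. Here is a hedged sketch using boto3; the endpoint URL, credentials, bucket name, and file name are all illustrative assumptions, not the Collaboratory's real values.

```python
import boto3

# Point the standard AWS S3 client at an S3-compatible endpoint.
s3 = boto3.client(
    "s3",
    endpoint_url="https://object.example.org",   # assumed S3-compatible endpoint
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)

# Create your own bucket and upload a result file to it.
s3.create_bucket(Bucket="my-lab-results")
s3.upload_file("variants.vcf.gz", "my-lab-results", "variants.vcf.gz")

# A pre-signed URL good for 24 hours, similar in spirit to the authorized,
# time-limited transfers described above.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-lab-results", "Key": "variants.vcf.gz"},
    ExpiresIn=24 * 3600,
)
print(url)
```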
We also developed our own usage reporting application. The principal investigators who open the projects have access to this application, where they can see usage per day, per user, per week, or per month. They can see the usage distribution between compute, volumes, and images, so they can see exactly who in their lab is using the most, and whether they have a student who forgot to shut down a VM that's costing a lot of money per day. So it's very useful for principal investigators or their office admins to track costs. And we send cost recovery invoices through FreshBooks, an invoicing service that allows researchers to pay the invoices by credit card very easily, a bit like PayPal.

Docker is, as I said, an open source project for automated deployment of Linux applications. Instead of creating a virtual machine that runs on a physical server and has a lot of overhead, you can wrap your entire application in a Docker container, and it has everything: it has the code that you put there, and it has the system binaries and libraries, so you can run a Docker container that was built from a Red Hat or CentOS Docker image on top of an Ubuntu VM. So if your application depends on a specific version of, let's say, CentOS, but you want to run it on a newer kernel that only Ubuntu has, that's the solution. And it basically guarantees portability and reproducibility: you can give somebody your application in a Docker container, they can download the container and run it inside their environment, and it's going to run the same as in your environment, as opposed to giving them your application and having it run differently there because of small differences in their environment, different libraries, different settings. As I said, it's a shipping container system for code. There are other options for running containers, but Docker is so far the choice of most developers, so we'll see whether other container technologies catch on in the future. It's not a replacement for virtual machines; it's an addition. There are places where you would want to use virtual machines and other places where you would want to use Docker containers, so it's good to know about both, in order to decide which one fits the purpose better in each case.

Docker containers are much smaller than full VMs. It's easier to share a Docker container than to share a full virtual machine, and because they are smaller, you can run more of them on a physical server: where you can run maybe 10 virtual machines, you can run maybe 100 containers, because they share the kernel and have less overhead than virtual machines. They boot faster: a VM might take, I don't know, a minute to boot, probably less, it depends on the environment, while a Docker container takes only seconds to start. And because they run closer to the hardware, they get better performance. The problem is that, because they share the kernel, it's easier to escape a Docker container and attack the host operating system or another Docker container. So usually they run inside virtual machines, where the host has better segregation between virtual machines. If you run all your own containers inside one virtual machine, then it's only your own Docker containers that could escape the jail and attack each other. In a shared environment, you usually cannot run Docker containers directly on a physical server, so your Docker container cannot run next to somebody else's Docker container.
But your Docker container or containers can run in a virtual machine on the same physical server as other virtual machines that have their own Docker containers. And, the same as with virtual machines, you wouldn't run Docker containers from untrusted sources. Docker containers are built in layers, so you don't know exactly what is in there; you don't just download a Docker container from the internet and execute it, because what's inside might do pretty bad things. You have to trust the source of what's in the Docker container that you are going to execute. It's the same with a virtual machine: you don't just download a virtual machine and run it, and you wouldn't do it with an executable program downloaded from the internet either. You might scan an executable with antivirus; for Docker containers, there is work in progress to actually scan them, but you still rely on somebody else's due diligence to confirm that there is nothing malicious in the Docker container.

Here's an example of how they are different. If you look here, there is a server layer, which is the physical server. Then there is the operating system installed on the server, usually a Linux distribution. Then there is a hypervisor, usually KVM, which is an application installed on top of Ubuntu or CentOS. And then you have the virtual machines. The virtual machines have their own operating system, which is usually different from the host's, and they have their own binaries and libraries, and then the applications. If you have three VMs, where one runs application A, another also runs application A, and a third runs application B, you basically have a lot of things that are repeated, and they take space, and it's just overhead. With containers, you have the physical server, you have the operating system installed, you have the Docker engine, and then the applications; if you have three instances of the same application, the file system space they occupy is not tripled. Basically, they are like clones. So containers are much more efficient in terms of space, and in how fast they perform, compared to virtual machines.

Dockstore is a project that is part of a research grant in the Collaboratory, and it is an OICR-developed Docker registry for bioinformatics. It has 27 bioinformatics tools, probably even more today, and it allows researchers to package their workflows as Docker containers and publish them in Dockstore with examples of how they can be used. It has additional tools to integrate with the CWL language, and it's a very active project: there are daily code commits, and my colleagues at OICR are very willing to work with new developers who are interested in contributing to Dockstore. Docker Hub is the official registry for Docker containers, maintained by the Docker company. It's like a portal where you go and search for Docker containers: say you want an Nginx container, so you search for Nginx and you find 20 Docker containers built by different people based on Nginx. Or you want another application: you search, you find it, you pull the Docker container onto your server, and you execute it. So you don't have to know how to install that application; somebody already built a container that has the application installed. You just have to pull the container and run it, and there you go, you have your application up and running without having to know how to install it, configure it, and so on.
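As a quick illustration of that pull-and-run workflow, here is a minimal sketch using the Docker SDK for Python (the `docker` package). It assumes the Docker engine is installed and running on your VM; the port mapping and container name are just examples.

```python
import docker

# Connect to the local Docker engine (for example, inside your VM).
client = docker.from_env()

# Pull the official nginx image from Docker Hub and run it in the background,
# mapping container port 80 to port 8080 on the VM.
client.images.pull("nginx", tag="latest")
container = client.containers.run(
    "nginx:latest",
    detach=True,
    ports={"80/tcp": 8080},
    name="demo-nginx",
)
print(container.name, container.status)
```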
And with that, I've finished my presentation. If you have any questions, we can go into the lab section, or I can answer questions about what I discussed.

Yes, Mr. Sivak. There are some questions about cost sharing. Yeah, can you just describe what it costs an academic lab to be part of this? And whether the Dockstore programs are free or part of the cost?

Yeah, the Dockstore programs are free; the containers are free to use. In terms of the cost of the Collaboratory, we are trying to be very cost-efficient. To give you an example, one CPU core-hour is $0.03 in the Collaboratory, so if you have a VM with one core and use it for one hour, you pay $0.03. In Amazon, a VM with one core is pretty much useless in bioinformatics; you wouldn't do analysis with a one-core VM. In the ICGC PCAWG work we usually started from eight cores and used even larger VMs. If you compare, an eight-core VM in the Collaboratory costs you about $179 a month, and the same VM, with similar characteristics, in Amazon is about $350 US. So it's more than double to have an eight-core VM with around 50 gigs of RAM in AWS; if you run it in the Collaboratory, it's going to be roughly 40% of the cost. And you have the ICGC data here in the Collaboratory, so there are other benefits.

So your billing for an academic lab might be a couple of thousand dollars a year? It depends. It depends; if you have a very large analysis that uses a lot of resources... Yeah, the reason I'm asking is that we would just be starting this, and we have no idea how much core usage it would have. Well, usually you start slow. And we will do this in the second lab: I'll have a scale-out exercise where you actually have a problem to solve. You have 100 samples to analyze, and you know roughly how long each takes, and we'll go through that in the lab. But yeah, you basically have to build some experience of how long it takes and what size it needs, and then you can forecast your cost based on how much one sample costs and how many you have, plus some overhead for jobs that will have to be rescheduled.

And then, when it comes time for publication, are there any requirements from OICR or anything else? Not that I know of. It would be nice to mention that you did the analysis there, but what you do inside your virtual machines is totally out of our reach.

Yes? So I want to ask how you address questions about how much space and other resources researchers need. Do they have a good idea? Well, basically myself and Jared are supporting the Collaboratory, and we have the website where we provide all the information about the Cancer Genome Collaboratory and how to reach us. Usually when we sign up a new project there is a learning curve at the beginning, where we get more support tickets asking questions, but we are also available for an initial one-on-one meeting where we can offer recommendations, find out what the researchers are trying to achieve and how they plan to do it, and see if we can recommend, OK, this is how you should do it here, or that might work better. But in our experience, for the researchers who have used the Collaboratory so far, the PIs have postdocs or students who are pretty knowledgeable, and it's mostly technical questions, pretty detailed.

Yes, sir? So what are the costs if you need to add your own data? The cost for your own data is $0.00067 per gigabyte per hour; it's on our portal. So let's say you add one terabyte of data, keep it for two weeks, and then delete it, OK? You would pay for that one terabyte for those two weeks.
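Taking the rates quoted in this exchange at face value, a quick back-of-the-envelope check (prices may of course change, and the portal has the current values):

```python
# Rates as quoted above; treat these as illustrative, not a current price list.
core_hour = 0.03        # $ per CPU core per hour
gb_hour = 0.00067       # $ per gigabyte per hour of storage, as quoted

# An 8-core VM running for a full month (~730 hours).
print(8 * core_hour * 730)        # ~175, in line with the ~$179/month quoted

# One terabyte of your own data kept for two weeks.
print(1000 * gb_hour * 24 * 14)   # ~225, if the quoted per-hour rate is right
```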
So we have had researchers who had to bring their own data because they were working on data we didn't have. The downside is that you have to start some virtual machines, SSH into them, pull the data from wherever you have it right now, download it through FTP, SCP, whatever, and then upload it to the object storage. The download is going to be slow, because you're pulling it from somewhere else over the internet; the upload is going to be fast, because you upload it from VMs in the Collaboratory to object storage that is also in the Collaboratory. But it's an extra step. So if you really have to work with data that we don't have, then you have to do that. The greatest benefit, though, is if you analyze the large data sets that we already have: the 547 terabytes of aligned reads for 2,600 samples, and the VCF results from the three workflows that were run against the ICGC data, plus the EGA data. So there is a lot of data that can be used for research already there. And the cost for the object storage is a bit more than S3 object storage, because it's the same cost as we have for the block storage: our block storage, object storage, and images are priced the same way. Anybody else? Yes?

Regarding confidentiality: we would be uploading patients' sample data. Does the system keep that data confidential and comply with the requirements? Yeah, so the buckets that you create are private by default, unless you make them public. The object storage is unavailable from outside the environment, so you cannot make a bucket public and have somebody from China see the contents of the bucket; but somebody from inside the Collaboratory would be able to see the contents of a public bucket. It's up to the user to protect the data they are entrusted with. So if you download some protected data from outside and upload it to your own bucket, and then misconfigure the bucket to be public and it's seen by another researcher, that's your responsibility. But if you leave the bucket you create in the default setting, which is private, then only you can see its contents. I'm not sure if this answers the question.

I'm still a little bit confused about that: whether the segregation of access is guaranteed on your side, the way compliance is guaranteed on, for example, AWS, where you're not allowed to upload certain data. So for the object storage, the internal security settings in the system take care of segregating access and protecting each project's data from other projects. If you want additional guarantees, you can always upload encrypted data. It's the same with Amazon: if you don't trust Amazon to guarantee that data you put in a private bucket stays private from other projects, you can upload it encrypted, and then you are the only one who has the decryption keys. And of course, there are some projects that are very strict and want even more guarantees, in which case, sometimes you cannot.