...as far as I know, 3:30. And it's time to begin our next session. I'd like to introduce Nitesh Turaga of the core team. He is now working at Dana-Farber and making major contributions to Bioconductor, both in the production of containers and in the production of binary repositories that you can use to make your container-based computing go very smoothly. He's going to talk about Bioconductor Docker images for multi-node parallel computing on the cloud. Thank you, Nitesh.

Thanks, Vince. Can everyone hear me OK? I don't know how to check on the folks who are joining us virtually, but blink twice if you can hear me. I'll be talking today about multi-node parallel computing on the cloud using Bioconductor Docker images. First of all, I'd like to point you to the slides and the GitHub repo. I'll introduce the topic with slides for about 30 minutes or so, and then we'll go into a demo where I'll launch an actual cluster and do some toy parallel computing to show you how it's done. Interrupt me with any questions you have; it's meant to be an interactive session. Unfortunately, you can't follow along unless you have a Google Cloud account with a credit card in the system, which is why we didn't make it a workshop. It's going to be a long demo, and I'll try to keep it interesting. If you have questions, again, I insist that you just stop me and ask, so we can have a lively discussion.

So let me introduce the topic a little. Bioconductor started around 2000, and the data back then was fairly small; it started with microarrays. There are people here who can attest to the size of the data and the compute infrastructure better than I can. But the idea then was that R is an in-memory computing model, so you could use your local hardware to do all the analysis you wanted. Data sizes kept increasing, and we got into HPC, virtualized infrastructure, virtual machines, and all sorts of other high-performance computing measures. In the last six to eight years the data kept growing more and more, and now we need a scalable solution, because we can predict that data sizes will keep increasing. So now we use containers and cloud resources, and we pay for what we need on the cloud.

So why do we need cloud and parallel computing? Firstly, because there's a lot of data. Cost is also important: we want to pay only for what we use. (I can't find my mouse.) And the competition for high-performance computing resources within institutions is usually pretty bad: you have to fight with your colleagues, you can't get time on the compute node you want, and so on.

So let's think about what a traditional parallel computing model looks like. Traditionally, you log into a head node of sorts on your HPC system. Have people used high-performance computing infrastructure? OK, a lot of people have. So there's a head node you log into, you choose the size of the worker nodes you want, you deploy your job, it comes back to you, and there's a file system backing it all up. The biggest problem is that you don't have admin rights on any of these machines. So if you're an R user like I am, you need to install new packages and new bioinformatics software, and they're usually not available. So you have to beg your admin: hey, can you please install this? And he's going to say, I don't have time.
And then you compete with your coworkers: there are wait times because someone takes the largest node and uses up all the compute resources. And every worker node needs the same set of software installed, so the admin has to do something drastic and install everything for you; he won't let you do it yourself. I guess the biggest issue is that even if you get past all of these things, it's still not scalable. The little data center behind your cluster is limited to the number of machines it has, and that's it. Say you exceed that limit: tough luck, go find a new institution. That's why we want to replace this with a cloud-based framework that's easily deployable and scalable on demand, theoretically to infinity.

There are many pieces to this puzzle; it's not just, hey, we can do this with the click of a button. There are a few pieces you need to understand, and this little picture I made shows them: we're going to talk about containers, cloud providers, parallel computing backends in R, and Kubernetes.

Let's start with the most important thing: we need to know how to do parallel computing in R. This is the first piece of the puzzle. In base R, lapply() is a very common way to do map-reduce-style work: you have a function and a list of objects, and you apply the same function to every element of that list. That's pretty much it. In Bioconductor, we have a package called BiocParallel, which essentially does the same thing; the function is called bplapply(). The only difference is an extra argument to the function called BPPARAM (you don't need to pay attention to BPREDO and the rest). With BPPARAM you can change the type of parallel computing backend used when you call bplapply().

To delve into this a little more: you apply a function to each element of a list or vector x, and the return value is a list of the same length as the input; that's key. And you can change BPPARAM to select different backends for parallel evaluation. I'm going to talk about a few backends so you understand where I'm going with all of this; a minimal sketch of the calls follows this paragraph.

BiocParallel has several backends; these are just a few. The most basic ones are MulticoreParam and SnowParam, and I'll highlight a couple of differences. MulticoreParam uses a mechanism called forking. You don't need to understand exactly how it works, but each parallel worker is a duplicate of the master process sharing its environment, so the objects and variables you supply are already there, which makes the computing really quick. One major limitation is that it's not available on Windows. Sockets, on the other hand, which SnowParam uses, work on pretty much all operating systems, but each worker runs separately without sharing objects and variables, so everything has to be passed explicitly by the master process. That makes it a little slower, because there's some communication overhead between the processes.
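To make the backends concrete, here is a minimal sketch of the calls just described (toy inputs; sqrt stands in for any function):

```r
library(BiocParallel)

## serial: apply a function to every element of a list or vector
res <- lapply(1:4, sqrt)

## parallel: same semantics, with BPPARAM selecting the backend
res <- bplapply(1:4, sqrt, BPPARAM = MulticoreParam(workers = 4))  # fork-based; not on Windows
res <- bplapply(1:4, sqrt, BPPARAM = SnowParam(workers = 4))       # socket-based; all OSes

## the return value is a list of the same length as the input
length(res)  # 4
```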
And there's a newer backend called RedisParam, introduced by Martin Morgan and Jiefei Wang, which works with Redis, a queue data structure. Redis does a lot more, but we're just using its queue-style message passing, which makes things easier. I took screenshots from my local machine for each of these backends, and you can see that all of them report 14 possible workers. That's because I have a 16-core machine, and the way BiocParallel reports the number of workers is the total core count minus 2; 16 minus 2 gives 14.

RedisParam is open source and available in Bioconductor; it's a new package, in Bioconductor devel now. Redis is an in-memory data structure store: it holds objects and passes them to the different workers. It's essentially a queue: whatever goes in first comes out first. It's fairly simple. There are a few R packages that talk to Redis: one is called redux, which is on CRAN, and RedisParam is in Bioconductor.

So how do you use RedisParam? I'm going to give you a little demo. Let's see if this works, if I can find my mouse. Can I drag this over? I can't find my mouse. Oh, there we go. Is this screen large enough? Is the text large enough in font size? Yes? Great.

I'm going to start an R process here just for demonstration, split my screen, start another process, split it one more time, start another R process, split it once more, and start a Redis server. The Redis server is going to act as the message-passing interface. Everything on the right-hand side, think of as R processes that are workers; the R process on the left-hand side is the master, or manager.

On the manager I say library(RedisParam); I'm running Bioconductor 3.16, so BiocManager reports 3.16. Then I make a RedisParam for us. Since this is the manager, I say is.worker = FALSE, and I give it jobname = "test", in quotes, because it's a string. You'll see that a RedisParam is up: it created an object, but I haven't started it yet. So I start the RedisParam. By default it reports the number of workers as 1,000, which doesn't really mean anything; all you need to pay attention to is bpisup, which is FALSE. When I look at my RedisParam object after doing bpstart, there are zero workers, but it's active.

These two are my workers. I say library(RedisParam), then RedisParam again; if you have any questions, ask me. This is a worker I'm initiating, so is.worker = TRUE. We want it on the same job, and the processes talk to each other through a job name, so I say jobname = "test", then bpstart(p). And now if I look at p on my master, or manager, node, it says the number of workers is 1. Redis is smart enough to notice when something comes up spontaneously; you don't have to do anything. I copy the same code into my second worker, and now the number of workers is 2. So, in the interest of time, I have this GitHub repo with some demo scripts.
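Putting the manager and worker setup together with the parallel call that comes next, a reconstructed sketch of the toy demo (it assumes a redis-server is already running locally):

```r
## terminal 0 (not R): redis-server

## manager R session
library(RedisParam)
p <- RedisParam(jobname = "test", is.worker = FALSE)
bpstart(p)   # active, but zero workers so far

## each worker R session (one per terminal)
library(RedisParam)
p <- RedisParam(jobname = "test", is.worker = TRUE)
bpstart(p)   # joins the job named "test" and waits for work

## back on the manager: the workers are now visible
foo <- function(i) { Sys.sleep(1); Sys.getpid() }
res <- bplapply(1:4, foo, BPPARAM = p)
table(unlist(res))   # two tasks per worker process ID

rpstopall(p)         # stop the remote workers from the manager
```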
I'm going to take a function that essentially does nothing: all it does is sleep for a second and return a process ID. Then we do some parallel work: bplapply over 1 through 4, so the job runs four times and each worker should get it twice. We run this function foo and hand it the backend with BPPARAM = p. You'll see that the results show two distinct process IDs: each worker ran it twice. To wrap up, we can use rpstopall, the function for stopping the workers from the manager; if I say rpstopall(p), it stops the workers. And to fact-check the PIDs, Sys.getpid() in one worker gives 32244, and Sys.getpid() in the other gives 32317. Any questions so far? This is a very simple, basic demonstration of RedisParam. That's pretty much it. Questions, comments? Does anyone have the chat open on the WebEx? Maybe we can see if the folks joining remotely have any questions. Go ahead.

Well, it's very nice. I just wonder if you can say anything about the amount of infrastructure needed on the manager and the workers so that Redis is usable. Is it a simple runtime? Do you need Redis to be installed independently? Does the Bioconductor installation of RedisParam take care of that for you?

No, it doesn't. You'd have to install it separately.

Jumping back into the slides, since there are no more questions, we'll switch to the next piece of the puzzle: Bioconductor containers. Bioconductor produces Docker containers, available on Docker Hub as bioconductor/bioconductor_docker, and many releases are available; I think everything from release 3.11 onward is available as a container. The latest release is 3.15, and the devel image, which is Bioconductor 3.16, is tagged devel. To use these images you say docker pull, then bioconductor, the organization name, then bioconductor_docker, the image name, and then a tag: devel for the development version, or RELEASE_3_X in all caps for a release version; an example is sketched below.

One advantage of these Docker images is that they already have all the system dependencies, including Redis, which Vince just alluded to, so you don't have to install anything separately, and essentially all Bioconductor packages install out of the box. I say 99%, but it's closer to 99.7% or something. The other huge advantage is that you get binary packages for free: when you say BiocManager::install(), you get the latest binary builds of all Bioconductor packages, so you don't have to compile anything on your local infrastructure; installation is 7 to 8 times faster. Docker images are also very cross-platform, so whatever operating system you use, Windows, Mac, or Linux, they're available to you, and you get RStudio on the front end (going forward I believe it's going to be called Posit).
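For reference, pulling and running one of these images looks like this; the run line follows the image's documented RStudio usage, with a password of your choosing:

```sh
# pull the development image, or pin a release tag
docker pull bioconductor/bioconductor_docker:devel
docker pull bioconductor/bioconductor_docker:RELEASE_3_15

# serve RStudio on port 8787; log in as rstudio with the password you set
docker run -e PASSWORD=bioc -p 8787:8787 bioconductor/bioconductor_docker:RELEASE_3_15
```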
So that's the second piece of the puzzle: we've learned some basics of parallel computing, and we've learned about the Docker containers Bioconductor produces. Now we need to know what Kubernetes is. We have these containers, and when we deploy them on our cloud system, across multiple nodes, we don't want to install everything separately; we want a piece of software that does all of that for us. Kubernetes orchestrates everything: it launches the same Docker image on all the available nodes, it makes sure the containers don't suddenly break or fail, keeping everything running smoothly, and if something goes down it brings it back up. You can also do rolling updates; that's not essential, but the idea is that if you want to make a change to your image, you can make it on the fly. So Kubernetes handles deployment, scaling, and management of any containerized application, and our containerized application is this Bioconductor Docker image. Any questions regarding this?

Let me introduce some Kubernetes vocabulary. A node is the actual virtual machine or compute instance on the cloud: quite literally a computer at a data center that's going to do stuff for you. A pod: think of it as one container, just to make it easy. It's the most basic deployable object in Kubernetes, but you don't need to understand it deeply; one container is a good mental model. A cluster is your entire Kubernetes deployment on the cloud. A volume mount is where you store things: you can mount a volume into any of your containers, on your local machine or on the cloud, and you can connect all of your pods to the same volume. There's a piece of software called Helm that acts as a sort of package manager for Kubernetes applications. And K8s is the abbreviation for Kubernetes, so don't get confused by that.

To demonstrate these new terms: there's a Kubernetes API you drive from the command line on your local machine; the command is kubectl, or "kube control", the CLI you install to manage your cluster. Once your cluster comes up on the cloud, say you have a cluster with three nodes: three literal machines running stuff on the cloud. Think of each pod as one Docker container doing something. Node three has three pods doing stuff, node two has two, node one has one, and they can all talk to each other, or not; they can be doing different things. That's the idea of a cluster. Questions?

Okay, so now I'll talk about this Bioconductor Kubernetes Redis cluster, and this is its general architecture. Let's think about it from the user's point of view. A user starts a Kubernetes cluster and installs a Helm chart on it; we give you a Helm package of sorts with this application. (Sorry, I keep bumping into the mic.) You install it on the cluster you've started on the cloud of your choice, connect to your manager node, which runs RStudio on the front end for you to interact with, and you just say bplapply and send jobs to your workers.
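To make the kubectl vocabulary above concrete, these are the kinds of commands you end up running against a configured cluster (illustrative only):

```sh
kubectl get nodes          # the VMs in the cluster
kubectl get pods           # the pods (containers) scheduled onto them
kubectl logs <pod-name>    # output from a single pod
kubectl get all            # pods, services, deployments, and more
```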
You choose the number of workers you deploy, and the jobs you send are routed through Redis. Now let's talk about this cluster a little. In this example you have three nodes, and you have a manager running R and RStudio; RStudio is your front end, the application in your browser. In the back end you have five workers, and your manager can send those five workers jobs of whatever size. You also have a Redis pod and an NFS pod, which are different pieces of software. NFS is a network file system, and all of your pods are connected to it, which means the underlying file system seen by the manager and every worker is the same. They can share information without transmitting it from one place to another, so there's no latency in message passing; object passing, I should say. That's the idea of the cluster, and the little cloud in the picture indicates that all of this runs on someone else's computer, on the cloud. Questions, comments?

All right. Can we check if there are any questions in the live chat? I can try to join it. Okay, great, thank you.

Next is the cloud provider. You can launch this on Google Cloud, Microsoft Azure, or AWS, and the Helm chart is available at this location here. This is work by Alex Mahmoud, who's at the back of the room. He's giving a demo on Friday, so be sure to attend it if you'd like. The main advantages of any of these cloud providers are that they're on demand and customizable, they have a mechanism for running your Kubernetes applications, and so on. They're fairly advanced; everyone uses them for everything now.

Right, so let's get into the demo. This is going to take a while, so bear with me; I'm going to launch everything live so there's no difference between what I'm doing and what I'm showing you. These are the requirements if you want to run it yourself: a cloud provider, a credit card or someone funding you, the Helm chart we provide, Kubernetes, which is available on the cloud, and RedisParam, which Bioconductor provides.

Here's the user flow: start a Kubernetes cluster on Google Cloud, on the Google Kubernetes Engine, GKE; install this Bioconductor configuration with the Helm chart; connect to the manager; deploy some R code using bplapply with RedisParam; gather the results; and delete the cluster. That's the goal, at least.

Now let's see if this demo works. This is the script that runs all of this, so hopefully nothing bad happens. I'll stop all of these local processes, because we don't need them; we're going to the cloud now. I have a demo script called multi-node cluster launch, and I'll put it up here so we can cheat a little. First I need to log into my Google Cloud account, and I'll do that. Great, now I'm authenticated. Then I can start my Kubernetes cluster on Google Cloud. For the sake of this demo I'm going to launch a cluster with two nodes of e2-medium size; that means it's like a four-core machine, so I'm launching two VMs with four cores.
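The launch step amounts to something like the following; the flags are standard gcloud, with the cluster name and zone as they appear later in the demo:

```sh
gcloud auth login   # authenticate with your Google Cloud account

# two e2-medium nodes in us-east1-b
gcloud container clusters create my-gke-cluster \
    --zone us-east1-b \
    --num-nodes 2 \
    --machine-type e2-medium
```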
So now it launches the cluster, which is going to take a second or two; maybe three or four minutes, actually. So if you have any questions, now is the time to ask. Anything is fair game. This is going to take a while, guys; that's why I asked for 90 minutes.

My question is: you mentioned you have four cores on each machine. If you have a process that uses one core, you're wasting three cores. Can you use Redis to parallelize within the machines, scale up and down as needed, and share objects when workers run on the same machine but not when they don't?

You can scale them; there are commands in Kubernetes to help you scale, but there's no autoscaling right now. Is that what you meant, whether it's smart enough to shut down nodes and such? Was that your question?

It was more about object sharing and environment sharing, because if you have, I don't know, a huge Seurat object or whatever, you don't want to copy it.

The object doesn't get copied. All the workers and the manager work on the same underlying file system, so there's no copying happening.

But it has to be in memory, right? So it needs to be duplicated in memory; that's what I mean.

What I'm saying is you copy it once to this file system, and your object is then accessible to the manager and the workers. It happens just once.

Okay, thanks.

I think it may be worth pointing out that there is another package relevant here called SharedObject. I haven't used it, but I think, if that's where you're going, that you do not want replication when these machines are connected memory-wise, the SharedObject package, which is by Jiefei Wang, should accomplish that for you. Nitesh, have you done anything with that?

I have never used SharedObject.

So there's infrastructure to help with that. I don't think it's been exercised very heavily, so if you want to, discuss that.

I think Alex has a comment.

With the network file system, each pod basically thinks the object is local. The object gets serialized to the network file system; if a pod is on the same node, there's a cache on that node and it's used directly, and if it's on a different node, the data is sent over the Kubernetes internal network from one node to the other. But from the perspective of each pod it looks local; the network file system deals with actually moving the data, and from the user's point of view there isn't that much latency.

Any more questions? Natasha from NanoString. I did have some questions. This is kind of like a job manager, and in some cases you're doing first-in, first-out queuing. How much job managing can you actually do with this?

It's not really meant for deploying batch jobs; it's more for interactive computing. The way most people use R is interactive, on the fly; that's how most people use the bplapply command too, although you can use it in a workflow and let it run overnight doing a lot of things. The idea is: I'm going to deploy a job and I want my results back now. I don't know if I answered your question.

Yeah, no, that's okay. And for bplapply, have you tried running multiple different functions? What if I wanted to set up pods with different containers and run different things on all the different containers at the same time?

You could do that. You'd have to design your own cluster, though. But you could do that.
Yeah, this is taking longer than usual because of the Wi-Fi, so I encourage more questions.

I have a question, if that's okay. Aside from Google, AWS, and the other one that's escaping me, are there any other cloud service providers that you know of that are good, or that aren't?

Yeah, so I think IBM has its own cloud. There's DigitalOcean, there's Alibaba, there's OpenStack, which is open source; not free, but open source. And then there are NSF-sponsored clouds as well, like Jetstream, which are for researchers: you apply with a grant or something, you get some service units as credits, and you use those.

Aside from all the benefits associated with the Kubernetes platform, and the learning curve, what do you think some of the drawbacks are?

I think the biggest drawback is the expertise you have to develop. If you remember my initial puzzle with its different components, you have to gain some expertise in each: being able to launch and take things down, avoiding excessive spending. That would be one, but Vince might have a better answer.

Yeah, I mean, I think the whole problem of debugging in this context becomes a potentially very confusing and difficult situation. That's why we are thinking about the idea of developer services, so that Bioconductor can help you experiment with these things in a very low-cost or zero-cost environment. That's a very new set of facilities that we have, and I think if we follow what Nitesh is doing, you see your first experiment; and I don't know if you are ever going to demonstrate how something can crash and what you do to recover, but those are the kinds of things we need a lot more experience with.

Yeah, that's where the expertise in Kubernetes comes into play, and just being able to navigate these cloud providers' dashboards and so on. That's a big thing.

But coming back to the demo, we see that the cluster has launched. It's called my-gke-cluster, and you'll see that it launched in us-east1-b, the machine type is e2-medium, the number of nodes is 2, and the status is running. Next we want to attach a disk, so that all the nodes have access to the same disk, a persistent disk per se. If you compute on some data on your cloud service and want to keep the disk after your computation, because the data is too large, you can keep this persistent disk available to you on the cloud for as long as you need. You can bring the cluster up and down, create it and delete it, but keep the persistent disk.
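Creating that disk is one more gcloud call, roughly as follows (disk name assumed; size per the demo):

```sh
# a persistent disk the whole cluster will share, with a lifetime
# independent of the cluster's
gcloud compute disks create my-persistent-disk \
    --zone us-east1-b \
    --size 200GB
```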
So we'll do that. Um, great, so that's what happened just now: I've selected a disk of around 200 gigs, for I/O performance. This is a demo; you can do more or less depending on what you want. Once it's created, it says: hey, this persistent disk has been created in this zone, this is the size, and the status is ready.

Once that's done, you need to get some credentials from the cloud so that you can pass commands from your local machine. So gcloud container clusters get-credentials says: hey, Google Cloud, give me credentials for this cluster so I can run commands against it. It says fetching cluster endpoint and authentication data, and the kubeconfig is now ready. So when I say kubectl get all, it shows that the cluster is active, and it shows the two nodes.

I still need to launch my actual Helm application, and that's what this does; I'll walk through what each of these lines means. The first line: helm is the command-line tool you're using to install this application onto your cluster, and you give the installation a name, my-redis-demo. Then the workers: I'm setting the number of workers in this cluster, and I'm choosing 3. I also need to choose a Bioconductor image version: instead of 3.14 I'll say 3.15. Instead of a release I could use devel, so I could say 3.16 here, or I guess the tag would be devel; I'm not sure, actually, so let's just use 3.15 for the sake of the demo. And I created a persistent disk, and everything in my cluster needs to attach to that same disk, so we give it the name of the disk. Don't worry about the rest. This is the name of the application, the GKE helm chart demo, and we'll remove the wait flag; I need to go back one. So that's going to install this Helm chart on my GKE cluster, and it's going to take a minute.

Now if I say kubectl get all, you'll see that it's creating all of these containers. What that means is that each pod is going to pull the correct container and run it on the cluster, and this is also going to take a while. Any more questions? No? That's a really good point, yes; we can go through the Helm chart a little here.

In this GitHub repo there's this GKE helm chart demo, and if I click on it you'll see templates and values. The templates are essentially templates of the Kubernetes application, and the values are preconfigured default values, all configurable of course, so you can see exactly what's being run. These are pre-templated values, and this is the config file for my manager deployment. I don't know if you're familiar with these configuration files, but the idea is that we need to tell the manager pod what container to pull down. So you give it an image name and a tag, and an image pull policy, which is like: hey, always give me the latest image, so we say Always; that's one of the options. Then you tell it what volumes need to be mounted; there's an NFS-type persistent disk being mounted here. And RStudio is the application running on the front end, so you need to give it a username and a password. This is the password you can pass in, and the name of the value is
called password; the value itself isn't actually "password". Since it's a web application, it needs to be hosted on a port of some sort, and 8787 is the port we choose. If there's anything extra you want to do, any extra commands to run, like an Rscript -e with some job, you can pass that in at this location here, and it will run that job for you automatically. And this restart policy says: if it fails, pull it up again, restart it. So you have a config file like this for the manager, and likewise config files for your NFS, your Redis service, RStudio, and your workers.

Yeah, this is where things get difficult for me. You've called this the GKE HelmChart demo, and it really seems like a significant product: that code base sitting there with all those defaults and that entire logical structure. In your helm install you've overridden a lot of those parameters. So is this regarded as a software product? You call it a demo, but it seems to me that this has to be a kind of tested infrastructure, with validation and so forth. Are you looking at it in that way?

So there's a package called BiocKubeInstall, which is probably a little advanced for this session, but in the inst folder of that package there is a Helm chart which is essentially the canonical one. It does exactly what we're doing now, but the reason I pulled this one out is to avoid the complexity of what that Helm chart is doing, to show folks a simpler Helm chart. But I agree with you in the sense that it is a product that can be evaluated, tested, and delivered in a better way. I don't know if Alex can speak to that a little more. I mean, we do have bioconductor-helm, which does everything for a single-node cluster, and it teaches you exactly how to launch things; you can launch it on your local machine on a minikube cluster (minikube essentially does what Kubernetes does, but on your local machine), or on Azure, Google Kubernetes Engine, or AWS Elastic Kubernetes Service, and all the instructions are in there. Alex had a comment.

I think the chart that was made for BiocKubeInstall is a really good concept, and it has the NFS and Redis all deployed within the chart. Moving forward, in order not to have to maintain the NFS and Redis pieces ourselves, we're going to try to give the Bioconductor Helm chart dependencies on charts maintained by the Kubernetes community, like the NFS Ganesha chart and the chart maintained by Bitnami, and use those as dependencies. That offloads the maintenance requirement for the other services going forward, so we would only maintain the Bioconductor Helm chart, link it to the other ones, and put in the necessary linkage, but not have to maintain the full stack.

So what Alex is essentially saying is: we do a lot of work on that Helm chart that we don't need to do. We can pass off the maintenance of the Redis config and the NFS config to folks who are experts in that, and just add dependencies in this YAML file saying Redis from here, NFS from there, and they do that work for us.

Let's check on our cluster. Okay, so our cluster is up and running. Of course, for the folks joining remotely, if they can hear me and have the slides: these are the links to the GitHub repos. I don't have the chat open, so take a screenshot in five, four, three, two, one.
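Condensed, the install step narrated earlier looks roughly like this. The chart path and the --set parameter names are placeholders, not the chart's real ones; consult its values.yaml:

```sh
# hypothetical parameter names; check the chart's values file
helm install my-redis-demo ./gke-helm-chart-demo \
    --set workers=3 \
    --set biocVersion=RELEASE_3_15 \
    --set persistentDiskName=my-persistent-disk

kubectl get all   # watch the pods come up
```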
So this is great. Now that we have this available; where's my example demo, multi-node... great. We have our Kubernetes application up and running, and now I need to get my RStudio link. There's a very handy command here; you don't actually need it, but it's handy. If you remember, the port was 8787, and if you say kubectl get all, your RStudio load balancer will have been assigned an external IP address, and it's going to be this one, more likely than not. We may not be able to access it right now because of the firewall, but we'll try anyway... oh hey, we can. The password I set is bioc and the username is rstudio, so hopefully we'll be able to log in. (My backup, if it doesn't work, is to switch to the wireless hotspot on my phone.) But it seems to be working: the RStudio server is up, and you'll see the Bioconductor version is, as we chose it, 3.16. Did we choose 3.16? OK, we did, and it's available to us.

This is using a Bioconductor container, so if I want to install a package and I just say BiocManager::install(), I get a binary package now; I don't get any source package to compile. For example, if I say Rhtslib, it just happens super quick. There might be a few exceptions, but I wanted to show you a demonstration of the binary package installation. Give it a second... I see, Vince is right, we did set 3.15. So the workers are getting 3.15, but the manager is always devel. Once you send the work, it doesn't actually matter what version the manager is on: the manager can be one version and the workers another, and they'll still talk to each other. Thank you for pointing that out; I forgot that little piece of information. But you'll see here in the installation that it just pulls the package from a binary store, an object store on Google Cloud, and it doesn't compile anything: it just unpacks it, says installing binary package, and it's done. So it's very fast.

You made it seem as if it's okay that the manager has 3.16 but the workers have 3.15. If you had an object that was, say, 3.16-dependent, then it would have some problems. So I think a check on the consistency of Bioconductor versions should be a default as this thing goes forward.

Right, absolutely.

So now I want to demo this multi-pod, multi-node parallel computing on the cloud. We have two nodes and, I think, three workers. I have this demo script, sketched below: I set the number of workers to 3, I name my job binary build (that's hard-coded, for whatever reason), and I run the function 13 times across the workers. So one worker is going to run it 5 times, another worker 5 times, and the next worker 3 times, and we do a bplapply after starting the RedisParam. Everyone with me? This is exactly like the RedisParam demo we ran earlier. We load the RedisParam library; the Redis server is run by a pod, so we don't have to start one, and the manager node and worker nodes already know the Redis server is active. We set some system variables to make sure the right environment variables are detected, start the RedisParam, and you'll see it recognizes the number of workers as 3, because we're giving it 3 workers. It hasn't started up yet, but bplapply starts it automatically once you run it.
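The cluster-side script boils down to something like this sketch; the Redis host environment variable name is an assumption, so check the RedisParam documentation and the repo's script:

```r
library(RedisParam)

## point the manager at the Redis pod's service (variable name assumed)
Sys.setenv(REDISPARAM_HOST = "redis-service")

## sleep a second, then report which machine ran the task
fun <- function(i) { Sys.sleep(1); Sys.info()[["nodename"]] }

p <- RedisParam(workers = 3, jobname = "binary_build", is.worker = FALSE)

## 13 tasks over 3 workers (5 + 5 + 3), so elapsed time is about 5 seconds
system.time(res <- bplapply(1:13, fun, BPPARAM = p))

table(unlist(res))   # tasks completed per worker node name
```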
So if I say bpstart(p), it knows there are 3 workers available. Then we run essentially the same function: sleep for a second, then give me the node name. It's not a process ID anymore; it's the name of the actual machine the task ran on. We call it fun, and we time it so we know how long it slept: in 5 seconds we should have all three workers report back. That's the idea. We give it a number of jobs that doesn't divide evenly among the workers, so we know everything is happening correctly, and we time it with system.time, saying bplapply over 1 through 13. You'll see the elapsed time is about 5 seconds, 5.422. Now, res looks like a list, and we don't want to look at that directly; we want to unlist it and then tabulate it so we can see how many jobs each worker ran. You'll see that two of the workers ran it 5 times and one ran it 3 times. And to fact-check that the worker names are correct, you can compare the names here to what Kubernetes gives you back, and you'll see that they match: this one and this one are the same.

So you've launched the cluster and done some work on it. The next big step is to take it down, because if you don't, you're going to keep getting billed, and that's the biggest thing. You delete your Kubernetes application first: helm status my-redis-demo shows that it's up, and helm delete deletes everything on the Kubernetes cluster. Then you delete the persistent disk, unless you want to retain the information on it; since we haven't done any useful computation, we can delete it. I'll say yes, and that happens in the background. And then you also delete the two compute instances, the nodes, so your VMs aren't just sitting there running; you do that with gcloud container clusters delete. It's deleting, and now it's all gone from the cloud.
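The teardown sequence condenses to a few commands (same assumed names as above):

```sh
helm status my-redis-demo   # confirm what's deployed
helm delete my-redis-demo   # remove the Kubernetes application

# remove the disk (unless you want to keep the data) and then the nodes
gcloud compute disks delete my-persistent-disk --zone us-east1-b
gcloud container clusters delete my-gke-cluster --zone us-east1-b
```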
Do I have anything else? Yes, I do have something else. I wanted to talk about an example application we've been using for a while now. I showed you a demonstration of binary packages being delivered to you, and that is one of the biggest applications of this Helm chart. The way we do it is: we compute a dependency graph of all the R packages in Bioconductor and make the build a parallelizable process. In this leveled traversal of the graph, level 0 is all the packages with no dependencies, level 1 is the packages whose dependencies are all at level 0, level 2 the next layer, and so on. When we pass these packages to be built on the Kubernetes cluster, we pass them in by level, and within a level the workers get one package at a time. So if we have 50 workers and 500 level-0 packages, the first 50 go out, and as soon as one finishes, that worker gets package 51, then 52, and so on; there's no latency there. Once level 0 is done, we do level 1, 2, 3, and so forth; it goes up to level 17, I think. We use the same cluster architecture for this to work in the background on the cloud, we store these binaries in an object store on Google Cloud, and we deliver them to folks.

So this is the BiocKubeInstall package, which I showed you. It uses RedisParam and works on Google Cloud, so it essentially uses all the pieces of the puzzle I just demonstrated: parallel computing, Kubernetes, and cloud availability and infrastructure. You need to know something about each of those, and you also need to know some Kubernetes.

I just want to give acknowledgments to the folks on the team: Martin Morgan, Vince Carey, Jiefei Wang, Marcel Ramos, and Alex Mahmoud. It takes a lot of people to get this sort of work done. And that's it. If you have any questions, ask me; we still have like 25 minutes. Questions, comments?