Everyone, thank you for hanging in. It's getting late in the day here, but we still have a lot to cover, so I am happy to present Erdal Cosgun from Microsoft, who is going to give a package demo for you today.

Hi everyone, welcome to our session. First of all I would like to thank the conference organizing committee, and welcome to Seattle. This is joint work with Nitesh, one of the core engineers on the Bioconductor project. We started our collaboration with the Bioconductor team two years ago, and today we are finally going to see the outputs of that collaboration. I'm a data scientist on the Microsoft Genomics team under Microsoft Research, and we cover cloud-based genomics solutions for our partners and stakeholders.

Today's agenda: I will start with what the Data Science VM (DSVM) is, what its contents are, and what extensions mean for Azure Data Science VMs. I will share the built-in features of the Data Science VM templates, then walk through the architectural design of the custom deployments and how Bioconductor users can use the existing ARM templates. An important part of these conversations is that every researcher or data scientist needs scalable compute, and if you are using Bioconductor you need to install tons of tools and packages; with this kind of approach you can easily customize what you need, share it, and make it scalable. You can find all the session content, including the video recording I made for you, on the GitHub repo, along with the architectural design and the ARM templates; if you would like to deploy the custom Bioconductor VM, the ARM templates are in the repo. I would like to pause a couple of seconds for the link.

Let's start with the Data Science VM. As a data scientist I usually prefer virtual machines for my prototypes and initial analysis, just a single VM, but I need R, Python, and other deep learning or machine learning libraries, and on the Azure Data Science VM all these tools are built in. If you would like to use R, you already have an R kernel; if you would like to use Python, it's already there, along with JupyterLab and JupyterHub. Another advantage of using this VM template is that we can create our own sub-templates for genomics data, and you can keep using the deep learning libraries that I will deep-dive into, or BI tools like Power BI or any other BI tool you like. It's a kind of package: you can use your data on premises or in the cloud and deploy your own solutions, and today I'm going to show you how we can add the Bioconductor dependencies, the whole Bioconductor installation plus specific package installations, with custom ARM templates.
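To make the "everything is built in" point concrete, here is a minimal sketch (the tool list is just an illustrative sample, not the DSVM's full inventory) that you could run in a fresh DSVM terminal to confirm the preinstalled stack:

```bash
# Quick probe of a freshly provisioned Ubuntu Data Science VM;
# exact versions depend on the DSVM image release you deployed.
for tool in R python3 jupyter docker; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool -> $("$tool" --version 2>&1 | head -n 1)"
    else
        echo "$tool -> not found"
    fi
done
```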
Usually I prefer Linux and the Linux distributions, but if you'd like to use Windows you can definitely do the same installations with similar ARM templates on Azure; today I'm going to focus on the Linux side. You can easily use the Spark notebooks or Spark environments on the VM too, and with a connection to your cloud environment you can use any Python libraries, as I mentioned earlier; you can also easily create automation tasks with these kinds of ARM templates. You can do similar things on Windows, but today we are going to focus on the Linux version.

These are the features I usually highlight in my talks. This list is important because we have tons of features, but some things are missing: for example, if you would like to collaborate in RStudio, you cannot do that on the Data Science VM, but you can in JupyterHub or JupyterLab, without installing third-party tools. As a data scientist I prefer TensorFlow, and it's already there. If you'd like to store your variants, the genomic variant information, SQL Server is already installed in the Windows version, and in the Linux version you can easily connect to your SQL servers in the cloud or to Synapse. This list is important, so after the session please feel free to reach out to me with questions.

GPUs matter for data science in genomics, so I can create my own custom Bioconductor virtual machine, and if I would like to go to the tertiary analysis side I can easily use the NVIDIA drivers, PyTorch, or TensorFlow. Most data scientists would like to combine deep learning with their existing genomics pipelines, and I think this is a great chance for Bioconductor users on the tertiary analysis part.

The Linux extensions are a feature that lets you install anything you need after the VM deployment phase. You deploy your virtual machine, and right after the deployment the shell file, the installation file you have, is submitted to your deployment, so you don't need to work through an installation checklist by hand. It's a shell file that installs everything, and I will show Nitesh's shell file for the Bioconductor dependencies today. There are tons of ways to deploy your VM or custom VMs: you can use the Azure CLI or the command prompt on your Windows machine.

This is the high-level architecture of the custom Bioconductor Data Science VM, and you can find the ARM templates and all the installation files in the GitHub repo. The first thing to highlight: you need two JSON files, and they are ready for your use, because the ARM templates are already on Azure, so you don't need to customize them. The second thing: you need to upload your shell file to blob storage, an FTP server, or any other place on the web, because the deployment process needs to read your shell file from a specific location; I prefer blob storage, and a rough sketch of that upload step follows below. So I have my two JSON files and the shell file for the custom installation, and as a user I can run the custom deployment commands from my command prompt or the Azure CLI.
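As a rough sketch of that upload step (the storage account, container, and script filename here are hypothetical placeholders), the Azure CLI can push the shell file to blob storage so the deployment can fetch it later:

```bash
# Upload the Bioconductor dependency script to blob storage; names
# are placeholders, and any web-reachable URL (FTP, GitHub raw, etc.)
# would work just as well.
az login
az storage blob upload \
    --account-name mystorageacct \
    --container-name scripts \
    --name install-bioc-dependencies.sh \
    --file ./install-bioc-dependencies.sh
# The resulting blob URL is what the template's fileUris setting points to.
```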
Once the deployment finishes, I have my own custom VM built with the shell file I pointed to, and JupyterHub or RStudio will be running on that VM, like a service or app. So it's not "hey, I created my VM, now how do I access it?": you have two options, JupyterHub or RStudio.

This is the shell file that Nitesh shared on the Bioconductor Docker repo, and it's really important for Bioconductor users; I'm a Bioconductor user too, at the end of the day. Once you use this shell file you can easily check all the dependencies and continue using your own package code or the other Bioconductor pipelines you have. The resolution isn't the best, but it's good enough.

The first file I mentioned is the template JSON file. In this JSON file you can specify the admin password, the network security group names, or any other arguments you would like to add to your template. I usually don't change this template because it's already there, but if you would like to customize it you can: you can change the admin password or authentication method, or even the operating system, though we usually prefer the existing version. As I said, you don't need to change it, but it is one of the files you need.

The most important parts of this template JSON file are the four lines I highlighted. As you know, the virtual machine images keep changing; this is a maintained image, but if you would like a different version of the Data Science VM with a different configuration, you can customize your template JSON. The first red box represents the date of the Data Science VM version, and the second box represents the link to the Data Science VM image; it's public, and this JSON file pulls the image from Azure. This matters because Microsoft will probably launch different versions of the Data Science VM, and then the only thing you need to do is change that link in the template JSON.

From the parameters perspective, here is what you need to know for customization. Remember Nitesh's shell file: I uploaded it to my blob storage account and just edited the link to the shell file. If you change this link to another shell file, you can install whatever other tools you need, and you can even add more than one shell file to your deployment. The command to execute is the other important part of the template: it's a regular Linux command, sudo bash followed by the Bioconductor dependencies shell file, exactly what you would type in the terminal of a basic Linux VM, but in this custom deployment you don't need to; you just add the command to your template JSON file, as in the sketch below. Any questions so far? So you need a shell file, a template JSON, and a parameters JSON file, and they are already on GitHub.

The parameters JSON file is the key to the built-in apps, because on Linux or Windows VMs you need to access your VM via RStudio, JupyterHub, or JupyterLab, and therefore you need to open ports. In this example we just added the SSH port and the JupyterHub, JupyterLab, and RStudio Server ports.
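To show how those two settings pair up, here is a minimal sketch that applies the same Custom Script extension to an already-deployed Linux VM from the CLI; the resource names, blob URL, and script filename are placeholders, and the ARM template in the repo wires up the identical fileUris/commandToExecute pair at deployment time:

```bash
# Run the dependency script on a deployed Linux VM via the Custom
# Script extension (names, URL, and filename are illustrative).
az vm extension set \
    --resource-group my-bioc-rg \
    --vm-name my-bioc-dsvm \
    --name CustomScript \
    --publisher Microsoft.Azure.Extensions \
    --version 2.1 \
    --settings '{
        "fileUris": ["https://mystorageacct.blob.core.windows.net/scripts/install-bioc-dependencies.sh"],
        "commandToExecute": "sudo bash install-bioc-dependencies.sh"
    }'
```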
Whenever I deploy my VM, I check the VM's IP, then add the port from my deployment to it. Most researchers have questions about security, and this is important: in this parameters JSON file you open the ports, and you can add more security features, including the authentication method and a specific network security group in your cloud environment. This means your data is in a secure place and your VM is running in a secure environment.

The question was: how can we convince the IT folks to accept these kinds of deployments without opening the ports to the public, restricting access to specific users, maybe just your lab? The answer is network security groups; that phrase is the key. Every lab or team can have a network security group, and only those colleagues can access this VM on this port. As I will show you, you also get Azure Active Directory integration, or the regular JupyterHub integration with usernames and passwords or two-factor authentication. The short answer is network security groups, and we have solved tons of problems on different projects with them. Federated learning is a very popular example: different groups work on the same project but don't want to share their data, and we can solve these kinds of issues with network security groups plus the federated learning options on the Data Science VM. I hope that answers it.

So these are the three files I need: the template JSON, the parameters JSON, and the .sh file. How do I deploy? I will show it in the recorded video, but these are the commands for deploying the Data Science VM. First of all you need a subscription, and I will share how you can get one for research; the subscription is the first thing, and you need your subscription ID. Then the resource group name: you create a resource group for a specific user or user group and give those people permission to access this VM. The template URL location is important, as are the parameters JSON location and the location for the VM. We have more than 60 regions, so which region is best for you? Different countries, for example in the EU, have different rules and may not be allowed to use data centers in other countries, so be sure you deploy your VM in the right region with the right configuration; you can get support from other technicians on this. Then the virtual network ID, which is another line I wanted to show, the regular admin password, and the virtual machine size, where you have tons of options, including GPU and CPU machines and some confidential VM options. Finally the network interface name, the network security group name, the virtual machine name, and the admin username. These are the basics for deploying your own VM; you can remove arguments or add more, and we have more than 50 different arguments for the Data Science VM deployment, but these are the defaults I use for deploying the custom Bioconductor VM; a sketch of the full command follows below.
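Here is a rough sketch of that deployment command, assuming the parameters file has been fetched from the repo; every ID, name, and password below is a placeholder, and the parameter names mirror the ones mentioned in the talk but may differ in the actual template:

```bash
# Hypothetical custom Bioconductor DSVM deployment; substitute your
# own subscription ID, region, and credentials before running.
az account set --subscription "00000000-0000-0000-0000-000000000000"
az group create --name my-bioc-rg --location westus2
az deployment group create \
    --resource-group my-bioc-rg \
    --template-uri "https://<template-url>/template.json" \
    --parameters @parameters.json \
    --parameters adminUsername=biocuser \
                 adminPassword='<strong-password>' \
                 virtualMachineName=bioc-dsvm \
                 virtualMachineSize=Standard_D4s_v3
```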
Another question probably on your mind is which virtual machine family is best for Bioconductor. There's no single answer, but I usually prefer the D series for my analyses with the Bioconductor packages, and if I would like to go with GPU-based tertiary analysis I prefer the N series. You can find all these details in the link I shared, but the important part is that you don't need to deploy a VM with the highest configuration: maybe you don't need 146 GB of memory, maybe you don't need two terabytes of disk. I joined different workshops at this conference, and from my observations people need at most one or two terabytes, so why should I attach a four-terabyte disk? You may well need more, but one of the advantages of these custom templates is that I don't have to attach four or five terabytes on every deployment; whenever I need more disk for a new dataset I can attach it by adding one new argument to my deployment. So the storage is definitely customizable, and the deployment is really easy for users without any cloud experience.

Another thing, as I mentioned earlier: for my prototypes I just deploy one VM, but whenever I need to scale up my analysis I should be able to do that easily, and with the custom Data Science VM images I can create my own scale set. This means I can create 10 or 20 VMs with the same configuration, all connected, and maybe share different VMs with different researchers; there is a rough sketch of a scale set at the end of this part. I will be happy to go into detail if you have questions at the end of the session, but the scale set is one of the advantages of using a custom Bioconductor Data Science VM, because I don't need to worry about the scalability of my analysis.

Before skipping to the recording: maybe you don't have a subscription, maybe you don't have enough budget. We have different options for researchers. As a Microsoft Research data scientist I have onboarded maybe more than 10 different open-source projects to our environment, and we try to support these projects and contribute to their success, publish papers with them, and create joint workshops with them. You can visit the links in the presentation; there are also academic research grants you can apply for, and if you have students who would like to create a Bioconductor Data Science VM, you can forward them the link in the black box to get a grant, or individual students can apply themselves.

By the way, I needed to record the deployment: I couldn't use my own PC here, and I cannot log in to my Microsoft account from just any PC, so that was a blocker. So I recorded this video and will walk you through it; please ask your questions if you have any. This is the GitHub repo, as I mentioned earlier; it's public, and you can find the short description of the architectural design, the template JSON and parameters JSON files with the .sh file to use, and I added the feature table for the Linux VMs and the deployment command for your use. It's ready to use; you just need to customize your subscription ID and the other parameters in the list. At the very bottom of the page you can see the links for the template JSON file and the parameters JSON file; you don't need to change or customize them.
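On the scale-set point, a rough sketch (the custom image ID and all names are hypothetical): once the customized DSVM has been captured as an image, identical instances can be stamped out in one command:

```bash
# Create 10 identical VMs from a captured custom Bioconductor image;
# the image resource ID and names below are placeholders.
az vmss create \
    --resource-group my-bioc-rg \
    --name bioc-scaleset \
    --image "/subscriptions/<sub-id>/resourceGroups/my-bioc-rg/providers/Microsoft.Compute/images/bioc-dsvm-image" \
    --instance-count 10 \
    --vm-sku Standard_D4s_v3 \
    --admin-username biocuser \
    --generate-ssh-keys
```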
Back in the repo, you just need to change your subscription ID to use it. As I explained in the presentation, you can find the other arguments you can change or customize, like the network security group IDs, the network interface IDs, or the version of the Data Science VM, and if you would like to install other shell files you just need to change that one line. In the parameters JSON file you can change your ports, your network security IDs or groups, or the specific configuration for the RStudio server or the Jupyter server. The .sh file we used for this command comes from the Bioconductor core team's link; thank you, Nitesh, for preparing this for Bioconductor users. I use it too, and all these deployments come from there, so you can easily test it.

For the sample command, I shared it in a text file: you just copy the command, customize it in your text editor, and change your subscription ID, admin password, and usernames. This is the default, basic configuration; I just copy and paste from the text file, open my cmd, paste it, and that's all. Before going to the next stage I would like to highlight a specific point: you need to log in to your Azure account on your CLI or command prompt, so don't forget to log in.

After I submit my deployment, I can easily track what's going on. If you log in to the Azure portal, you can see the progress of the deployment: the network security group, the public IP address, and the virtual machine provisioning phases. This is good because I can share the portal with my colleagues and they can track the progress too.

In the next phase, remember the username and password from my commands: I use them to log in to my JupyterHub. The important part is that you need to know your VM's IP from the Azure portal; the port is 8000 in this case, and you need the password and username from the commands I shared. Once you enter your username and password and sign in, JupyterLab comes up from JupyterHub; this is the initial page. Remember that we have the shell file and we can install the extra packages easily. This VM is running JupyterHub, and the Julia notebooks, Azure ML notebooks, Python notebooks, and R notebooks with the R kernel, plus the other markdown files, are already there; you don't need to install anything. You can also start with the terminal, because this is a regular Linux VM, just with the custom Bioconductor installations.

If you would like to test the installation, just use one of the R kernels. I click on my R notebook, and as you can see the R kernel on the right side is now running. The regular, classic Data Science VM has no Bioconductor installation, but with this shell file everything is already installed, the base Bioconductor packages and the dependencies, and then I can easily install anything in my notebooks, as in the sketch below. I use this for workshops and specific courses, because I can create my own image and every single student will have the same VM.
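As a minimal sketch of that test (GenomicRanges is just an example package, not one singled out in the talk), the same check works from the JupyterHub terminal or, without the Rscript wrapper, from an R notebook cell:

```bash
# Confirm the shell file's Bioconductor setup, then install and load
# one example package to prove new installs work on the custom VM.
Rscript -e '
  if (!requireNamespace("BiocManager", quietly = TRUE))
      install.packages("BiocManager")
  cat("Bioconductor version:", as.character(BiocManager::version()), "\n")
  BiocManager::install("GenomicRanges", update = FALSE, ask = FALSE)
  library(GenomicRanges)   # loads cleanly only if the install succeeded
'
```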
You can track execution minutes or seconds and how many cores you are using for Python; in R you can track the kernel status and the execution requests. Another way to install the same Bioconductor packages: I can open the terminal from there and just type R, since R is already there, and install the same packages from the terminal too. So I have two options for checking the deployment: the terminal or the R kernels.

Another thing I would like to highlight is the application list. Remember the shell file Nitesh created: there are various libraries it already installed on my VM, and you can check the application list, the library list installed by this shell file. This is good because I can easily prove there was no issue installing from the external shell file, and you can compare this list of libraries against the video or the templates I shared with you. The last part is to check the specific packages we installed with Bioconductor: I compare the installed libraries with the expected ones and confirm the package names are available on my VM. I have tested this several times, and I can easily install any library I need.

That's all the content I have for today. As I mentioned earlier, please feel free to send an email to me or Nitesh. The main idea of this session is that you can customize anything on the Data Science VMs, and this is one of the earliest examples of a custom Bioconductor Data Science VM. Since I shared these shell files and commands I have seen the traffic: many people are looking for these kinds of built-in solutions, and you will have all the libraries plus Bioconductor in this deployment. Thank you.

We do have one question that came in on the chat: what is the advantage of using the DSVM over just a Docker container of the application launched in a plain VM? That's a great question; thank you, Alex, who asked it. The cloud is not just a VM: the cloud has storage accounts, servers, and databases, and when you deploy this VM you get a secure connection between the servers, VMs, and storage accounts. Once you install a Docker image you can definitely use it, but you have to work through the security issues of individual Docker images yourself, whereas once you deploy this VM it's really easy to connect to different services, including the storage accounts. And we will have Terra on Azure soon, so when you have a Terra on Azure instance you can use a similar approach. I hope that answers the question.

There's a follow-up question: is there a built-in supervisor or some other process recovery to bring the application back up if it errors out? There is an error message on the portal: you can see the error message if you have any issues with the installation of the applications. Usually I test these shell files first, and I recommend the same to researchers: test your shell file on your own PC, then use it. But yes, there is an error message and a definition of the error on the portal; a CLI-side sketch of that check follows below.
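On that error-visibility point, a small CLI-side sketch (resource names are placeholders): the Custom Script extension's provisioning state, including failure messages from the dependency install, can be read without opening the portal:

```bash
# Show each extension's provisioning status on the VM; a failed
# shell-file install surfaces here as well as in the portal.
az vm get-instance-view \
    --resource-group my-bioc-rg \
    --name my-bioc-dsvm \
    --query "instanceView.extensions[].{name:name, status:statuses[0].displayStatus}" \
    --output table
```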
Sorry, maybe I can summarize your question: you deployed a couple of VMs and you might lose your data, so how can you be sure that your data is secure and backed up? Is that the right framing? OK. There are different features on the Data Science VM: you can define specific intervals for backing up your disks, or you can install one of the Azure SDKs and do it manually, but I usually prefer to set up automated backups for my VMs in different regions. Let's say I'm working in the UK but I would like to back up to US servers as disaster recovery; there are different disaster recovery options for the VMs, and we have them for all VMs, not just the Data Science VM; a rough sketch of enabling backups follows at the end. [Audience follow-up question, partially inaudible.] Actually I'm not sure, maybe I need to check that feature; I need to check this for the different versions, I didn't test it earlier. Cool, thank you, thank you very much.
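On the automated-backup point above, a rough sketch (vault, policy, and resource names are placeholders; cross-region disaster recovery is configured through the vault's redundancy settings, which this sketch leaves at their defaults):

```bash
# Enable scheduled VM backups through a Recovery Services vault;
# all names are illustrative, and DefaultPolicy is the built-in policy.
az backup vault create \
    --resource-group my-bioc-rg \
    --name bioc-backup-vault \
    --location westus2
az backup protection enable-for-vm \
    --resource-group my-bioc-rg \
    --vault-name bioc-backup-vault \
    --vm my-bioc-dsvm \
    --policy-name DefaultPolicy
```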