Yes, I'll just start. Hi, good evening everyone. I'm Paul, and we'll be doing Docker for reproducible research. What we're going to cover today: first, we'll talk about reproducibility, which is the main reason we're doing this exercise. Second, we'll touch on what Docker is and what containers are, to get everyone up to speed on container technology. After that, we'll dive into the workshop that we posted yesterday on Meetup; if anyone has already tried the steps posted, great, otherwise we'll follow them later. So, reproducibility: code and data should be assembled in such a way that if an analysis runs on my machine, it should also run on another person's machine. Say I have a certain analysis that I run on Windows; for example, I did a visualization using some libraries. What happens currently is that if we want to replicate that analysis and visualization on the nonprofit's premises, and the nonprofit doesn't have the libraries or software that I have, I would need to install that software and those libraries before I can replicate the analysis. (Sorry, I think you're going to extend the screen. Extended screen. Okay, it's a duplicate screen, so what I can see here is the same. Good. Where was I? Sorry.) At the nonprofit's premises, or the nonprofit's office, they should be able to do exactly what I did and come up with the same results, so that when someone asks how I came up with a result, they can try it for themselves and get the same result.
Similarly, when we onboard new volunteers at DataKind and a machine is not set up for that particular project yet, we should be able to bring the new team member in easily, even if someone joins two years down the road. But currently what we do is set up the machine and install the needed packages, and sometimes it doesn't work because new versions have come in, so we have to tackle that. We still value reproducibility, so even though it's quite challenging, we still do it, because we want to be able to replicate what we analyze on another person's machine. This is one of the tenets not only of data science but of science in general: to be able to reproduce our research and our results. Currently, to help with that, we try as much as possible to use open tools for reproducibility, and we also version our code and scripts, our Python scripts, and put them on GitHub, so that when we try to replicate or rerun an analysis, we have the versioned code in GitHub to reproduce the results. But that is only versioning from the code-and-scripts perspective. What we're experimenting with now, moving forward, is versioning not only the code and scripts but also the environments where our code runs. Apart from the scripts we're running, we want to version the setup: the libraries that were used, the needed software, and the environment where the analysis is run. So: currently code and scripts, but moving forward we want to explore versioning the environments as well, so we can address reproducibility better. Any questions so far on this aspect? No? So, we have Docker installed; we just copied the files over, which I'm guessing are the containers or something?
There's currently a USB stick being passed around, so just copy everything for now; some of the installers are for Windows, and when we come to the installation step in the workshop you can install whatever is specific to your operating system. Apart from the three installers there are two big tar files? Correct, I'll discuss those later. Since they take a while to copy, should we copy them as well? As we progress through the talk we have time, so just proceed and copy them; they're a backup in case pulling the images from the internet is very slow, in which case we can use those instead. Thank you. Okay, moving on. Docker is the technology, or platform, that we'll use to help us version these kinds of environments: whatever I run on my Windows machine, the analysis and the libraries I'm using, someone else will be able to run on their Mac or their Linux box as well. So what is Docker? Docker is a platform that enables us to do exactly that. However the solution is built, whether I'm running on Windows, Mac, or Linux, we can take that same solution and run it on any other volunteer's machine. There are two kinds of Docker containers: Docker for Linux and Docker for Windows. The one we're currently using is the one for Linux; that's why you see the penguin here, which represents the Linux base image. So it runs on a Linux image, and we build our applications, such as Jupyter and R, on top. Why Jupyter and R? This is only the first wave, because we want our volunteers to at least get a grounding in the container concept.
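What's described here, a Linux base image with Jupyter and our libraries layered on top, is exactly what a Dockerfile captures. A hypothetical sketch only: the base image name follows the jupyter/minimal-notebook image mentioned later in the talk, and the library list is illustrative, not the project's actual build file.

```dockerfile
# Hypothetical sketch: a community Jupyter base image (itself built on Linux)
FROM jupyter/minimal-notebook

# Layer the project's libraries on top; a real image would pin exact
# versions so every volunteer gets an identical environment.
RUN pip install pandas plotly
```

Versioning the environment then amounts to versioning this file plus the image it builds.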
So that in the future, when we have the second or third wave, we might use RStudio or other platforms, like data products; but for now we want to be in a state where our volunteers know docker pull, docker load, and those kinds of basic concepts, so that it will be easier for us to push and share tools across volunteers and nonprofits. Any questions so far on the Docker terminology? For the basic Docker concepts we usually use, there are at least three. There are more advanced ones, but we'll tackle those in the workshop or in a later deep dive. First, we want you to know what a container registry is. A container registry is a server that hosts our Docker images. Say I have a Python Jupyter image; we host it in the cloud. Currently we're using Quay.io, although Docker Hub is another alternative. Once it's in the cloud, we have, say, a version of our container with all the libraries, with Python packages like pandas and Plotly. We package them into the container image and put it in the registry; say the version is 1.0.1. Anyone who downloads or pulls it to their machine will be able to use that tool with that set of libraries. Later on, say a volunteer used it for their analysis; when we go to the nonprofit's site, we just need Docker there, pull the same image to the nonprofit's site, and replicate whatever was analyzed. So it's easier for us to do the versioning and pulling while we have it in the cloud.
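As a concrete sketch of this pull-from-registry flow: the registry, organization, image name, and version below are illustrative stand-ins for whatever your project announces, and the docker calls only run if a Docker client is installed.

```shell
# Pull a specific, pinned version of a workshop image from the registry.
# The names here are illustrative, not guaranteed to exist.
IMAGE="quay.io/dksg/python3-notebook"
VERSION="1.0.0"

# Only attempt the pull if a Docker client is available on this machine.
if command -v docker >/dev/null 2>&1; then
    docker pull "${IMAGE}:${VERSION}"   # download the tagged image
    docker images                       # the image should now appear here
fi
```

Pinning the version in the tag is what lets someone else pull the exact same environment later.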
So the hosting is called the container registry; the file, or file system, that we pull is called the image; and once we have the image pulled onto our machines (the one we're currently copying offline in case we don't have an internet connection), the actual running instance is called the container. So: the registry sits in the cloud, the image is the thing we pull, and when we run it, that's a container. Yup. Cool. Feel free to interrupt me if there's any question; people who are not here will be watching this as a recording, so questions help them as well. Moving forward, we can go through the workshop now. The workshop is the same as what we posted on the Meetup yesterday: it contains the instructions for installation and then shows you how to run the Jupyter notebooks we've posted. What I'm going to do now is head over to GitHub; the link is there, we have a TinyURL for it, and if you go to meetup.com you'll find the link there as well. On GitHub, the first section is installation. In our GitHub repository, called contain-yourself, we cover three operating systems: Windows, Linux, and macOS. For those at home, you can download from the internet, but for those here, if downloading takes too long, we are passing USB sticks around so you can copy the corresponding installer for your machine. For Windows we have two installers: one is Docker Toolbox, and the other is Docker for Windows.
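Since the offline USB copies were just mentioned: when bandwidth is poor, the pull step is replaced by a load-plus-tag flow, which is demonstrated step by step later in the session. A hedged sketch of it, assuming a file-naming convention in which slashes in the repository name are replaced with underscores (the exact convention and names here are illustrative):

```shell
# Offline path: load the image from a tar file, then restore its
# repository name and version tag. File name is illustrative.
FILE="quay.io_dksg_python3-notebook_1.0.0.tar"

# Recover the repository name and tag from the file name.
BASE="${FILE%.tar}"     # strip the extension
TAG="${BASE##*_}"       # the last "_"-separated field is the version
REPO="${BASE%_*}"       # everything before it is the repository...
REPO="${REPO//_//}"     # ...with the slashes restored

if command -v docker >/dev/null 2>&1; then
    docker load --input "${FILE}"
    # A freshly loaded image shows up as <none>:<none>; find it and tag it.
    LOADED_ID="$(docker images --filter dangling=true -q | head -n 1)"
    docker tag "${LOADED_ID}" "${REPO}:${TAG}"
fi
```

The point of the convention is that the file name alone tells you exactly what to tag the loaded image as.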
Docker for Windows is the more official version, if your machine can support it, because it has more advanced features; I think you need at least Windows 10 Pro, and Hyper-V. Machines like this one of mine don't have Hyper-V, so they don't meet the requirements, and I had to use the other installer, Docker Toolbox. I think it's on the same USB stick as well, but if you go to our GitHub you have the link there too; "Toolbox install Windows" takes you to the Docker Toolbox download. I'm not going to install it on my machine since I already have it; just follow the instructions in the link and get it installed. For Linux, I think you use apt-get to install Docker, and for Mac you also have various options, but you can use Homebrew: brew cask install docker. Anyone using a Mac here? So that's the Docker installation. Anyone who hasn't installed Docker, or is in the process of installing? Anyone who already has Docker on their machine? Okay, so we'll proceed.

The next thing we'll try is using Python notebooks. First, we can try pulling from the registry we discussed, which is where we host our images: Quay.io. That's why you see quay.io here. To pull the image you just use this command; let me explain it a bit. This "dksg" is our organization in the registry, identifying these as the images we host as DataKind SG, so in the future, when we have other images, you'll find them under dksg as well. For now, the image we want to pull is the Python 3 notebook, and you'll notice that after the colon we have a version number. The idea of the version number is this: say we already have a baseline Python notebook now, and later, during the DataDive, I really want to use a new package that is not currently in the image. How it goes is that we'll be having a library custodian; I'll just touch briefly on adding new libraries. We can have the project curator add the library if it's really needed for the project, and once it's added and we submit to GitHub, it automatically builds a new version on Quay.io and we get a new version number. If the current version is 1.0.0, the next version is incremented to 1.0.1. When you pull, we encourage using the version tag. You can pull without a version, but we want this notion of reproducibility and easy access to information: when we ran this Jupyter notebook, which version of the container did we run it from? People who are going to replicate your work will then know which version to pull on their side.

When you pull, it downloads the image, and once that's done, if you type docker images, it displays all the images you have; after pulling from Quay.io you should see something like this, quay.io/dksg/python3-notebook, with the corresponding tag 1.0.0 to indicate the version. So that's pulling from the registry. But say you don't have good bandwidth at home; the Python image is currently around 2GB, so you need decent bandwidth to pull it. We're distributing USB sticks that contain the image, so if we can't connect to the internet we can load from a file, with the command docker load: docker load with the input flag, then the file from the USB stick. You'll notice there are two very big files there with these file names. When we name the files, we follow a convention as well, so that we know what to name the image later when we load it: quay.io/dksg/python3-notebook, plus the version number. So we have the name as well as the version when we load.

How does this go? Because I already loaded the image, I'm going to remove it for now. To remove it I have docker rmi, the command to remove an image; what I want to remove is this python3-notebook. You'll notice we have the repository, tag, and image ID; I can just copy the unique part of the image ID, and if I run this, it removes the python3-notebook. Okay, it's deleted. If I do docker images again (the command to display all the images on your machine), you can see we've removed the Python notebook. What I'll do now is go to the directory where I placed the file; you can find this on your USB sticks as well. I'll load the python3-notebook file. I just need to cd to that location, and as you can see, the notebook tar file is here. How do we load the image into Docker? I'll just copy this; since it's the same file name, it should be the same command. Once I start the load, it'll take a while. (Anyone who already has the USB stick copied to their machine, or anyone who still needs one? A question: there's a step in the loading-from-file instructions, step 3, sorry, step 4, that says docker tag with the loaded image. Ah, yes, good question; we'll come to that in a short while.)

So they're asking about another command called docker tag: where do we use it, and why do we need it? It's because what we just ran was docker load with the file name of the image. When we load with that command, as you'll see in a moment, by default the result won't have our repository name and it won't have the tag; after
the load, the name will be empty. Let me just show you. We've loaded the image, so let's see what was loaded with docker images: you can see "none, none" and the corresponding image ID. What docker tag enables us to do is give the image we just loaded its proper name and version, and that's where the naming convention in our file name comes in handy. What I can do now is docker tag, then the loaded image, then the corresponding repository name and tag. So: docker tag, then the loaded image, which is 00d9 (enough on its own, because it's unique), and then the repository name and tag. I'll remove the file extension; the reason I put a slash placeholder in the file name is so that I can replace it later with an actual slash. Why do we have this naming convention for the repository name? Later, when we discuss advanced topics, it needs to be in this format so that we can push to the registry; for now we just follow the convention: quay.io, which is our registry; dksg, our organization in the registry; then the container image, python3-notebook; and then the version, which we use as the tag. Once we have this and do docker images again, we see the proper repository name and tag. Are we good with the tagging? Tagging is good. So there are two ways of getting the image: one is to pull from the Docker registry, the other is to load from a file. Either way, once we have it in Docker, we can see the image with the correct version.

The next thing we can do is try running the image: we want to spawn the Jupyter application and do our analysis. How we do that is here: running a Jupyter notebook from the pulled or loaded image. Let me explain this a bit. In the run command, this -it is something we usually use if we want an interactive terminal; i stands for interactive, t for terminal. But since we just want to run an application like Jupyter, we don't need it, so I'll skip it for now. Next, -p is the port: the left side is the port on the host machine, which we'll hit from the browser later, and the right side is the port inside the Docker container. Say the application in the container uses port 900 and we map it to 800 on our local machine; then we access it on 800, something like localhost:800. Left side: host machine; right side: inside the container. Next, the volume. Say we have an existing Python notebook created by a fellow volunteer and we want to run it in our Dockerized Jupyter to view their analysis. What we do is share a folder, or directory, so that when we launch Docker, we can see the files we're loading. -v is for volume: the left side of the colon is your local or host machine's directory, where you put your Python files for analysis and data files like CSVs; the right side is /home/jovyan/work, because we based our Jupyter image on an existing image called minimal-notebook, which uses that folder. So whatever directory on my host machine I map, say my /Users/paul directory, gets mapped to /home/jovyan/work, and whatever is in /home/jovyan/work is what we'll see when we launch Jupyter in a while. After that comes the image of the container that we'll be running.

So I'll try to run the container: docker run, then the ports; I'll map my 8888 port to the 8888 of the container. Then I'll map a volume as well. One thing to note: this may differ by operating system and by installer. Specifically for Docker Toolbox, I think by default it needs you to share a folder under the C:\Users directory; when we tried sharing a different one, it didn't work unless you reconfigure it, because the default it shares is C:\Users. For Mac, I think you can share other folders, or I think /Users also works. So: for Linux (Ubuntu) you can map any folder; for Windows with Docker Toolbox, there's a special folder, which is C:\Users. If there are questions, especially from those viewing this from home, you can post them on Meetup, or maybe in GitHub as well.

Let me move on; we're doing a run. I'll map my C:\Users home folder, not the datalearn folder, just home, and then the Python image, 00d9. If I run that... it's giving an error, because I already ran it a while ago, so that port on my computer, 8888, is already in use. What do I need to do? Since that port is taken, I'll assign another port; the left side, remember, is my host machine, so if I can't use 8888 I'll use 8889, for example, and run it. Once you have this, you can copy and paste the link it prints, as the instructions say, and open it in your browser. But we changed our host port to 8889, which means I can't access what I just ran on 8888 anymore; I need to use 8889, because that's my host port, even though the printout tells me 8888. That printout is in the context of Docker: the container is one thing, and accessing from my host machine is another, and from the host it works through whatever I mapped it to, which just now was 8889.

Another point, and this also differs by operating system: the reason I can't just use localhost to open the page. If you're running Docker for Windows, localhost works, but if you're running Docker Toolbox, there is a specific IP address your Docker runs on. How do we find out what IP address to use with Docker Toolbox for Windows? I'll spawn another CLI. Here it says Docker is configured to use the default machine with this IP, so I'll take that and replace localhost in my browser with it. And notice that what shows up by default is whatever I mapped a while ago in the volume: whatever I have in that folder, I can see here. The image we just loaded is a Python notebook, so if I run a Python example, it should have the needed libraries; it has a set pre-loaded, like NumPy and pandas. For ones we don't have that you need in the DataDive later, you can inform the library custodian, or any one of us, so we can add the corresponding library, and next time you pull, it will be included. So here I can just run a Jupyter notebook and view the results. Before I move on: this notebook is already in my local folder, so viewers at home won't have this file, so I'll follow what we have in the instructions on GitHub. There, once you have the notebook running and want to try things out, we've added a link to Plotly, which is one of the visualization libraries for R, Python, and some other languages. We're just
going to try their tutorial a bit. What I'll do here in our notebook is create a new one. I'll go to the existing tab and create a new one; on your upper right there's New, then Python 3, and that spawns a new notebook. On top you have "Untitled"; if you want to name the notebook, say "datalearn analysis", just click OK. Since we mapped our folder to this particular Jupyter instance a while ago, whatever I create should be in that folder, right? So if, say, our container crashes, we still have the file; all we need to do is run the container again, mount the same directory, and we have our analysis back. I'll proceed and try some Jupyter things. Jupyter lets us document our code as well as run it, in whatever language the current notebook supports; the one we just loaded is a Python notebook, so we can run Python instructions in it. For the next step, we can write either code, Markdown, or the other cell options, but for now I'll show you Markdown. Whenever we create an analysis notebook, hopefully at the DataDive, we encourage you to also write down the version of the image, because later, when we share the notebook with another person and they want to run it, they'll want to pull the correct container version. So I'll use Markdown to do that: I'll put a title, "datalearn", and another heading, "image version", and under it the version we used, something like this. If we have this information in the notebook (I'll just press Shift-Enter), anyone who uses this notebook will know which Docker image, and which version, to pull, straight from the information in the notebook itself.
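The run command walked through above can be sketched like this. The folder and image names are illustrative; /home/jovyan/work is the work directory assumed from the jupyter/minimal-notebook base image, and the command is echoed rather than executed so it is safe to inspect first.

```shell
# Sketch of the docker run invocation from the demo.
HOST_PORT=8889          # port you open in your browser (8888 was already taken)
CONTAINER_PORT=8888     # port Jupyter listens on inside the container
NOTEBOOK_DIR="$HOME"    # host folder to share with the container

# Echoed rather than executed; drop the `echo` to actually start it.
echo docker run \
    -p "${HOST_PORT}:${CONTAINER_PORT}" \
    -v "${NOTEBOOK_DIR}:/home/jovyan/work" \
    quay.io/dksg/python3-notebook:1.0.0
```

With this mapping, the notebook server is reached at http://localhost:8889 (or at the docker-machine IP instead of localhost, if you are on Docker Toolbox).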
So we can put in Markdown instructions, write our analysis paragraphs explaining what we're trying to analyze, and do visualizations with Plotly and the like. Let me put in some Python commands; if you want to run a Python command, you need to change the cell back to Code. Why Python? Because this is a Python 3 image: the code you write runs in the context of the language the image was built for. If it's built on R, you cannot run Julia on it, and if it's running Python 3, Python 2 commands might not work. So: hello world. And just to use some graphs, we have graphing examples; let's load some of the libraries first. This image already includes pandas and Plotly, so these imports should load, if we didn't miss any library. That's gone well. Next, import data; we're connected to the internet, right? So this read_csv should work; it reads the CSV. We don't have an account for... I'll just keep the current one and move on; this is just to show the visualization. I'll skip Plotly for now, I think it needs authentication or something; I'll copy from the other notebook and just load it, same thing. We'll share the notebook on the Meetup, so people who want to try what we're doing here can try it as well; I just wanted to show it through to getting the graph and the table displayed. We'll share the notebook once it's available. That's it for Python.

The objective is that our volunteers can pull an image, run the image, and do their analysis on it. We have the same sort of instructions for R, for using the R notebook. It's the same thing: we pull, but a different image; we pull the R notebook instead of python3-notebook, with the corresponding version, and after pulling we should have it available in docker images. Loading the R image from a file is the same step as for Python, and once you load it, it won't contain any repository name or tag; that's why we need docker tag, to name it as the R notebook with tag version 1.0.1. Then, similarly, loading from file is the same, just a different image; and running the Jupyter notebook from the pulled image, if you're running R, is the same instruction, the only difference being the image ID: instead of 00d9, which is our Python one, we use ac9. These two are just Jupyter notebooks, and Jupyter is only one of the applications we can run in Docker. In the future we might run other applications, maybe RStudio or a data product of some sort, but at this stage we just want our volunteers to know how to docker pull, docker load, docker tag, and to get at least Jupyter running on their machines.

I think that's about it with regard to the workshop. Any questions so far? Anyone stuck somewhere in the steps? No? You're able to load, up to running the Jupyter notebook? Which step are you currently stuck on? Okay. By the way, if anyone is stuck, we have mentors around; feel free to raise your hand. It's okay, just a moment. I think that should be okay; do we at least know the problem, so we can share it for those who are viewing? It's basically problems starting up: creating an image and starting it up. Okay, so there's an issue on the loading side or something. Anyway, loading the image should be the same as what we did a while ago: we load from the tar file. For those doing this from home, we could try sharing the image somewhere, but if we share it you'd have to download it anyway, so you might as well just do the docker pull, or use this command if you want to pull the Jupyter notebook image.

For those following from home, I think this is the end of the workshop; let me check if we have anything else. No, apart from the data, that should be okay. If there's any question about the workshop, you can post it on the Meetup as a comment, or if you find any issue, say the instructions in the README are not clear enough (we want to be intentional about the instructions), just submit... there should be an Issues section here... there's no Issues tab, so don't submit an issue here; just post a comment on the Meetup and we'll try to address it accordingly. So, the next event will be in April; that is our DataDive. Hopefully what we learned here we can use there, to run Jupyter notebooks for analysis. Thanks to ThoughtWorks for hosting us for today's DataLearn, and see you all at the DataDive.